Probability Distributionsdistributions
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Lecture 22: Bivariate Normal Distribution Distribution
6.5 Conditional Distributions General Bivariate Normal Let Z1; Z2 ∼ N (0; 1), which we will use to build a general bivariate normal Lecture 22: Bivariate Normal Distribution distribution. 1 1 2 2 f (z1; z2) = exp − (z1 + z2 ) Statistics 104 2π 2 We want to transform these unit normal distributions to have the follow Colin Rundel arbitrary parameters: µX ; µY ; σX ; σY ; ρ April 11, 2012 X = σX Z1 + µX p 2 Y = σY [ρZ1 + 1 − ρ Z2] + µY Statistics 104 (Colin Rundel) Lecture 22 April 11, 2012 1 / 22 6.5 Conditional Distributions 6.5 Conditional Distributions General Bivariate Normal - Marginals General Bivariate Normal - Cov/Corr First, lets examine the marginal distributions of X and Y , Second, we can find Cov(X ; Y ) and ρ(X ; Y ) Cov(X ; Y ) = E [(X − E(X ))(Y − E(Y ))] X = σX Z1 + µX h p i = E (σ Z + µ − µ )(σ [ρZ + 1 − ρ2Z ] + µ − µ ) = σX N (0; 1) + µX X 1 X X Y 1 2 Y Y 2 h p 2 i = N (µX ; σX ) = E (σX Z1)(σY [ρZ1 + 1 − ρ Z2]) h 2 p 2 i = σX σY E ρZ1 + 1 − ρ Z1Z2 p 2 2 Y = σY [ρZ1 + 1 − ρ Z2] + µY = σX σY ρE[Z1 ] p 2 = σX σY ρ = σY [ρN (0; 1) + 1 − ρ N (0; 1)] + µY = σ [N (0; ρ2) + N (0; 1 − ρ2)] + µ Y Y Cov(X ; Y ) ρ(X ; Y ) = = ρ = σY N (0; 1) + µY σX σY 2 = N (µY ; σY ) Statistics 104 (Colin Rundel) Lecture 22 April 11, 2012 2 / 22 Statistics 104 (Colin Rundel) Lecture 22 April 11, 2012 3 / 22 6.5 Conditional Distributions 6.5 Conditional Distributions General Bivariate Normal - RNG Multivariate Change of Variables Consequently, if we want to generate a Bivariate Normal random variable Let X1;:::; Xn have a continuous joint distribution with pdf f defined of S. -
Applied Biostatistics Mean and Standard Deviation the Mean the Median Is Not the Only Measure of Central Value for a Distribution
Health Sciences M.Sc. Programme Applied Biostatistics Mean and Standard Deviation The mean The median is not the only measure of central value for a distribution. Another is the arithmetic mean or average, usually referred to simply as the mean. This is found by taking the sum of the observations and dividing by their number. The mean is often denoted by a little bar over the symbol for the variable, e.g. x . The sample mean has much nicer mathematical properties than the median and is thus more useful for the comparison methods described later. The median is a very useful descriptive statistic, but not much used for other purposes. Median, mean and skewness The sum of the 57 FEV1s is 231.51 and hence the mean is 231.51/57 = 4.06. This is very close to the median, 4.1, so the median is within 1% of the mean. This is not so for the triglyceride data. The median triglyceride is 0.46 but the mean is 0.51, which is higher. The median is 10% away from the mean. If the distribution is symmetrical the sample mean and median will be about the same, but in a skew distribution they will not. If the distribution is skew to the right, as for serum triglyceride, the mean will be greater, if it is skew to the left the median will be greater. This is because the values in the tails affect the mean but not the median. Figure 1 shows the positions of the mean and median on the histogram of triglyceride. -
Startips …A Resource for Survey Researchers
Subscribe Archives StarTips …a resource for survey researchers Share this article: How to Interpret Standard Deviation and Standard Error in Survey Research Standard Deviation and Standard Error are perhaps the two least understood statistics commonly shown in data tables. The following article is intended to explain their meaning and provide additional insight on how they are used in data analysis. Both statistics are typically shown with the mean of a variable, and in a sense, they both speak about the mean. They are often referred to as the "standard deviation of the mean" and the "standard error of the mean." However, they are not interchangeable and represent very different concepts. Standard Deviation Standard Deviation (often abbreviated as "Std Dev" or "SD") provides an indication of how far the individual responses to a question vary or "deviate" from the mean. SD tells the researcher how spread out the responses are -- are they concentrated around the mean, or scattered far & wide? Did all of your respondents rate your product in the middle of your scale, or did some love it and some hate it? Let's say you've asked respondents to rate your product on a series of attributes on a 5-point scale. The mean for a group of ten respondents (labeled 'A' through 'J' below) for "good value for the money" was 3.2 with a SD of 0.4 and the mean for "product reliability" was 3.4 with a SD of 2.1. At first glance (looking at the means only) it would seem that reliability was rated higher than value. -
1. How Different Is the T Distribution from the Normal?
Statistics 101–106 Lecture 7 (20 October 98) c David Pollard Page 1 Read M&M §7.1 and §7.2, ignoring starred parts. Reread M&M §3.2. The eects of estimated variances on normal approximations. t-distributions. Comparison of two means: pooling of estimates of variances, or paired observations. In Lecture 6, when discussing comparison of two Binomial proportions, I was content to estimate unknown variances when calculating statistics that were to be treated as approximately normally distributed. You might have worried about the effect of variability of the estimate. W. S. Gosset (“Student”) considered a similar problem in a very famous 1908 paper, where the role of Student’s t-distribution was first recognized. Gosset discovered that the effect of estimated variances could be described exactly in a simplified problem where n independent observations X1,...,Xn are taken from (, ) = ( + ...+ )/ a normal√ distribution, N . The sample mean, X X1 Xn n has a N(, / n) distribution. The random variable X Z = √ / n 2 2 Phas a standard normal distribution. If we estimate by the sample variance, s = ( )2/( ) i Xi X n 1 , then the resulting statistic, X T = √ s/ n no longer has a normal distribution. It has a t-distribution on n 1 degrees of freedom. Remark. I have written T , instead of the t used by M&M page 505. I find it causes confusion that t refers to both the name of the statistic and the name of its distribution. As you will soon see, the estimation of the variance has the effect of spreading out the distribution a little beyond what it would be if were used. -
Appendix F.1 SAWG SPC Appendices
Appendix F.1 - SAWG SPC Appendices 8-8-06 Page 1 of 36 Appendix 1: Control Charts for Variables Data – classical Shewhart control chart: When plate counts provide estimates of large levels of organisms, the estimated levels (cfu/ml or cfu/g) can be considered as variables data and the classical control chart procedures can be used. Here it is assumed that the probability of a non-detect is virtually zero. For these types of microbiological data, a log base 10 transformation is used to remove the correlation between means and variances that have been observed often for these types of data and to make the distribution of the output variable used for tracking the process more symmetric than the measured count data1. There are several control charts that may be used to control variables type data. Some of these charts are: the Xi and MR, (Individual and moving range) X and R, (Average and Range), CUSUM, (Cumulative Sum) and X and s, (Average and Standard Deviation). This example includes the Xi and MR charts. The Xi chart just involves plotting the individual results over time. The MR chart involves a slightly more complicated calculation involving taking the difference between the present sample result, Xi and the previous sample result. Xi-1. Thus, the points that are plotted are: MRi = Xi – Xi-1, for values of i = 2, …, n. These charts were chosen to be shown here because they are easy to construct and are common charts used to monitor processes for which control with respect to levels of microbiological organisms is desired. -
1 One Parameter Exponential Families
1 One parameter exponential families The world of exponential families bridges the gap between the Gaussian family and general dis- tributions. Many properties of Gaussians carry through to exponential families in a fairly precise sense. • In the Gaussian world, there exact small sample distributional results (i.e. t, F , χ2). • In the exponential family world, there are approximate distributional results (i.e. deviance tests). • In the general setting, we can only appeal to asymptotics. A one-parameter exponential family, F is a one-parameter family of distributions of the form Pη(dx) = exp (η · t(x) − Λ(η)) P0(dx) for some probability measure P0. The parameter η is called the natural or canonical parameter and the function Λ is called the cumulant generating function, and is simply the normalization needed to make dPη fη(x) = (x) = exp (η · t(x) − Λ(η)) dP0 a proper probability density. The random variable t(X) is the sufficient statistic of the exponential family. Note that P0 does not have to be a distribution on R, but these are of course the simplest examples. 1.0.1 A first example: Gaussian with linear sufficient statistic Consider the standard normal distribution Z e−z2=2 P0(A) = p dz A 2π and let t(x) = x. Then, the exponential family is eη·x−x2=2 Pη(dx) / p 2π and we see that Λ(η) = η2=2: eta= np.linspace(-2,2,101) CGF= eta**2/2. plt.plot(eta, CGF) A= plt.gca() A.set_xlabel(r'$\eta$', size=20) A.set_ylabel(r'$\Lambda(\eta)$', size=20) f= plt.gcf() 1 Thus, the exponential family in this setting is the collection F = fN(η; 1) : η 2 Rg : d 1.0.2 Normal with quadratic sufficient statistic on R d As a second example, take P0 = N(0;Id×d), i.e. -
Problems with OLS Autocorrelation
Problems with OLS Considering : Yi = α + βXi + ui we assume Eui = 0 2 = σ2 = σ2 E ui or var ui Euiuj = 0orcovui,uj = 0 We have seen that we have to make very specific assumptions about ui in order to get OLS estimates with the desirable properties. If these assumptions don’t hold than the OLS estimators are not necessarily BLU. We can respond to such problems by changing specification and/or changing the method of estimation. First we consider the problems that might occur and what they imply. In all of these we are basically looking at the residuals to see if they are random. ● 1. The errors are serially dependent ⇒ autocorrelation/serial correlation. 2. The error variances are not constant ⇒ heteroscedasticity 3. In multivariate analysis two or more of the independent variables are closely correlated ⇒ multicollinearity 4. The function is non-linear 5. There are problems of outliers or extreme values -but what are outliers? 6. There are problems of missing variables ⇒can lead to missing variable bias Of course these problems do not have to come separately, nor are they likely to ● Note that in terms of significance things may look OK and even the R2the regression may not look that bad. ● Really want to be able to identify a misleading regression that you may take seriously when you should not. ● The tests in Microfit cover many of the above concerns, but you should always plot the residuals and look at them. Autocorrelation This implies that taking the time series regression Yt = α + βXt + ut but in this case there is some relation between the error terms across observations. -
Approaching Mean-Variance Efficiency for Large Portfolios
Approaching Mean-Variance Efficiency for Large Portfolios ∗ Mengmeng Ao† Yingying Li‡ Xinghua Zheng§ First Draft: October 6, 2014 This Draft: November 11, 2017 Abstract This paper studies the large dimensional Markowitz optimization problem. Given any risk constraint level, we introduce a new approach for estimating the optimal portfolio, which is developed through a novel unconstrained regression representation of the mean-variance optimization problem, combined with high-dimensional sparse regression methods. Our estimated portfolio, under a mild sparsity assumption, asymptotically achieves mean-variance efficiency and meanwhile effectively controls the risk. To the best of our knowledge, this is the first time that these two goals can be simultaneously achieved for large portfolios. The superior properties of our approach are demonstrated via comprehensive simulation and empirical studies. Keywords: Markowitz optimization; Large portfolio selection; Unconstrained regression, LASSO; Sharpe ratio ∗Research partially supported by the RGC grants GRF16305315, GRF 16502014 and GRF 16518716 of the HKSAR, and The Fundamental Research Funds for the Central Universities 20720171073. †Wang Yanan Institute for Studies in Economics & Department of Finance, School of Economics, Xiamen University, China. [email protected] ‡Hong Kong University of Science and Technology, HKSAR. [email protected] §Hong Kong University of Science and Technology, HKSAR. [email protected] 1 1 INTRODUCTION 1.1 Markowitz Optimization Enigma The groundbreaking mean-variance portfolio theory proposed by Markowitz (1952) contin- ues to play significant roles in research and practice. The optimal mean-variance portfolio has a simple explicit expression1 that only depends on two population characteristics, the mean and the covariance matrix of asset returns. Under the ideal situation when the underlying mean and covariance matrix are known, mean-variance investors can easily compute the optimal portfolio weights based on their preferred level of risk. -
Characteristics and Statistics of Digital Remote Sensing Imagery (1)
Characteristics and statistics of digital remote sensing imagery (1) Digital Images: 1 Digital Image • With raster data structure, each image is treated as an array of values of the pixels. • Image data is organized as rows and columns (or lines and pixels) start from the upper left corner of the image. • Each pixel (picture element) is treated as a separate unite. Statistics of Digital Images Help: • Look at the frequency of occurrence of individual brightness values in the image displayed • View individual pixel brightness values at specific locations or within a geographic area; • Compute univariate descriptive statistics to determine if there are unusual anomalies in the image data; and • Compute multivariate statistics to determine the amount of between-band correlation (e.g., to identify redundancy). 2 Statistics of Digital Images It is necessary to calculate fundamental univariate and multivariate statistics of the multispectral remote sensor data. This involves identification and calculation of – maximum and minimum value –the range, mean, standard deviation – between-band variance-covariance matrix – correlation matrix, and – frequencies of brightness values The results of the above can be used to produce histograms. Such statistics provide information necessary for processing and analyzing remote sensing data. A “population” is an infinite or finite set of elements. A “sample” is a subset of the elements taken from a population used to make inferences about certain characteristics of the population. (e.g., training signatures) 3 Large samples drawn randomly from natural populations usually produce a symmetrical frequency distribution. Most values are clustered around the central value, and the frequency of occurrence declines away from this central point. -
Basic Econometrics / Statistics Statistical Distributions: Normal, T, Chi-Sq, & F
Basic Econometrics / Statistics Statistical Distributions: Normal, T, Chi-Sq, & F Course : Basic Econometrics : HC43 / Statistics B.A. Hons Economics, Semester IV/ Semester III Delhi University Course Instructor: Siddharth Rathore Assistant Professor Economics Department, Gargi College Siddharth Rathore guj75845_appC.qxd 4/16/09 12:41 PM Page 461 APPENDIX C SOME IMPORTANT PROBABILITY DISTRIBUTIONS In Appendix B we noted that a random variable (r.v.) can be described by a few characteristics, or moments, of its probability function (PDF or PMF), such as the expected value and variance. This, however, presumes that we know the PDF of that r.v., which is a tall order since there are all kinds of random variables. In practice, however, some random variables occur so frequently that statisticians have determined their PDFs and documented their properties. For our purpose, we will consider only those PDFs that are of direct interest to us. But keep in mind that there are several other PDFs that statisticians have studied which can be found in any standard statistics textbook. In this appendix we will discuss the following four probability distributions: 1. The normal distribution 2. The t distribution 3. The chi-square (2 ) distribution 4. The F distribution These probability distributions are important in their own right, but for our purposes they are especially important because they help us to find out the probability distributions of estimators (or statistics), such as the sample mean and sample variance. Recall that estimators are random variables. Equipped with that knowledge, we will be able to draw inferences about their true population values. -
The Bootstrap and Jackknife
The Bootstrap and Jackknife Summer 2017 Summer Institutes 249 Bootstrap & Jackknife Motivation In scientific research • Interest often focuses upon the estimation of some unknown parameter, θ. The parameter θ can represent for example, mean weight of a certain strain of mice, heritability index, a genetic component of variation, a mutation rate, etc. • Two key questions need to be addressed: 1. How do we estimate θ ? 2. Given an estimator for θ , how do we estimate its precision/accuracy? • We assume Question 1 can be reasonably well specified by the researcher • Question 2, for our purposes, will be addressed via the estimation of the estimator’s standard error Summer 2017 Summer Institutes 250 What is a standard error? Suppose we want to estimate a parameter theta (eg. the mean/median/squared-log-mode) of a distribution • Our sample is random, so… • Any function of our sample is random, so... • Our estimate, theta-hat, is random! So... • If we collected a new sample, we’d get a new estimate. Same for another sample, and another... So • Our estimate has a distribution! It’s called a sampling distribution! The standard deviation of that distribution is the standard error Summer 2017 Summer Institutes 251 Bootstrap Motivation Challenges • Answering Question 2, even for relatively simple estimators (e.g., ratios and other non-linear functions of estimators) can be quite challenging • Solutions to most estimators are mathematically intractable or too complicated to develop (with or without advanced training in statistical inference) • However • Great strides in computing, particularly in the last 25 years, have made computational intensive calculations feasible. -
Linear Regression
eesc BC 3017 statistics notes 1 LINEAR REGRESSION Systematic var iation in the true value Up to now, wehav e been thinking about measurement as sampling of values from an ensemble of all possible outcomes in order to estimate the true value (which would, according to our previous discussion, be well approximated by the mean of a very large sample). Givenasample of outcomes, we have sometimes checked the hypothesis that it is a random sample from some ensemble of outcomes, by plotting the data points against some other variable, such as ordinal position. Under the hypothesis of random selection, no clear trend should appear.Howev er, the contrary case, where one finds a clear trend, is very important. Aclear trend can be a discovery,rather than a nuisance! Whether it is adiscovery or a nuisance (or both) depends on what one finds out about the reasons underlying the trend. In either case one must be prepared to deal with trends in analyzing data. Figure 2.1 (a) shows a plot of (hypothetical) data in which there is a very clear trend. The yaxis scales concentration of coliform bacteria sampled from rivers in various regions (units are colonies per liter). The x axis is a hypothetical indexofregional urbanization, ranging from 1 to 10. The hypothetical data consist of 6 different measurements at each levelofurbanization. The mean of each set of 6 measurements givesarough estimate of the true value for coliform bacteria concentration for rivers in a region with that urbanization level. The jagged dark line drawn on the graph connects these estimates of true value and makes the trend quite clear: more extensive urbanization is associated with higher true values of bacteria concentration.