Purposes of Data Analysis Parameters and Statistics Variables and Samples
Total Page:16
File Type:pdf, Size:1020Kb
Purposes of Data Analysis Part 1: Probability Distributions True Distributions or Relationships in the Earths System Probability Distribution Normal Distribution Organizing Data Student-t Distribution Sampling Chi Square Distribution Find Relationships among Data F Distribution Weather Forecasts, Significance Tests Test Significance of the Results Physical Understanding, …. ESS210B ESS210B Prof. Jin-Yi Yu Prof. Jin-Yi Yu Parameters and Statistics Variables and Samples Random Variable: A variable whose values occur at Parameters: Numbers that describe a population. For random, following a probability distribution. example, the population mean (µ)and standard deviation (σ). Observation: When the random variable actually attains a value, that value is called an observation (of the Statistics: Numbers that are calculated from a sample. variable). A given population has only one value of a particular parameter, but a particular statistic calculated from different Sample: A collection of several observations is called samples of the population has values that are generally sample. If the observations are generated in a random different, both from each other, and from the parameter that fashion with no bias, that sample is known as a random the statistics is designed to estimate. sample. The science of statistics is concerned with how to draw By observing the distribution of values in a random sample, reliable, quantifiable inferences from statistics about we can draw conclusions about the underlying probability parameters. distribution. (from Middleton 2000) ESS210B ESS210B Prof. Jin-Yi Yu Prof. Jin-Yi Yu 1 Probability Distribution Example continuous probability distribution discrete probability distribution If you throw a die, there are six possible outcomes: the PDF numbers 1, 2, 3, 4, 5 or 6. This is an example of a random variable (the dice value), a variable whose possible values occur at random. When the random variable actually attains a value (such as when the dice is actually thrown) that value is called an The pattern of probabilities for a set of events is called a observation. probability distribution. If you throw the die 10 times, then you have a random (1) The probability of each event or combinations of events must range from sample which consists of 10 observations. 0 to 1. (2) The sum of the probability of all possible events must be equal too 1. ESS210B ESS210B Prof. Jin-Yi Yu Prof. Jin-Yi Yu Probability Density Function Cumulative Distribution Function P = the probability that a randomly selected value of a variable X falls between a and b. The cumulative distribution function F(x) is defined as the probability f(x) = the probability density function. that a variable assumes a value less than x. The probability function has to be integrated over distinct limits to obtain a probability. The cumulative distribution function is often used to assist in The probability for X to have a particular value is ZERO. calculating probability (will show later). Two important properties of the probability density function: The following relation between F and P is essential for probability (1) f(x) ≥ 0 for all x within the domain of f. calculation: (2) ESS210B ESS210B Prof. Jin-Yi Yu Prof. Jin-Yi Yu 2 Normal Distribution Standard Normal Distribution The standard normal distribution has a mean of 0 and a standard f: probability density function mean deviation of 1. µ: mean of the population This probability distribution is particularly useful as it can represent σ: standard deviation of the population any normal distribution, whatever its mean and standard deviation. The normal distribution is one of the most important distribution in Using the following transformation, a normal distribution of variable X geophysics. Most geophysical variables (such as wind, temperature, can be converted to the standard normal distribution of variable Z: pressure, etc.) are distributed normally about their means. Z = ( X - µ ) / σ ESS210B ESS210B Prof. Jin-Yi Yu Prof. Jin-Yi Yu Transformations How to Use Standard Normal Distribution It can be shown that any frequency function can be transformed in to a frequency function of given form by a suitable transformation or functional relationship. For example, the original data follows some complicated skewed distribution, we may want to transform this distribution into a known distribution (such as the normal distribution) whose theory and property are well known. Most geoscience variables are distributed normally about their mean or can be transformed in such a way that they become normally distributed. Example 1: What is the probability that a value of Z is greater than 0.75? The normal distribution is, therefore, one of the most important distribution in geoscience data analysis. Answer: P(Z≥0.75) = 1-P(Z≤0.75)=1-0.7734=0.2266 ESS210B ESS210B Prof. Jin-Yi Yu Prof. Jin-Yi Yu 3 Another Example Example 1 Given a normal distribution with µ = 50 and σ = 10, find the probability that X assumes a value between 45 and 62. negative value The Z values corresponding to x1 = 45 and x2 = 62 are Example 2: What is the probability that Z lies between the limits Z1=-0.60 and Z2=0.75? Answer: Therefore, P(45 < X < 62) = P(-0.5 < Z < 1.2) P(Z ≤-0.60) = 1 − P(Z < 0.60) Î P(Z >-0.60)=1−0.7257=0.2743 = P(Z <1.2) – P(Z<-0.5) P(-0.60 ≤Z ≤0.75) = P(Z ≤0.75) − P(Z ≤-0.60) = 0.8849 – 0.3085 = 0.7734 − 0.2743 = 0.5764 ESS210B ESS210B = 0.4991 Prof. Jin-Yi Yu Prof. Jin-Yi Yu Example 2 Probability of Normal Distribution An electrical firm manufactures light bulbs that have a length of life that is normally distributed with mean equal to 800 hours and a standard deviation of 40 hours. Find the possibility that a bulb burns between 778 and 834 hours. The Z values corresponding to x1 = 778 and x2 = 834 are z1 = (778 – 800) / 40 = -0.55 z2 = (834 – 800) / 40 =0.85 Therefore, P(778 < X < 834) = P(-0.55 < Z < 0.85) = P(Z <0.85) – P(Z<-0.55) = 0.8849 – 0.2912 = 0.5111 ESS210B ESS210B Prof. Jin-Yi Yu Prof. Jin-Yi Yu 4 Probability of Normal Distribution Application in Operational Climatology 0.53 -1.28 -0.53 Warmer than 9 out of 10 years 1.28 There is only a 4.55% probability that a normally distributed variable will fall more than 2 standard deviations away from its mean. This figure shows average temperatures for January 1989 over the US, This is the two-tailed probability. The probability that a normal variable expressed as quantiles of the local Normal distributions. will exceed its mean by more then 2 σ is only half of that, 2.275%. Different values of µ and σ have been estimated for each location. ESS210B ESS210B Prof. Jin-Yi Yu Prof. Jin-Yi Yu How to Estimate µ and σ Sampling Distribution In order to use the normal distribution, we need to know Is the sample mean close to the population mean? the mean and standard deviation of the population To answer this question, we need to know the probability ÎBut they are impossible to know in most geoscience distribution of the sample mean. applications, because most geoscience populations are infinite. We can obtain the distribution by repeatedly drawing samples from a population and find out the frequency ÎWe have to estimate µ and σ from samples. distribution. and ESS210B ESS210B Prof. Jin-Yi Yu Prof. Jin-Yi Yu 5 Example Distribution of Sample Average If we repeatedly draw N observations from a standard normal distribution (µ=0 and σ=1) and calculate the sample mean, what is the distribution of the sample mean? What did we learn from this example? ÎIf a sample is composed of N random observations from a normal distribution with mean µ and standard deviation σ, then the distribution of the sample average will be a normal distribution with mean µ but standard deviation σ/sqrt(N). ESS210B ESS210B (from Berk & Carey’s book) Prof. Jin-Yi Yu Prof. Jin-Yi Yu Standard Error Distribution of Sample Means One example of how the normal distribution can be used The standard deviation of the probability distribution of for data that are “non-normally” distributed is to determine the sample mean (X) is also referred as the “standard the distribution of sample means. error” of X. Is the sample mean close to the population mean? To answer this question, we need to know the probability distribution of the sample mean. We can obtain the distribution by repeatedly drawing samples from a population and find out the frequency distribution. ESS210B ESS210B Prof. Jin-Yi Yu Prof. Jin-Yi Yu 6 An Example of Normal Distribution Central Limit Theory 1000 random numbers =2 N N =1 0 In the limit, as the sample size becomes large, the sum (or the mean) of a set of independent measurements will have a normal distribution, non-normal distribution irrespective of the distribution of the raw data. If there is a population which has a mean of µand a standard deviation If we repeatedly draw a sample of N values from a of σ, then the distribution of its sampling means will have: population and calculate the mean of that sample, the µ µ (1) a mean x = , and distribution of sample means tends to be in normal σ σ 1/2 (2) a standard deviation x = /(N) = standard error of the mean distribution. ESS210B ESS210B (Figure from Panofsky and Brier 1968) Prof. Jin-Yi Yu Prof. Jin-Yi Yu Sample Size Normal Distribution of Sample Means The distribution of sample means can be transformed into the standard normal distribution with the following (S-σ)/σ * 100 transformation: N=30 The bigger the sample size N, the more accurate the estimated mean and standard deviation are to the true values.