Outline of Eco 251-Descriptive Stats

251descr 8/16/01 (Open this document in 'Outline' view!)

ECONOMICS 251 COURSE OUTLINE

A. Introduction
1. Definitions
   Define Statistics, Descriptive and Analytic Statistics, Induction and Deduction.
2. Uses of Statistics

B. Sources and Types of Data
1. Data
   Define data sets, observation, unit of observation. Qualitative and quantitative data. Nominal, ordinal, interval and ratio data. Discrete vs. continuous data.
   a. Qualitative Data
      (i) Nominal Data: There is no natural number scale. Numbers are used only to define categories, so operations like addition or multiplication are not valid.
      (ii) Ordinal Data: Numbers are used only to order things (e.g. first, second, third). Differences between ranks do not always have the same meaning. Most mathematical operations are still not valid.
   b. Quantitative Data
      (i) Interval Data: Differences between values have a consistent meaning but, as with Celsius temperature, there is no natural origin, so although addition and subtraction can be used, multiplication and division have no real meaning.
      (ii) Ratio Data: There is a meaningful origin, so multiplication and division are valid.

2. Sources
   Define primary and secondary sources, internal and external data.
3. Cross Section and Time Series Data
   a. Cross Section Data
   b. Time Series Data
      i. Indices
      ii. Real Values
      iii. Rates of Change
      iv. Logarithms

C. Presentation of Data
1. Classification
   Define collectively exhaustive and mutually exclusive classes. These are not the same thing. Collectively exhaustive means that every item you are considering has a place in a class. Mutually exclusive means that if an item belongs in any given class, it does not belong in another class as well.
2. Tables
   Define parts of tables. See 251pttbl.
3. Charts and Graphs
   Define parts of graphs.

D. Frequency Distributions and Populations.
1. Definitions
   Meaning of Population, Frame, Census, Sample, Grouped Data, Frequency, Example of Frequency Distribution, Relative Frequency. Width of a class interval:
   $w = \frac{\text{largest} - \text{smallest}}{\text{number of classes}}$ (Always round this result up!)
2. Graphs of the Frequency Distribution.
   a. The Histogram
   b. The Frequency Polygon
   c. The Cumulative Frequency Distribution (Ogive)
   d. Relative Frequencies
   e. Smoothed Histograms
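The class-width rule and the frequency, relative frequency and cumulative frequency columns can be sketched in Python as follows. This is only an illustration: the raw profit rates are invented, the function names are hypothetical, and rounding the width up to a whole unit is an assumption (the outline only says to round up).

```python
import math

def class_width(values, n_classes):
    """w = (largest - smallest) / number of classes, rounded up.

    Rounding up to a whole unit is an assumption made for this sketch.
    """
    return math.ceil((max(values) - min(values)) / n_classes)

def frequency_distribution(values, first_lower_limit, width, n_classes):
    """Return one (lower limit, f, relative f, cumulative F) row per class."""
    freqs = [0] * n_classes
    for v in values:
        # Index of the class whose lower limit is at or below v
        freqs[int((v - first_lower_limit) // width)] += 1
    n = len(values)
    rows, cumulative = [], 0
    for k, f in enumerate(freqs):
        cumulative += f
        rows.append((first_lower_limit + k * width, f, f / n, cumulative))
    return rows

# Hypothetical raw profit rates (percent), invented for illustration
rates = [9.2, 10.1, 10.8, 11.0, 11.5, 12.9, 13.1, 13.4, 13.9, 14.2,
         14.8, 15.0, 15.7, 16.3, 17.6]
w = class_width(rates, 5)
table = frequency_distribution(rates, 9, w, 5)
for lower, f, rel, F in table:
    print(f"{lower} to under {lower + w}: f={f}, relative f={rel:.3f}, F={F}")
```

Plotting f against the classes gives the histogram; plotting F gives the ogive.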

E. Sampling and Descriptive Statistics.

1. Sampling to Learn About a Population. Infinite and finite populations, target and sampled populations, the Stability of Mass Data.
2. The Meaning of Random Sampling. A simple random sample of n items taken from a population of N items must be selected in such a way that all combinations of n items are equally likely.
3. Descriptive Statistics.
   a. Measures of Central Tendency. (Where's the middle of the data?)
   b. Measures of Dispersion. (How spread out are the data?)
   c. Measures of Asymmetry etc. (What else can I say about the shape?)
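As a small illustration of the definition of a simple random sample in item 2, the sketch below lists every combination of n items from a tiny population and then draws one sample. The population labels and the seed are invented; `random.sample` draws without replacement, so each listed combination is equally likely to be the one selected.

```python
import itertools
import math
import random

population = ["A", "B", "C", "D", "E"]  # hypothetical population, N = 5
n = 2

# Every possible sample of n items; a simple random sample must give
# each of these the same chance of being chosen.
all_samples = list(itertools.combinations(population, n))
print(len(all_samples), "possible samples, each with probability",
      1 / math.comb(len(population), n))

rng = random.Random(251)       # arbitrary seed so the draw is repeatable
sample = rng.sample(population, n)
print("selected:", sample)
```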

F. Measures of Central Tendency.
1. The Arithmetic Mean of Ungrouped Data.
   a. The Population Mean. $\mu = \frac{\sum x}{N}$
   b. The Sample Mean. $\bar{x} = \frac{\sum x}{n}$

2. The Arithmetic Mean of Grouped Data. For grouped data generally substitute $\sum f$ for $\sum$, so that the sample mean becomes $\bar{x} = \frac{\sum fx}{n}$ with $n = \sum f$. For $x$ substitute the midpoint of the group. This is defined for our purposes as the arithmetic mean of the lower limit of the group in question and the lower limit of the next group. In other words, if we have the group 10 to 10.99 followed by 11 to 11.99, the midpoint of the first group is 10.50, not 10.495.

3. The Weighted Arithmetic Mean. $\mu = \frac{\sum wx}{\sum w}$, $\bar{x} = \frac{\sum wx}{\sum w}$
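The midpoint convention and the grouped mean can be sketched as a weighted mean whose weights are the group frequencies. The function names are hypothetical; the data are the profit-rate groups used in the examples later in this outline, and 19 is assumed as the lower limit of the group that would follow the last one.

```python
def midpoints(lower_limits, next_lower_limit):
    """Midpoint = mean of a group's lower limit and the next group's lower limit."""
    limits = list(lower_limits) + [next_lower_limit]
    return [(lo + hi) / 2 for lo, hi in zip(limits, limits[1:])]

def weighted_mean(xs, ws):
    """Sum of w*x divided by sum of w."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

lower_limits = [9, 11, 13, 15, 17]
freqs = [3, 3, 5, 3, 1]
mids = midpoints(lower_limits, 19)       # 19 is an assumed next lower limit
grouped_mean = weighted_mean(mids, freqs)
print(mids, round(grouped_mean, 3))
```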

4. The Median of Ungrouped Data. Defined simply as the middle point when the data are in order. If there are two middle points, take their arithmetic mean. In continuous data half the points will be above the median and half below.

5. The Median of Grouped Data. position $= p(n+1) = \frac{1}{2}(n+1)$ and $x_{1-p} = L_p + \left[\frac{pn - F}{f_p}\right] w$. See the formula for fractiles below and remember that the median is the .5 fractile.

6. The Mode Simply the most common point, not very useful in discrete ungrouped data. For grouped data it is defined as the midpoint of the largest group.

7. Other Means.
   a. The Geometric Mean. $x_g = (x_1 x_2 x_3 \cdots x_n)^{1/n} = \sqrt[n]{\prod x}$ or $\ln x_g = \frac{1}{n}\sum \ln(x)$
   b. The Harmonic Mean. $\frac{1}{x_h} = \frac{1}{n}\sum \frac{1}{x}$
   c. The Root-Mean-Square. $x_{rms} = \sqrt{\frac{1}{n}\sum x^2}$ or $x_{rms}^2 = \frac{1}{n}\sum x^2$
   d. What Formulas for Means Have in Common. $f(\bar{x}) = \frac{1}{n}\sum f(x)$, with $f(x) = x$, $\ln x$, $\frac{1}{x}$ and $x^2$ giving the arithmetic, geometric, harmonic and root-mean-square means respectively.
8. Measures of Position. Percentiles, deciles, quintiles, quartiles and fractiles. The two formulas below are two-step formulas. The first step is multiplying $n+1$ (or $N+1$) by $p$. $p$ represents the fractile of the data wanted. For example, if we want the 91st percentile, $p$ is .91. Note that the number you have found is called $x_{1-p} = x_{1-.91} = x_{.09}$ (i.e. 9% from the top!). If we want the third quartile, $Q_3 = x_{.25}$, $p$ is $\frac{3}{4}$ or 0.75. If we want the first quartile, $Q_1 = x_{.75}$, $p$ is $\frac{1}{4}$ or 0.25. Of course, for the median $p = .5$. $N$ or $n$ represents the number of items in the population or sample, not the number of groups.
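The common pattern behind the means in section 7 can be sketched with one helper that applies a function to every value, averages, and then undoes the function. The helper name `f_mean` is hypothetical and the values 2, 3, 5 are borrowed from the variance example later in this outline.

```python
import math

def f_mean(xs, f, f_inverse):
    """Mean defined by f(mean) = (1/n) * sum of f(x)."""
    return f_inverse(sum(f(x) for x in xs) / len(xs))

xs = [2, 3, 5]
arithmetic = f_mean(xs, lambda x: x, lambda y: y)
geometric = f_mean(xs, math.log, math.exp)            # ln x_g = mean of ln x
harmonic = f_mean(xs, lambda x: 1 / x, lambda y: 1 / y)
rms = f_mean(xs, lambda x: x ** 2, math.sqrt)

print(harmonic, geometric, arithmetic, rms)
```

For positive data that are not all equal, the four results come out in the order harmonic < geometric < arithmetic < root-mean-square.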

a. Finding a Fractile of Grouped Data. To use this formula, we must first compute the cumulative distribution of the groups and determine in which group the desired fractile is located with the calculation position $= p(n+1)$. Once we have found the group that this is in, let $f_p$ be the frequency of the chosen group, and let $F$ be the cumulative frequency up to but not including the chosen group. The formula here is $x_{1-p} = L_p + \left[\frac{pn - F}{f_p}\right] w$. In this formula, $w$ is the class interval (the interval between the lower limit of the chosen group and the lower limit of the next group) and $L_p$ is the lower limit of the chosen group. Suppose that in the example below we must find the first quartile. Since the first quartile is the .25 fractile, $p$ is .25. To locate the group use position $= p(n+1) = 0.25(16) = 4$.

Profit Rate     f     F
9-10.99%        3     3
11-12.99%       3     6
13-14.99%       5    11
15-16.99%       3    14
17-18.99%       1    15
Total          15

Using the cumulative distribution ($F$) column, we find the fourth item in the sample. Since 4 is above 3 and below 6 in the $F$ column, we pick the group 11-12.99%. $n$ is 15, and for the group we have picked, $w = 13 - 11 = 2$, $L_p = 11$, $F = 3$, and $f_p = 3$. If we put these numbers into the formula, we find that $x_{1-.25} = x_{.75} = 11 + \left[\frac{.25(15) - 3}{3}\right] 2 = 11.5$.

Note: Sometimes $\frac{pn - F}{f_p}$ is negative. In this case choose the group before the one you would ordinarily have chosen. Example: If you want the 19th percentile of the data above, position $= p(n+1) = .19(16) = 3.04$, which would normally take us into 11-12.99. But $\frac{pn - F}{f_p} = \frac{.19(15) - 3}{3} = -0.05$, so use the group 9-10.99 instead. But see c below.
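A minimal Python sketch of the two-step grouped-data formula in 8a, including the step-back rule from the note, might look like this. The function name is hypothetical, and treating a position that lands exactly on a cumulative frequency as belonging to that group is an assumption.

```python
def grouped_fractile(p, lower_limits, freqs, width):
    """Fractile x_{1-p} of grouped data by the two-step formula in 8a."""
    n = sum(freqs)
    cumulative, running = [], 0
    for f in freqs:
        running += f
        cumulative.append(running)

    # Step 1: locate the group holding item number p(n + 1).
    position = p * (n + 1)
    k = next((i for i, F in enumerate(cumulative) if position <= F),
             len(freqs) - 1)
    F_before = cumulative[k - 1] if k > 0 else 0

    # Note in 8a: if pn - F is negative, use the group before.
    if p * n - F_before < 0 and k > 0:
        k -= 1
        F_before = cumulative[k - 1] if k > 0 else 0

    # Step 2: interpolate inside the chosen group.
    return lower_limits[k] + (p * n - F_before) / freqs[k] * width

lower_limits = [9, 11, 13, 15, 17]
freqs = [3, 3, 5, 3, 1]
first_quartile = grouped_fractile(0.25, lower_limits, freqs, 2)
median = grouped_fractile(0.50, lower_limits, freqs, 2)
pct19 = grouped_fractile(0.19, lower_limits, freqs, 2)
print(first_quartile, median, pct19)
```

For the 19th percentile the sketch steps back to the group 9-10.99, as the note prescribes, and interpolates there.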

b. Finding a Fractile of Ungrouped Data. This time when we compute position $= p(n+1)$, we divide it into an integer part, $a$, and a fractional part, $.b$. For example, if $n = 10$ and we wish to find the first quartile, $p = 0.25$, so that $p(n+1) = 0.25(11) = 2.75$. Then $a = 2$ and $.b = .75$. Now find $x_a$ and $x_{a+1}$, in this case $x_2$ and $x_3$, and use the formula $x_{1-p} = x_a + .b(x_{a+1} - x_a)$. For example, if our sample consists of the 10 numbers 1, 5, 7, 9, 9, 11, 13, 14, 17, 19, then $x_a = x_2 = 5$ and $x_{a+1} = x_3 = 7$, so that $x_{1-p} = x_{.75} = 5 + 0.75(7 - 5) = 6.5$.

c. Experimental formula (Don't read this!) Because of problems with the grouped data formula above, I intend to experiment with a new pair of formulas: position $= 1 + p(n-1) = a.b$ (the position formula can be used with both grouped and ungrouped data) and $x_{1-p} = L_p + \left[\frac{p(n-1) + 0.5 - F}{f_p}\right] w$.

Example: Using the data in 8a, $n = 15$.

First quartile: position $= 1 + p(n-1) = 1 + .25(14) = 4.5$. This is in group 11-12.99. $x_{1-.25} = x_{.75} = 11 + \left[\frac{.25(14) + 0.5 - 3}{3}\right] 2 = 11.67$.

Median: position $= 1 + p(n-1) = 1 + .5(14) = 8$ (same as with the old formula). This is in group 13-14.99. $x_{1-.5} = x_{.5} = 13 + \left[\frac{.5(14) + 0.5 - 6}{5}\right] 2 = 13.6$.

Third quartile: position $= 1 + p(n-1) = 1 + .75(14) = 11.5$. This is in group 15-16.99. $x_{1-.75} = x_{.25} = 15 + \left[\frac{.75(14) + 0.5 - 11}{3}\right] 2 = 15$.

Seventy-fourth percentile: position $= 1 + p(n-1) = 1 + .74(14) = 11.36$. This is in group 13-14.99. Why? The cumulative frequency through the group 13-14.99 is 11. This means that numbers up to $x_{11}$ are in 13-14.99 or lower groups and that $x_{12}$ and numbers above it are in 15-16.99 and higher groups. Thus we set the boundary at 11.5. $x_{1-.74} = x_{.26} = 13 + \left[\frac{.74(14) + 0.5 - 6}{5}\right] 2 = 14.94$.

Nineteenth percentile: position $= 1 + p(n-1) = 1 + .19(14) = 3.66$. This is in group 11-12.99. $x_{1-.19} = x_{.81} = 11 + \left[\frac{.19(14) + 0.5 - 3}{3}\right] 2 = 11.11$.
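The ungrouped-data formula in 8b and the experimental grouped-data formula in 8c can be sketched together. Function names are hypothetical. Clamping positions that fall outside 1..n to the smallest or largest value is an assumption, since the outline does not cover that case; a position that lands exactly on a boundary F + 0.5 is assigned to the higher group, as in the third-quartile example.

```python
def ungrouped_fractile(p, data):
    """x_{1-p} = x_a + .b (x_{a+1} - x_a), where position p(n+1) = a.b."""
    xs = sorted(data)
    n = len(xs)
    position = p * (n + 1)
    if position <= 1:            # assumption: clamp to the smallest value
        return xs[0]
    if position >= n:            # assumption: clamp to the largest value
        return xs[-1]
    a = int(position)            # integer part
    b = position - a             # fractional part
    return xs[a - 1] + b * (xs[a] - xs[a - 1])

def experimental_grouped_fractile(p, lower_limits, freqs, width):
    """Experimental formula: position = 1 + p(n-1), boundaries at F + 0.5."""
    n = sum(freqs)
    position = 1 + p * (n - 1)
    running = 0                  # cumulative frequency before group k
    for k, f in enumerate(freqs):
        if position < running + f + 0.5:
            break
        running += f
    return lower_limits[k] + (p * (n - 1) + 0.5 - running) / f * width

sample = [1, 5, 7, 9, 9, 11, 13, 14, 17, 19]
print(ungrouped_fractile(0.25, sample))

lower_limits = [9, 11, 13, 15, 17]
freqs = [3, 3, 5, 3, 1]
for p in (0.25, 0.5, 0.75, 0.74, 0.19):
    print(p, round(experimental_grouped_fractile(p, lower_limits, freqs, 2), 2))
```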

G. Measures of Dispersion and Asymmetry.
1. Range
   Range = highest number - lowest number, or highest midpoint - lowest midpoint. Interquartile Range: $IQR = Q_3 - Q_1$.
2. The Variance and Standard Deviation of Ungrouped Data.

a. The Population Variance - Definitional and Computational Formulas.
Definitional: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$. Computational: $\sigma^2 = \frac{\sum x^2}{N} - \mu^2$.
Standard Deviation $= \sqrt{\text{variance}}$.

b. The Sample Variance.
Definitional: $s^2 = \frac{\sum (x - \bar{x})^2}{n-1}$. Computational: $s^2 = \frac{\sum x^2 - n\bar{x}^2}{n-1}$.
The computational formula is one of the most important formulas you will learn. Note that $\sum x^2$ is not the same as $\left(\sum x\right)^2$. For example, if $x$ is 2, 3, 5, $\sum x^2 = 2^2 + 3^2 + 5^2 = 4 + 9 + 25 = 38$, not $(2 + 3 + 5)^2 = 10^2 = 100$.

Example: Use $x = 2, 3, 5$.

Computational Method
x        x²
2         4
3         9
5        25
10       38      (totals)

Definitional Method
x        x - x̄        (x - x̄)²
2       -1.33333      1.77778
3       -0.33333      0.11111
5        1.66667      2.77778
10       0.00001      4.66667      (totals)

From this we find $\sum x = 10$, $\sum x^2 = 38$, $\bar{x} = \frac{\sum x}{n} = \frac{10}{3} = 3.33333$ and $\sum (x - \bar{x})^2 = 4.66667$. Note that $\sum (x - \bar{x})$ should be zero, but is not because of rounding. Now, if we use the computational method, we can use $s^2 = \frac{\sum x^2 - n\bar{x}^2}{n-1} = \frac{38 - 3(3.33333)^2}{3-1} = \frac{4.6667}{2} = 2.3333$. (Some texts prefer $s^2 = \frac{\sum x^2 - \frac{1}{n}\left(\sum x\right)^2}{n-1} = \frac{38 - \frac{1}{3}(10)^2}{3-1} = \frac{4.66666667}{2} = 2.33333$, which gives us a little more accuracy for a little more work.) If we use the definitional method, $s^2 = \frac{\sum (x - \bar{x})^2}{n-1} = \frac{4.66667}{2} = 2.33333$, but note that we had to do three subtractions instead of one.
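The two sample-variance formulas can be compared directly in a short Python sketch using the same data, 2, 3, 5. The function names are hypothetical.

```python
def variance_definitional(xs):
    """s^2 = sum (x - xbar)^2 / (n - 1)"""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def variance_computational(xs):
    """s^2 = (sum x^2 - n * xbar^2) / (n - 1)"""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean ** 2) / (n - 1)

xs = [2, 3, 5]
sum_of_squares = sum(x * x for x in xs)   # sum of x^2
square_of_sum = sum(xs) ** 2              # (sum of x)^2, a different quantity
print(sum_of_squares, square_of_sum)
print(variance_definitional(xs), variance_computational(xs))
```

Both functions return the same value up to floating-point rounding; the computational version needs only the two running totals $\sum x$ and $\sum x^2$.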

c. The Coefficient of Variation. $C = \frac{\text{std. deviation}}{\text{mean}}$

d. Chebyshef's Inequality and the Empirical Rule
Chebyshef Inequality: $P\left(|x - \mu| \ge k\sigma\right) \le \frac{1}{k^2}$
Empirical rule: (For Symmetrical Unimodal distributions only) 68% within one standard deviation of the mean, 95% within two and almost all within three.
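Chebyshef's inequality can be checked numerically on any data set treated as a population. The sketch below reuses the ten numbers from the ungrouped fractile example in 8b; treating them as a whole population (so that the divide-by-N standard deviation applies) is an assumption made for illustration.

```python
import statistics

data = [1, 5, 7, 9, 9, 11, 13, 14, 17, 19]   # treated here as a population
mu = statistics.mean(data)
sigma = statistics.pstdev(data)               # population standard deviation

def share_at_least_k_sigma_away(k):
    """Proportion of values with |x - mu| >= k * sigma."""
    return sum(abs(x - mu) >= k * sigma for x in data) / len(data)

for k in (1.5, 2, 3):
    print(f"k={k}: observed {share_at_least_k_sigma_away(k):.2f}, "
          f"Chebyshef bound {1 / k ** 2:.3f}")
```

The observed proportions are usually far below the bound, since the inequality must hold for every distribution.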

3. The Variance and Standard Deviation of Grouped Data. For grouped data generally substitute $\sum f$ for $\sum$.

4. Skewness and Kurtosis. Population skewness, the 3rd k-statistic, coefficients of skewness; population kurtosis, the 4th k-statistic, the coefficient of excess; leptokurtic, platykurtic and mesokurtic distributions.

The usual measurement of skewness is often called the third moment about the mean. (The population variance is the second.) The formula for population skewness is $\mu_3 = \frac{\sum (x - \mu)^3}{N}$. The corresponding sample statistic is the third k-statistic, $k_3 = \frac{n}{(n-1)(n-2)} \sum (x - \bar{x})^3$. The corresponding computational formulas are

$\mu_3 = \frac{1}{N}\left[\sum x^3 - 3\mu \sum x^2 + 2N\mu^3\right]$ and $k_3 = \frac{n}{(n-1)(n-2)}\left[\sum x^3 - 3\bar{x} \sum x^2 + 2n\bar{x}^3\right]$.

To make grouped data formulas, put an $f$ to the right of the $\sum$ sign. Positive values of these formulas imply skewness to the right, negative values to the left. Note that multiplying all the values of $x$ by two would multiply the values of these coefficients by eight, but would not change the shape of the distribution. If we want to compare shapes, we need measurements that will not change if we multiply all values by a constant. Such a measure would be called the coefficient of relative skewness, with the formulas $\gamma_1 = \frac{\mu_3}{\sigma^3}$ and $g_1 = \frac{k_3}{s^3}$.

Note that for the Normal distribution $\mu_3 = 0$. Another measure of skewness is Pearson's measure of skewness, $SK = \frac{3(\text{mean} - \text{mode})}{\text{std. deviation}}$; the median is sometimes used instead of the mode in this formula.

Example:

Profit Rate     f    x (midpoint)     fx      fx²      fx³
9-10.99         3        10           30      300     3000
11-12.99        3        12           36      432     5184
13-14.99        5        14           70      980    13720
15-16.99        3        16           48      768    12288
17-18.99        1        18           18      324     5832
Total          15                    202     2804    40024

So $\sum f = n = 15$, $\sum fx = 202$, $\sum fx^2 = 2804$, $\sum fx^3 = 40024$, so that $\bar{x} = \frac{\sum fx}{n} = \frac{202}{15} = 13.467$ and $s^2 = \frac{\sum fx^2 - n\bar{x}^2}{n-1} = \frac{2804 - 15(13.467)^2}{15-1} = \frac{83.733}{14} = 5.981$, which means $s = \sqrt{5.981} = 2.446$. $C = \frac{s}{\bar{x}} = \frac{2.446}{13.467} = 0.182$.

To measure skewness, use one of the following three results.

$k_3 = \frac{n}{(n-1)(n-2)}\left[\sum fx^3 - 3\bar{x} \sum fx^2 + 2n\bar{x}^3\right] = \frac{15}{(14)(13)}\left[40024 - 3(13.467)(2804) + 2(15)(13.467)^3\right] = \frac{15(8.249)}{(14)(13)} = 0.680$,

or Relative Skewness $g_1 = \frac{k_3}{s^3} = \frac{0.680}{2.446^3} = .046$,

or Pearson's Measure of Skewness $SK = \frac{3(\text{mean} - \text{mode})}{\text{std. deviation}} = \frac{3(13.467 - 14)}{2.446} = -0.654$ (substituting the grouped-data median, 13.6, for the mode gives $-0.163$).

Note that, in this case, Pearson's Measure and Relative Skewness contradict each other as to the direction of skewness.
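The computational-formula example above can be reproduced with a short Python sketch that works from the midpoints and frequencies. Variable names are hypothetical; the mode is taken as the midpoint of the largest group, following F.6.

```python
import math

mids = [10, 12, 14, 16, 18]
freqs = [3, 3, 5, 3, 1]

n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, mids))
sum_fx2 = sum(f * x ** 2 for f, x in zip(freqs, mids))
sum_fx3 = sum(f * x ** 3 for f, x in zip(freqs, mids))

mean = sum_fx / n
s2 = (sum_fx2 - n * mean ** 2) / (n - 1)      # computational variance
s = math.sqrt(s2)
C = s / mean                                   # coefficient of variation

# Third k-statistic, computational form for grouped data
k3 = n / ((n - 1) * (n - 2)) * (sum_fx3 - 3 * mean * sum_fx2 + 2 * n * mean ** 3)
g1 = k3 / s ** 3                               # relative skewness

mode = mids[freqs.index(max(freqs))]           # midpoint of the largest group
SK = 3 * (mean - mode) / s                     # Pearson's measure

print(f"mean={mean:.3f} s2={s2:.3f} s={s:.3f} C={C:.3f}")
print(f"k3={k3:.3f} g1={g1:.3f} SK={SK:.3f}")
```

Carrying full precision in the mean avoids the rounding problems that appear when $\bar{x}$ is cut to three decimals before being cubed.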

The measures of kurtosis are, for populations, $\mu_4 = \frac{\sum (x - \mu)^4}{N} = \frac{1}{N}\left[\sum x^4 - 4\mu \sum x^3 + 6\mu^2 \sum x^2 - 3N\mu^4\right]$ and, for samples, $k_4 = \frac{n^2}{(n-1)(n-2)(n-3)}\left[(n+1)\frac{\sum (x - \bar{x})^4}{n} - \frac{3(n-1)^3 s^4}{n^2}\right]$. $k_4$ can be considered an estimate of $\mu_4 - 3\sigma^4$. To get a measurement of shape use the coefficient of excess $\gamma_2 = \frac{\mu_4}{\sigma^4} - 3$ or $g_2 = \frac{k_4}{s^4}$. Since the Normal distribution has $\mu_4 = 3\sigma^4$, the coefficient of excess is zero for the Normal distribution. Kurtosis has traditionally been considered a measure of the peakedness of a distribution relative to the Normal distribution, though there are some exceptions to this interpretation. If the coefficient of excess is positive, we may call a distribution leptokurtic or sharp-peaked. If the coefficient of excess is negative, the distribution can be called platykurtic or flat-peaked. If the coefficient of excess is close to zero, we call the distribution mesokurtic, middle-peaked. A symmetric, mesokurtic distribution is essentially Normal.

Example (using definitional formulas):

Profit Rate    f   x (midpoint)   fx    x - x̄    f(x - x̄)   f(x - x̄)²   f(x - x̄)³   f(x - x̄)⁴
9-10.99        3       10         30   -3.467    -10.400      36.053     -124.985      433.281
11-12.99       3       12         36   -1.467     -4.400       6.453       -9.465       13.882
13-14.99       5       14         70    0.533      2.667       1.422        0.759        0.405
15-16.99       3       16         48    2.533      7.600      19.253       48.775      123.564
17-18.99       1       18         18    4.533      4.533      20.551       93.165      422.348
Total         15                 202               0.000      83.733        8.249      993.479

So $\sum f = n = 15$, $\sum fx = 202$, $\sum f(x - \bar{x}) = 0$, $\sum f(x - \bar{x})^2 = 83.733$, $\sum f(x - \bar{x})^3 = 8.249$ and $\sum f(x - \bar{x})^4 = 993.479$, so that $\bar{x} = \frac{\sum fx}{n} = \frac{202}{15} = 13.467$ and $s^2 = \frac{\sum f(x - \bar{x})^2}{n-1} = \frac{83.733}{14} = 5.981$, which means $s = \sqrt{5.981} = 2.446$. $C = \frac{s}{\bar{x}} = \frac{2.446}{13.467} = 0.182$.

To measure skewness, use one of the following three results.

$k_3 = \frac{n \sum f(x - \bar{x})^3}{(n-1)(n-2)} = \frac{15(8.249)}{(14)(13)} = 0.680$,

or Relative Skewness $g_1 = \frac{k_3}{s^3} = \frac{0.680}{2.446^3} = .046$,

or Pearson's Measure of Skewness $SK = \frac{3(\text{mean} - \text{mode})}{\text{std. deviation}} = \frac{3(13.467 - 14)}{2.446} = -0.654$ (substituting the grouped-data median, 13.6, for the mode gives $-0.163$).

Note that, in this case, Pearson's Measure and Relative Skewness contradict each other as to the direction of skewness.

$k_4 = \frac{n^2}{(n-1)(n-2)(n-3)}\left[(n+1)\frac{\sum f(x - \bar{x})^4}{n} - \frac{3(n-1)^3 s^4}{n^2}\right] = \frac{15^2}{(14)(13)(12)}\left[\frac{16(993.479)}{15} - \frac{3(14)^3 (5.981)^2}{15^2}\right] = -25.66$. So $g_2 = \frac{k_4}{s^4} = \frac{-25.66}{(5.981)^2} = -0.717$. The negative sign implies that the distribution is platykurtic.

5. Review a. Grouped Data. b. Ungrouped Data.

Appendix: Explanation of Sample Formulas (Not for student consumption)

1. The Sample Variance.

The Sample Variance is defined as $s^2 = \frac{1}{n-1}\sum (x - \bar{x})^2 = \frac{1}{n-1}\left[\sum x^2 - n\bar{x}^2\right]$. If $s^2$ has an expected value of $\sigma^2$, it must be true that $E\left[\sum (x - \bar{x})^2\right] = E\left[\sum x^2 - n\bar{x}^2\right] = (n-1)\sigma^2$. We can assume, without loss of generality, that $\mu = E(x) = 0$. Under these conditions, the Variance is defined as $\sigma^2 = E(x - \mu)^2 = E(x^2)$. Thus $E\left(\sum x^2\right) = n\sigma^2$. An expression like $\bar{x}^2 = \frac{1}{n^2}\left(\sum x\right)^2$ has terms like $\frac{1}{n^2} x_1 x_2$. Because of the independence assumption on the sample, all these terms have expected values of zero except for terms with two identical subscripts, and $E(\bar{x}^2) = E\left[\frac{1}{n^2}\left(\sum x\right)^2\right] = E\left[\frac{1}{n^2}\sum x^2\right] = \frac{1}{n^2} n\sigma^2 = \frac{1}{n}\sigma^2$. Thus

$E\left[\sum x^2 - n\bar{x}^2\right] = E\left(\sum x^2\right) - nE(\bar{x}^2) = n\sigma^2 - n\left(\frac{1}{n}\sigma^2\right) = (n-1)\sigma^2$.

2. The Third k Statistic.

The third k statistic is $k_3 = \frac{n}{(n-1)(n-2)}\sum (x - \bar{x})^3 = \frac{n}{(n-1)(n-2)}\left[\sum x^3 - 3\bar{x}\sum x^2 + 2n\bar{x}^3\right]$. If $k_3$ has an expected value of $\mu_3$, it must be true that

$E\left[\sum (x - \bar{x})^3\right] = E\left[\sum x^3 - 3\bar{x}\sum x^2 + 2n\bar{x}^3\right] = \frac{(n-1)(n-2)}{n}\mu_3$.

We can assume, without loss of generality, that $\mu = E(x) = 0$. Under these conditions, the skewness is defined as $\mu_3 = E(x - \mu)^3 = E(x^3)$. Thus $E\left(\sum x^3\right) = n\mu_3$. An expression like $\bar{x}^3 = \frac{1}{n^3}\left(\sum x\right)^3$ has terms like $\frac{1}{n^3} x_1 x_2 x_3$. Because of the independence assumption on the sample, all these terms have expected values of zero except for terms with three identical subscripts, and $E(\bar{x}^3) = E\left[\frac{1}{n^3}\left(\sum x\right)^3\right] = E\left[\frac{1}{n^3}\sum x^3\right] = \frac{1}{n^3} n\mu_3 = \frac{1}{n^2}\mu_3$. By the same reasoning $E\left(\bar{x}\sum x^2\right) = \mu_3$. Thus

$E\left[\sum x^3 - 3\bar{x}\sum x^2 + 2n\bar{x}^3\right] = E\left(\sum x^3\right) - 3E\left(\bar{x}\sum x^2\right) + 2nE(\bar{x}^3) = n\mu_3 - 3\mu_3 + 2n\frac{1}{n^2}\mu_3 = \left(n - 3 + \frac{2}{n}\right)\mu_3 = \frac{n^2 - 3n + 2}{n}\mu_3 = \frac{(n-1)(n-2)}{n}\mu_3$.

3. And now, for considerable extra credit, what can you say about the expected value of $k_4$?
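The two expected-value results in sections 1 and 2 of this appendix can be checked exactly on a small case by listing every possible sample. The sketch below is an illustration under stated assumptions, not a proof: the three-point population (0, 1, 4, each equally likely) is invented, sampling is independent with replacement, and exact fractions are used so that no rounding enters.

```python
from fractions import Fraction
from itertools import product

values = [Fraction(0), Fraction(1), Fraction(4)]   # hypothetical population
n = 3                                              # sample size

def population_central_moment(r):
    mu = sum(values) / len(values)
    return sum((x - mu) ** r for x in values) / len(values)

sigma2 = population_central_moment(2)
mu3 = population_central_moment(3)

def sample_variance(sample):
    xbar = sum(sample) / n
    return sum((x - xbar) ** 2 for x in sample) / (n - 1)

def third_k_statistic(sample):
    xbar = sum(sample) / n
    return Fraction(n, (n - 1) * (n - 2)) * sum((x - xbar) ** 3 for x in sample)

# All 27 equally likely samples of size 3 drawn with replacement
samples = list(product(values, repeat=n))
expected_s2 = sum(sample_variance(s) for s in samples) / len(samples)
expected_k3 = sum(third_k_statistic(s) for s in samples) / len(samples)

print("E(s^2) =", expected_s2, " sigma^2 =", sigma2)
print("E(k3)  =", expected_k3, " mu3     =", mu3)
```

The same enumeration could be adapted to explore the question about $k_4$ by taking n of at least 4.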
