Econometrics in Practice
Total Page:16
File Type:pdf, Size:1020Kb
Econometrics in Practice Donggyu Sul 2015 Abstract 1 1 Introduction: Types of Data There are three types of data: Cross sectional data, Time series data, Panel data. 1.0.1 Cross sectional data (i = 1; :::; n) - True cross sectional data: Time invariant. Example; personality, preference, I.Q., blood type. - Pseudo cross sectional data: Time varying. Example; personal income, real estate prices, 1.0.2 Time series data (t = 1; ::; T ) - Stationary data: The variance is not time varying. Example; interest rate, Texas state income growth rates - Nonstationary data: The variance is time varying, and sometimes, the variance is increasing over time. Example; US income level, stock prices, exchange rates. 1.0.3 Panel data Combining with cross and time. - short panel: large n but small T - long panel: small n but large T Depending on a type of the data, the statistics of interest are in general di¤erent. 2 Part I Cross Section Data Parameters of interest: central tendency and shape of distribution. Learn the di¤erence between the true and pseudo cross sectional data Learn how to compare two samples. (independent and dependent) Learn how to model a linear regression Learn the role of missing variables on the regression results 3 2 Central Tendency Mean, Median, Quantile Mean. Example: Female and male income comparison. 2.1 Mean: Statistics Sample mean: 1 n ^ = y n i i=1 X Weighted mean: n n ^we = !iyi; where !i = 1: i=1 i=1 X X Trimmed mean: Discard a small percentage of the largest and smallest values before calculating the mean. Let y1 y2 yn: Then n m 1 ^ = y ; = m=n: trim n 2m i i=m+1 X Winsorized mean: Set y1 = ym+1; ; ym = ym+1 and yn = ym n; yn 1 = ym n; ; ym n+1 = ym n: Then take the mean. Example: n = 50: = 2%: The 2% winsorized mean is original sample: from smallest to largest y1; y2; y3; y4; y5; ; y46; y47; y48; y49; y50 Trimmed mean 2%: 0; y2; y3; y4; y5; ; y46; y47; y48; y49; 0 Winsorized mean 2% y2; y2; y3; y4; y5; ; y46; y47; y48; y49; y49 Googling: Trimmed Mean PCE In‡ation Rate http://www.dallasfed.org/research/pce/ 2.2 Mean: Statistical Inference How to evaluate whether or not the sample mean is meaningful. Or test if the estimated sample mean is di¤erent from a referenced value (for example, zero or unity) Basic idea: Use central limit theorem which is a nice statistical device. 4 Central limit theorem (CLT) 2 1 n d Let ^ = yi: As n ; (^ ) 0; : n n i=1 ! 1 n ! N n P 2.2.1 Interpretation: 1. dthe distribution (d) approaches to ! 2. (0; v): Normal distribution with the variance of v: N 3. Normal distribution: probability density function or Pr(x = c) : (a) Example: (Probability of x = 0) 1 (x )2 exp 2 p2 " 2 # (b) If = 0 and 2 = 1 (standard normal distribution), then Pr(x = 0) becomes 1 (0 0)2 1 1 exp = exp (0) = = 0:39894 p2 " 2 # p2 p2 (c) If = 0 and 2 = 0:5; then Pr(x = 0) becomes 1 (0 0)2 1 1 exp = exp (0) = = 0:5642 p0:5 p2 " 2 0:5 # p0:5 p2 p0:5 p2 4. Why 2 =n ? As n ; the variance goes to zero. As the variance approaches to ! 1 zero, the probability of (^ = ) becomes unity. 5. : unknown mean. 6. ^n : the sample mean with the sample size of n: 7. n : as the sample size becomes in…nitely large. ! 1 2.2.2 Statistical Testing Central limit theorem (CLT): If yi is i.i.d. pn (^n ) d As n ; t^ = (0; 1) ! 1 2 ! N ^ q 5 What is i.i.d? independently identical distribution. independent: yi is not correlated with yj identical: The variance of yi is the same as that of yj T-statistics 1. Calculate the sample mean ^n 2 1 n 2 2. Calculate the sample variance ^ = (yi ^ ) n 1 i=1 n P Why (n 1) rather than n? You will learn (sure?) it later. Want to know now? Do the following. 2 2 n 1 n n 2 1 n (a) Show that E yi yi = E y yi i=1 n i=1 i=1 i n i=1 ! ! P2 2 P n 2 n P 2 P (b) Assume that Eyi = : Then E i=1 yi = i=1 Eyi =? 1 2 1 1 (c) Show that n y = Pn y2 + P n n y y : n i=1 i n2 i=1 i n i=1 i=j i j 6 P 2 P P P 1 n 2 (d) Show that E yi = =n if Eyiyj = 0 for i = j: n i=1 6 P 3. Calculate the t-statistics 4. Compare the critical value. (?) what is this? Testing (Size of the Test) Plot a bell shape standard normal density here 5% and 95% values: -1.65, 1.65 => Meaning: if a t-value is inside this range, ^ is the same as : – Any mistake? Yes, we allow 5% mistakes (both sides) => total 10%. Too large? 2.5% and 97.5%: -1.96, 1.96 6 – 5% total mistake. Still huge? Think about this. You are a MD. Mistakely you can’t…nd a cancer. 5%: Huge. Actual: 50%? – What if you are a judge? Is -1.96 a good number to reject that ^ is ? We call 5% or 2.5% as the size of the test: Probability of the rejection when the null is true. In Physics, the mistake rate should be less than 0.0001%. In Statistics, usually 5%. What if yi is not independent? No solution in cross sectional data if we don’tknow how to order yi Spatial analysis: order yi by using location. And assume that the dependence increases as yi is near by yj Whatelse? Nothing much. What if yi is not identical 2 1 n 2 Basically, it is okay. De…ne = limn i=1 i : !1 n Statistically ^2 2 : P ! Standard Deviation and Standard Error 2 1 n 2 Standard deviation (SD) ^ = ^ = (yi ^ ) n 1 i=1 n r q P Standard error (SE) s^ = ^=pn: So usually t^ == ^ s^ if = 0: How to estimate the variance of weighted Mean 2 n 2 2 ^ = ! (yi ^ ) or alternatively, ;we i=1 i we P n 2 2 i=1 !i n 2 ^ = 1 !i (yi ^ ) : ;we ( n ! )2 i=1 we Pi=1 i ! P CLT works withP^ and ^;we also. we 7 2.3 Median The median is the number separating the higher half of a data sample from the lower half. When median is used (Example) house price. Why? household income. Why not mean? Median is a better measure of the central tendency when a distribution is asymmetric or skewed. Example: {1,2,2,3,4,5,20}. Sample mean: 1+2+2+3+4+5+20 = 37:0. 37=7 = 5:2857: Median: 3. Is median e¢ cient than mean? What is e¢ cient? => smaller variance. Answer is "No" in general. 2.4 Quantile Mean Calculate 25% of the quantile and 75% of the quantile. And then take their average. Example: {1,2,3,4,5,6,7,8,9,100,200}. 25% quantile = 3, 75% quantile = 9. (3+9)/2 = 6. Median = 6. Mean = ? 8 3 Distribution Learn how to plot a probability density function (pdf) or distribution, and cumulative density function. Learn how to interpret a pdf and how to use a cdf. Basic: Histogram 1. Select bandwidth or interval size. Smaller interval size makes more fuzziness. Large interval size makes too roughness. 2. Count the total number of observations for each interval. 3. Plot histogram. Advance: Kernel Estimation Nonparametric estimation (based on parametric assumption). The formula is given by n 1 x xi f (x) = K nh h i=1 X Basically, one assume a particular kernel function (normal, uniform, epanechnikov for example), and for each point, one calculate its pseudo density, and then sum up all densities. Among many kernel function, epanechnikov kernel is known as the most e¢ cient one. Do you need to plot one? Use histogram. In many cases, you don’tneed to plot a pdf by yourself. The most important matter is whether or not the distribution is skewed. If a distribution becomes really skewed, then the better central tendency becomes median or quantile mean rather than the sample mean. Next, we will learn how to plot an empirical cumulative function. 9 Empirical Cumulative Density Function 1. Sort from the smallest to the largest. 2. Set the smallest as 1=n; the second smallest as 2=n; :::; and the largest as n=n: 3. Plot the data over the sequence (1=n; 2=n; ; 1): How to use Empirical CDF Testing whether or not observed samples follow a particular distribution. We call one sample distribution test. – Kolmogorov-Smirnov (KS) test – Anderson-Darling test – Watson test – Cramer-von Mises test etc.. 2 Suppose that yi (; ) ; then you can plot the theoretical cdf. Compare the N theoretical cdf with the empirical one. And then test whether or not yi is actually distributed as a normal. Also you can test whether or not two samples share the same distribution. We call this two-sample distribution test. Still you can use KS test by comparing the maximum di¤erence between two empirical cdfs. 4 Exercise 1 Use PCE_Inf_90item.xls. Check the assignment sheet. Find which years you need to use for this exercise. 1. Estimate the following statistics for each year: sample mean, median, quantile mean, 5% trimmed mean (low 5 and high 5), 5% winsorized mean.