Econometrics in Practice

Donggyu Sul

2015

Abstract

1 1 Introduction: Types of Data

There are three types of data: Cross sectional data, Time series data, Panel data.

1.0.1 Cross sectional data (i = 1; :::; n)

- True cross sectional data: Time invariant. Example; personality, preference, I.Q., blood type. - Pseudo cross sectional data: Time varying. Example; personal income, real estate prices,

1.0.2 Time series data (t = 1; ::; T )

- Stationary data: The variance is not time varying. Example; interest rate, Texas state income growth rates - Nonstationary data: The variance is time varying, and sometimes, the variance is increasing over time. Example; US income level, stock prices, exchange rates.

1.0.3 Panel data

Combining with cross and time. - short panel: large n but small T - long panel: small n but large T

Depending on a type of the data, the of interest are in general di¤erent.

2 Part I Cross Section Data

Parameters of interest: and shape of distribution.  Learn the di¤erence between the true and pseudo cross sectional data  Learn how to compare two samples. (independent and dependent)  Learn how to model a linear regression  Learn the role of missing variables on the regression results 

3 2 Central Tendency

Mean, , Quantile . Example: Female and male income comparison.

2.1 Mean: Statistics

Sample mean:

1 n ^ = y n i i=1 X Weighted mean:

n n

^we = !iyi; where !i = 1: i=1 i=1 X X Trimmed mean: Discard a small percentage of the largest and smallest values before calculating the mean. Let y1 y2 yn: Then       n m 1 ^ = y ; = m=n: trim n 2m i i=m+1 X

Winsorized mean: Set y1 = ym+1; ; ym = ym+1 and yn = ym n; yn 1 = ym n; ; ym n+1 =       ym n: Then take the mean. Example: n = 50: = 2%: The 2% winsorized mean is

original sample: from smallest to largest y1; y2; y3; y4; y5; ; y46; y47; y48; y49; y50    Trimmed mean 2%: 0; y2; y3; y4; y5; ; y46; y47; y48; y49; 0    Winsorized mean 2% y2; y2; y3; y4; y5; ; y46; y47; y48; y49; y49   

Googling: Trimmed Mean PCE In‡ation Rate http://www.dallasfed.org/research/pce/

2.2 Mean: Statistical Inference

How to evaluate whether or not the sample mean is meaningful. Or test if the estimated sample mean is di¤erent from a referenced value (for example, zero or unity)

Basic idea: Use central limit theorem which is a nice statistical device.

4 Central limit theorem (CLT) 2 1 n d  Let ^ = yi: As n ; (^ ) 0; : n n i=1 ! 1 n ! N n   P 2.2.1 Interpretation:

1. dthe distribution (d) approaches to ! 2. (0; v): Normal distribution with the variance of v: N 3. Normal distribution: probability density function or Pr(x = c) :

(a) Example: (Probability of x = 0)

1 (x )2 exp 2 p2 " 2 #

(b) If  = 0 and 2 = 1 (standard normal distribution), then Pr(x = 0) becomes

1 (0 0)2 1 1 exp = exp (0) = = 0:39894 p2 " 2 # p2 p2

(c) If  = 0 and 2 = 0:5; then Pr(x = 0) becomes

1 (0 0)2 1 1 exp = exp (0) = = 0:5642 p0:5 p2 " 2 0:5 # p0:5 p2 p0:5 p2     4. Why 2 =n ? As n ; the variance goes to zero. As the variance approaches to  ! 1 zero, the probability of (^ = ) becomes unity.

5.  : unknown mean.

6. ^n : the sample mean with the sample size of n:

7. n : as the sample size becomes in…nitely large. ! 1

2.2.2 Statistical Testing

Central limit theorem (CLT): If yi is i.i.d.

pn (^n ) d As n ; t^ = (0; 1) ! 1 2 ! N ^ q 5 What is i.i.d?

independently identical distribution. 

independent: yi is not correlated with yj 

identical: The variance of yi is the same as that of yj 

T-statistics

1. Calculate the sample mean ^n

2 1 n 2 2. Calculate the sample variance ^ = (yi ^ )  n 1 i=1 n P Why (n 1) rather than n? You will learn (sure?) it later.  Want to know now? Do the following.  2 2 n 1 n n 2 1 n (a) Show that E yi yi = E y yi i=1 n i=1 i=1 i n i=1   !   ! P2 2 P n 2 n P 2 P (b) Assume that Eyi =  : Then E i=1 yi = i=1 Eyi =? 1 2 1 1 (c) Show that n y = Pn y2 + P n n y y : n i=1 i n2 i=1 i n i=1 i=j i j   6 P 2 P P P 1 n 2 (d) Show that E yi =  =n if Eyiyj = 0 for i = j: n i=1 6   P 3. Calculate the t-statistics

4. Compare the critical value. (?) what is this?

Testing (Size of the Test)

Plot a bell shape standard normal density here  5% and 95% values: -1.65, 1.65 => Meaning: if a t-value is inside this range, ^ is the  same as :

– Any mistake? Yes, we allow 5% mistakes (both sides) => total 10%. Too large?

2.5% and 97.5%: -1.96, 1.96 

6 – 5% total mistake. Still huge? Think about this. You are a MD. Mistakely you can’t…nd a cancer. 5%: Huge. Actual: 50%? – What if you are a judge? Is -1.96 a good number to reject that ^ is ?

We call 5% or 2.5% as the size of the test: Probability of the rejection when the null  is true.

In Physics, the mistake rate should be less than 0.0001%. In Statistics, usually 5%. 

What if yi is not independent?

No solution in cross sectional data if we don’tknow how to order yi 

Spatial analysis: order yi by using location. And assume that the dependence increases  as yi is near by yj

Whatelse? Nothing much. 

What if yi is not identical

2 1 n 2 Basically, it is okay. De…ne  = limn i=1 i :  !1 n Statistically ^2 2 : P   ! 

Standard Deviation and Standard Error

2 1 n 2 Standard deviation (SD) ^ = ^ = (yi ^ )   n 1 i=1 n r q P Standard error (SE) s^ = ^=pn: So usually t^ == ^ s^ if  = 0: 

How to estimate the variance of weighted Mean

2 n 2 2 ^ = ! (yi ^ ) or alternatively,  ;we i=1 i we P n 2 2 i=1 !i n 2 ^ = 1 !i (yi ^ ) :  ;we ( n ! )2 i=1 we Pi=1 i ! P CLT works withP^ and ^;we also.  we

7 2.3 Median

The median is the number separating the higher half of a data sample from the lower half.

When median is used (Example)

house price. Why?  household income. Why not mean? 

Median is a better measure of the central tendency when a distribution is asymmetric or skewed. Example: {1,2,2,3,4,5,20}. Sample mean: 1+2+2+3+4+5+20 = 37:0. 37=7 = 5:2857: Median: 3.

Is median e¢ cient than mean?

What is e¢ cient? => smaller variance.  Answer is "No" in general. 

2.4 Quantile Mean

Calculate 25% of the quantile and 75% of the quantile. And then take their . Example: {1,2,3,4,5,6,7,8,9,100,200}. 25% quantile = 3, 75% quantile = 9. (3+9)/2 = 6. Median = 6. Mean = ?

8 3 Distribution

Learn how to plot a probability density function (pdf) or distribution, and cumulative density function. Learn how to interpret a pdf and how to use a cdf.

Basic: Histogram

1. Select bandwidth or interval size. Smaller interval size makes more fuzziness. Large interval size makes too roughness.

2. Count the total number of observations for each interval.

3. Plot histogram.

Advance: Kernel Estimation

Nonparametric estimation (based on parametric assumption).  The formula is given by  n 1 x xi f (x) = K nh h i=1 X   Basically, one assume a particular kernel function (normal, uniform, epanechnikov for  example), and for each point, one calculate its pseudo density, and then sum up all densities.

Among many kernel function, epanechnikov kernel is known as the most e¢ cient one.  Do you need to plot one? Use histogram. 

In many cases, you don’tneed to plot a pdf by yourself. The most important matter is whether or not the distribution is skewed. If a distribution becomes really skewed, then the better central tendency becomes median or quantile mean rather than the sample mean.

Next, we will learn how to plot an empirical cumulative function.

9 Empirical Cumulative Density Function

1. Sort from the smallest to the largest.

2. Set the smallest as 1=n; the second smallest as 2=n; :::; and the largest as n=n:

3. Plot the data over the sequence (1=n; 2=n; ; 1):   

How to use Empirical CDF

Testing whether or not observed samples follow a particular distribution. We call one  sample distribution test.

– Kolmogorov-Smirnov (KS) test – Anderson-Darling test – Watson test – Cramer-von Mises test etc..

2 Suppose that yi (;  ) ; then you can plot the theoretical cdf. Compare the   N theoretical cdf with the empirical one. And then test whether or not yi is actually distributed as a normal.

Also you can test whether or not two samples share the same distribution. We call this  two-sample distribution test. Still you can use KS test by comparing the maximum di¤erence between two empirical cdfs.

4 Exercise 1

Use PCE_Inf_90item.xls. Check the assignment sheet. Find which years you need to use for this exercise.

1. Estimate the following statistics for each year: sample mean, median, quantile mean, 5% trimmed mean (low 5 and high 5), 5% winsorized mean.

10 2. Use the sample mean and sample variance: Test whether or not the mean of the in‡ation rate is equal to 2% for each year.

3. Plot historgram for each year.

4. Plot distribution for each year by using at least two kernels.

5. Plot empirical CDF.

6. Test whether or not the observed series are normaly distributed. (KS test)

7. Test whether or not the …rst year sample shares the same distribution with the second year sample. Do the same test with the second and third year.

11