
NOTES
1. Stata is case sensitive. In general, use lower case.
2. Putting ‘quietly’ before a statement suppresses the output that would otherwise be echoed on the screen (search this document to see where it is used later).

use http://www.ats.ucla.edu/stat/stata/webbooks/reg/elemapi2
describe
codebook api00 acs_k3 meals full yr_rnd
summarize api00 acs_k3 meals full

The above illustrates some of the commands which can be used in Stata to summarize your data. The first line reads data from a UCLA [University of California, Los Angeles] website.

log using "e:\STATA\record.log"

The above creates a log in which your output etc. is recorded.

log close

The above closes that log. You can then open the log in Word.
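The logging steps above can be sketched as one small runnable sequence. The file name record.log is only illustrative, and the log is written to the current working directory rather than to a fixed drive.

```stata
* Close any log left open by an earlier run; capture hides the error if none is open
capture log close
* Start a plain-text log, overwriting an older file of the same name
log using "record.log", replace text
sysuse auto, clear
summarize price mpg
* Stop recording
log close
```

The replace option avoids the error Stata gives when the log file already exists, and text produces a plain file that opens directly in Word.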

BASICS

In Stata anything beginning with a * is a comment line and ignored by the computer.

There are various ways of loading data. If it is a Stata data file then the use command is the one to use:

use "C:\eurob\eurob591A.dta"

This loads a data file in Stata format. Stata is particular about its use of brackets and quotation marks, so it may be best to block copy the above and adapt it. But perhaps the best way is to click on File in the Stata menu (top left), then Open, select the data file, then block copy the statement that appears on the screen and put it into your batch program.

infile aidpcgni growth grdyo gdpculc colony trend id gdppc gdpcuus gdpconus lhaiddy lhgrdy emerg aidtotus area inf earthq floodetc faminedr gendisas pop popden m2 conuscon govuscon oecdgdp totdis lhaconus lhaconlc invsh impsh expsh using "http://staff.bath.ac.uk/hssjrh/stat5a.raw"

This reads the data when it is in ASCII format, i.e. raw data which you could read into Word or Excel. When you get to the end of a line, carry on typing: Stata is able to read the command no matter how long it is, so do not press return until the command has been finished. You first type infile, then the variable names (the variables are in columns, i.e. the data are stored observation by observation), then where the data file is to be found with ‘using’.

But again perhaps the best way is to click on File in the Stata menu (top left), then Import, and follow the instructions, then block copy the statement that finally appears on the screen and put it into your batch program.

iis id
tis trend1

The above two commands identify the structure of the data. id takes a value of 1 for the first country, 2 for the second and so on. They are used with panel data, as the computer needs to be able to identify to which country and time period each observation belongs. The iis statement does the first of these tasks; the tis statement does the second. You always need to specify tis whenever using time series data. trend1 is a simple time trend. As this is panel data, it is repeated for every [in this case] country.
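In more recent versions of Stata the single command xtset declares both the panel identifier and the time variable, and iis and tis are kept only for older programs. A minimal sketch follows, using the nlswork example data set, which is not part of these notes and is downloaded by webuse (so it needs an internet connection). For the aid data the equivalent would presumably be xtset id trend1.

```stata
* Example panel data from the Stata website (assumes internet access)
webuse nlswork, clear
* Declare the panel identifier (idcode) and the time variable (year) in one step
xtset idcode year
* Summarise the pattern of observations per panel unit
xtdescribe
```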

TRANSFORMING DATA

generate

The basic statement is the generate command. This creates a new variable.

generate lage=log(age)
generate agesq=age*age
generate educmale=educ*male

The first generates a variable called lage, the natural log of age; the second, agesq, is age squared; the third creates an interaction variable equal to the product of two variables, educ and male.

generate old=age>50
generate manual=occup==4

These generate two dummy variables. The first [old] takes a value of 1 if age is greater than 50 and 0 otherwise. The second [manual] takes a value of 1 if another variable, occup, equals 4 [note the use of ==]. occup could be a variable giving occupations, e.g. 1 for office worker, 2 for professional and 4 for manual worker.
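One point to watch with dummies built from inequalities is that Stata treats a missing value as larger than any number, so a condition such as age>50 is true when age is missing. The sketch below illustrates this with the auto example data shipped with Stata; the variable names highrep and highrep2 are made up for the illustration.

```stata
sysuse auto, clear
* rep78 is missing for a few cars; missing counts as larger than any number,
* so those cars are coded 1 here
generate highrep = rep78 > 3
* Safer: leave the dummy missing wherever rep78 is missing
generate highrep2 = rep78 > 3 if !missing(rep78)
* Compare the two versions, showing missing values as a separate category
tabulate highrep highrep2, missing
```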

Below would generate a variable manprof which equals 1 if the individual was coded either 1 or 2. The | stands for ‘or’.

generate manprof=occup==1 | occup==2

replace

This is used when you want to change an existing variable. For example:

generate age1=age<35
replace age1=2 if age>34 & age<60
replace age1=3 if age>59

age1 takes a value of 1 if age is less than 35, 2 if in middle age [35 to 59] and 3 if age is greater than 59. Note the use of the conditional if statement. See later for more on this.

generate x1=climchange==1
generate x2=climchange==2
generate x3=climchange==3
replace climchange=x1+x3*2+x2*3

climchange is a variable originally coded 1 if the person thinks climate change is important, 2 if it is not and 3 if they do not know. The above recodes the variable to run from important [1], through don’t know [2], to not important [3].

generate missy=aidpcgni==-999 | gdppc == -999 | id!=id[_n-1] | aidtotus ==-999 | gdpcuus ==-999 | gdpcuus[_n-1] ==-999

The above generates a variable ‘missy’ which takes a value of 1 if aidpcgni=-999 OR [that’s what the | does] gdppc=-999 etc. -999 is commonly used in data sets as a missing value code, i.e. the data is not available. I will later specify that actions, e.g. regressions, are based on a data set which excludes missing observations. id!=id[_n-1] means the value of id [the country identifier] is not equal [!=] to id in the previous observation in the data set. If id and id[_n-1] are not equal then the current observation is the first observation for country 6 [e.g.] and id[_n-1] relates to the final observation for country 5. Taking lagged values, which I use in the regressions, is then not valid, hence the first observation of each country needs to be excluded. NOTE the way we define lagged values.
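An alternative to flagging -999 by hand is to convert the code into Stata’s own missing value, after which most commands drop those observations automatically. A minimal sketch on a tiny made-up data set (the variable aid and its values are purely illustrative):

```stata
clear
input id year aid
1 1 5
1 2 -999
2 1 7
2 2 8
end
* Turn the -999 code into Stata's system missing value (.)
mvdecode aid, mv(-999)
list
* summarize now uses only the three non-missing observations
summarize aid
```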

Further Examples of data generating statements

generate disast=max(floodetc,earthq,faminedr,gendisas)

disast equals the biggest of the variables specified, i.e. max stands for maximum.

generate disast1=disast[_n-1] if id==id[_n-1]
generate ldisast1=0
replace ldisast1=log(disast1) if disast1>0

This creates ldisast1 as the log of the lagged disaster variable disast1 if positive; if not then it remains as zero. Note that ldisast1 must already exist before replace can be used, which is why it is first generated as zero.

generate totdis02 =totdis+totdis[_n-1]+totdis[_n-2] if id==id[_n-2]

This creates the sum of disasters in the current and previous two years, if the previous two observations relate to the same country.

generate aidpcyal1=aidpcya[_n-1] if id==id[_n-2]

This creates a lagged variable if the observation two periods prior to the current one is for the same country. For all observations where id!=id[_n-2] we have a missing value code.

generate lhaiddypl1=lhaiddyp[_n-1] if id==id[_n-1]

This creates the lag if the previous observation is for the same country [note the use of ==].

generate lgdprat=log(gdpconus/gdpconus[_n-1]) if gdpconus>0 & id==id[_n-1]

This variable is created only if both gdpconus>0 and id==id[_n-1] are true. NOTE the use of &.
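The lag constructions above can also be written with the by prefix or with the lag operator, both of which stop a lag from reaching back into the previous country. This is a minimal sketch on a small made-up panel; the names x, x_l1 and x_l1b are illustrative only.

```stata
clear
input id year x
1 1 10
1 2 12
1 3 15
2 1 20
2 2 18
2 3 25
end
* Within each id (sorted by year), _n-1 is the previous row of the same id,
* so the first row of every id gets a missing lag automatically
bysort id (year): generate x_l1 = x[_n-1]
* The same lag using the time-series operator, after declaring the panel
xtset id year
generate x_l1b = L.x
list, sepby(id)
```

Both versions give a missing value for the first year of each id, which is what the condition id==id[_n-1] achieves in the statements above.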

REGRESSIONS

OLS

regress invsh trend aidpcya lhaiddyp lhaiddyn lgdppcl2 asia samerica ssa disast1 lpopden

This regresses (by OLS) the variable ‘invsh’ on the right hand side variables: trend aidpcya lhaiddyp lhaiddyn lgdppcl2 asia samerica ssa disast1 lpopden.

The command below does an OLS regression in which ‘robust’ applies White’s correction, so that the standard errors and t statistics are robust to heteroscedasticity. It is an option: you do not have to use it. It can be used with several types of estimator, not just OLS.

regress invsh trend aidpcya lhaiddyp lhaiddyn lhaiddypl1 lgdppcl2 asia samerica ssa disast1 lpopden wgrowth,robust
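To see what the option changes, the same regression can be run with and without it. A minimal sketch using the auto example data shipped with Stata; vce(robust) is the longer spelling of the robust option.

```stata
sysuse auto, clear
* Conventional OLS standard errors
regress price mpg weight
* Same coefficients, heteroscedasticity-robust (White) standard errors
regress price mpg weight, vce(robust)
```

The coefficients are identical in the two runs; only the standard errors, t statistics and confidence intervals differ.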

PROBIT

probit foreign price gpm displacement

foreign is coded 1 or 0 and this does a probit regression of it on the right hand side variables.

PANEL DATA

Fixed Effects

xtreg invsh trend aidpcya lhaiddyp lhaiddyn lgdppcl2 asia samerica ssa disast1 lpopden wgrowth if missy1in==0,fe

This does a panel data regression with fixed effects. xtreg indicates that panel data techniques are to be used. The dependent variable is the first, invsh, and the right hand side variables follow. The regression is done for the sample where missy1in equals 0, and it is done with fixed effects [fe].

Random Effects

xtreg invsh trend aidpcya lhaiddyp lhaiddyn lgdppcl2 asia samerica ssa disast1 lpopden wgrowth if missy1in==0,re

MAXIMUM LIKELIHOOD

You can set up likelihood functions in many contexts. Below is an example, but we could more easily just use the probit command. Where this is important is when you want to set up a likelihood function which is not available as a preset command.

$$L = \prod_{y_i=1} \Phi(X_i\beta) \prod_{y_i=0} \left[ 1 - \Phi(X_i\beta) \right]$$

Above is the likelihood function for the probit model taken from the notes. Below we have a subroutine, a program within a program. It is called myprobit. Once it has been written it can be ‘called’ at any point within the program, any number of times. I am not fully certain how it does it, but it creates the log likelihood function for probit. $ML_y1 is the dependent variable. In xb, x is the set of right hand side variables and b the coefficient vector, which is chosen to maximise the likelihood. normal() calculates the standard normal cdf (cumulative distribution function). Because the normal distribution is symmetric, 1-normal(-`xb') equals $\Phi(X_i\beta)$ in the likelihood function above, and normal(-`xb') equals $1 - \Phi(X_i\beta)$. The two ‘if’ conditions ensure that the first replace statement only applies to those observations where the dependent variable is coded 1 [and 1-normal(-`xb') represents the probability of that outcome]. The second replace statement only applies to those coded 0.
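For reference, the quantity the subroutine fills in observation by observation is the log of the likelihood above. A short derivation, using only the symmetry of the standard normal distribution:

```latex
\ln L = \sum_{y_i=1} \ln \Phi(X_i\beta) + \sum_{y_i=0} \ln\left[ 1 - \Phi(X_i\beta) \right],
\qquad
\Phi(-z) = 1 - \Phi(z)
\;\Rightarrow\;
\begin{cases}
1 - \Phi(-X_i\beta) = \Phi(X_i\beta) & \text{(used when } y_i = 1\text{)} \\
\Phi(-X_i\beta) = 1 - \Phi(X_i\beta) & \text{(used when } y_i = 0\text{)}
\end{cases}
```

Each replace statement in the subroutine therefore writes one term of the first or the second sum.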

program define myprobit
    args lnf xb
    *summ $ML_y1
    quietly replace `lnf'=ln(1-normal(-`xb')) if $ML_y1==1
    quietly replace `lnf'=ln(normal(-`xb')) if $ML_y1==0
end

Now with respect to the commands below: sysuse auto loads a data set stored on the Stata system as an example file; gen is short for generate; ml model lf myprobit (foreign = price gpm displacement) defines the structure of the likelihood function; and ml maximize tells Stata to go ahead and maximise it.

sysuse auto
gen gpm=1/mpg
ml model lf myprobit (foreign = price gpm displacement)
ml maximize

. program define myprobit
. args lnf xb
. *summ $ML_y1
. quietly replace `lnf'=ln(1-normal(-`xb')) if $ML_y1==1
. quietly replace `lnf'=ln(normal(-`xb')) if $ML_y1==0
. end

. sysuse auto
(1978 Automobile Data)

. gen gpm=1/mpg

. ml model lf myprobit (foreign = price gpm displacement)

. ml maximize
initial:      log likelihood = -51.292891
alternative:  log likelihood = -45.055272
rescale:      log likelihood = -45.055272
Iteration 0:  log likelihood = -45.055272
Iteration 1:  log likelihood = -18.335586
Iteration 2:  log likelihood = -13.545628
Iteration 3:  log likelihood = -11.816519
Iteration 4:  log likelihood = -11.579872
Iteration 5:  log likelihood = -11.569982
Iteration 6:  log likelihood = -11.569946
Iteration 7:  log likelihood = -11.569946

                                  Number of obs = 74
                                  Wald chi2(3)  = 7.42
Log likelihood = -11.569946       Prob > chi2   = 0.0597

foreign        Coef.       Std. Err.   z       P>|z|   [95% Conf. Interval]
price          .0008069    .0003465    2.33    0.020   .0001278     .0014861
gpm            103.3053    64.80528    1.59    0.111   -23.71066    230.3214
displacement   -.0777907   .0317469    -2.45   0.014   -.1400134    -.015568
_cons          .815499     1.689588    0.48    0.629   -2.496033    4.127032

Compare the results with simply using probit foreign price gpm displacement

. probit foreign price gpm displacement

Iteration 0:  log likelihood = -45.03321
Iteration 1:  log likelihood = -22.071496
Iteration 2:  log likelihood = -16.555118
Iteration 3:  log likelihood = -13.876723
Iteration 4:  log likelihood = -12.396166
Iteration 5:  log likelihood = -11.741893
Iteration 6:  log likelihood = -11.586421
Iteration 7:  log likelihood = -11.570237
Iteration 8:  log likelihood = -11.569947
Iteration 9:  log likelihood = -11.569946

Probit regression                 Number of obs = 74
                                  LR chi2(3)    = 66.93
                                  Prob > chi2   = 0.0000
Log likelihood = -11.569946       Pseudo R2     = 0.7431

foreign        Coef.       Std. Err.   z       P>|z|   [95% Conf. Interval]
price          .0008069    .0003465    2.33    0.020   .0001278     .0014861
gpm            103.3053    64.80527    1.59    0.111   -23.71065    230.3213
displacement   -.0777907   .0317469    -2.45   0.014   -.1400134    -.015568
_cons          .815499     1.689588    0.48    0.629   -2.496033    4.127031

They are the same.

TESTS There are tests available in the menu. Go to statistics on the menu, then postestimation, then tests and you will see a range of tests. Below we do some ‘by hand’.

RAMSEY RESET

The sequence of commands below does the Ramsey RESET test. It runs the regression, obtains predicted values for the left hand side variable [pprice, from ‘predict pprice, xb’], and then generates the square and the cube of these predicted values [ppricesq and ppricecu]. These are added to the original regression, and test is used to test their joint significance. An alternative route to the predicted values is to obtain the residuals with ‘predict res, res’ and subtract them from the dependent variable.

sysuse auto
regress price mpg weight length foreign
predict pprice, xb
generate ppricesq=pprice*pprice
generate ppricecu=pprice^3
regress price mpg weight length foreign ppricesq ppricecu
test ppricesq ppricecu

. regress price mpg weight length foreign ppricesq ppricecu

Source     SS          df   MS             Number of obs = 74
                                           F( 6, 67)     = 30.53
Model      465004722   6    77500787       Prob > F      = 0.0000
Residual   170060674   67   2538219.02     R-squared     = 0.7322
                                           Adj R-squared = 0.7082
Total      635065396   73   8699525.97     Root MSE      = 1593.2

price      Coef.       Std. Err.   t       P>|t|   [95% Conf. Interval]
mpg        -2.718599   58.5539     -0.05   0.963   -119.5927    114.1555
weight     -14.92438   6.247209    -2.39   0.020   -27.39386    -2.454904
length     266.6463    106.462     2.50    0.015   54.14728     479.1453
foreign    -7575.76    3740.546    -2.03   0.047   -15041.92    -109.5982
ppricesq   .00037      .0001856    1.99    0.050   -4.37e-07    .0007404
ppricecu   -9.52e-09   9.74e-09    -0.98   0.332   -2.90e-08    9.92e-09
_cons      -9307.403   6234.479    -1.49   0.140   -21751.48    3136.67

. test ppricesq ppricecu

( 1) ppricesq = 0
( 2) ppricecu = 0
Constraint 2 dropped

F( 1, 67) = 3.97
Prob > F = 0.0503

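This is significant at the 10% level, but not the 5% level. Hence at the 5% level the regression ‘passes’ the Ramsey RESET test. Note the use of test to test the joint significance of two variables. Note also that here Stata dropped the second constraint, so the reported F statistic has only one numerator degree of freedom.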
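Stata also has a built-in version of this test, run as a postestimation command after regress. A minimal sketch is below. By default it uses its own set of powers of the fitted values, so its F statistic need not match the ‘by hand’ version above.

```stata
sysuse auto, clear
regress price mpg weight length foreign
* Built-in Ramsey RESET test based on powers of the fitted values
estat ovtest
```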

HAUSMAN TEST

Fixed vs Random effects

The sequence of commands below does the Hausman test for random vs fixed effects. I did a search on the web for hausman fixed and stata and came up with a number of references including: http://www.stata.com/help.cgi?hausman Look it up: it helps explain the Hausman test and perhaps gives greater clarity to what we did in the notes.

xtreg invsh trend aidpcya lhaiddyp lhaiddyn lgdppcl2 disasl12 lpopden wgrowth if missy2in==0,fe
est store fixed
xtreg invsh trend aidpcya lhaiddyp lhaiddyn lgdppcl2 disasl12 lpopden wgrowth if missy2in==0,re
xttest0
hausman fixed
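Since the aid data set is not publicly available, the same sequence can be tried on an example panel. This is a minimal sketch using the nlswork data downloaded by webuse (an assumption: it needs internet access, and the regressors are chosen only for illustration).

```stata
webuse nlswork, clear
xtset idcode year
* Fixed effects estimates, stored under the name fixed
xtreg ln_wage age ttl_exp tenure, fe
estimates store fixed
* Random effects estimates, stored under the name random
xtreg ln_wage age ttl_exp tenure, re
estimates store random
* Hausman test: the consistent estimator (fixed) is listed first
hausman fixed random
```

A small p-value rejects the hypothesis that the random effects estimates are consistent, which favours fixed effects.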

DIAGRAMS

EXAMPLES

Block copy the following and run in Stata:

use http://www.ats.ucla.edu/stat/stata/webbooks/reg/elemapi2
regress api00 acs_k3 meals full
histogram acs_k3
graph matrix api00 acs_k3 meals full, half
twoway (scatter api00 enroll) (lfit api00 enroll)

and compare with: http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter1/statareg1.htm

LOOPS

The data are 23 years on roughly 100 countries. rcorrupt ranks countries in each period in terms of corrupt. The important point is that you can use loops.

generate rcorrupt=0
forvalues i1 = 1(1)23 {
    scalar i6=`i1'
    egen x1 =rank(corrupt) if trend==i6 & corrupt>-0.01
    replace rcorrupt=x1 if trend==i6 & corrupt>-0.01
    count if trend==i6 & corrupt>-0.01
    replace rcorrupt =rcorrupt/r(N)*100 if trend==i6 & corrupt>-0.01
    drop x1
}

The above loops over the values 1 to 23 [there are 23 years in the data]. The first time through the loop `i1', and thus i6, equals 1, the second time 2 and so on. x1 is the rank of corrupt in period i6 for those countries where corrupt is defined [>-0.01]. rcorrupt is then set so that for each period it equals the ranking of corrupt. By default egen’s rank() gives a rank of 1 to the lowest value, so the country with the highest value of corrupt in period 4 receives the highest rank for that period. Because there are different numbers of valid countries in different periods, the loop normalises this variable so that its maximum is 100 in every period: count if trend==i6 & corrupt>-0.01 counts the number of observations for which the condition trend==i6 & corrupt>-0.01 is satisfied. This is stored as r(N).
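The same kind of within-period ranking can be done without a loop by using the by prefix with egen. This is a minimal sketch on the auto example data, ranking price within each value of foreign; the names prank, ngroup and ppct are made up for the illustration.

```stata
sysuse auto, clear
* Rank price within each group of foreign (1 = lowest price in the group)
bysort foreign: egen prank = rank(price)
* Number of non-missing prices in each group
bysort foreign: egen ngroup = count(price)
* Rescale so that the top-ranked car in each group scores 100
generate ppct = prank / ngroup * 100
summarize ppct
```

For the corruption data the analogous by variable would be trend, with the condition corrupt>-0.01 kept as an if qualifier.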

The Stata help describes egen as follows. egen creates newvar of the optionally specified storage type equal to fcn(arguments). Here fcn() is a function specifically written for egen, either supplied with Stata or written by users. Only egen functions may be used with egen, and conversely, only egen may be used to run egen functions. Depending on fcn(), arguments, if present, refers to an expression, a varlist, or a numlist, and the options are similarly fcn dependent. Explicit subscripting (using _N and _n), which is commonly used with generate, should not be used with egen.

Below is another loop, which calculates base year GDP for each of 177 countries.

generate basegdppc=-9
forvalues i1 = 1(1)177 {
    scalar i6=`i1'
    scalar i3=23
    scalar sx1= -9
    while sx1<0 & i3>0 {
        scalar i3=i3-1
        scalar k3= `i1' * 23-i3
        scalar sx1= gdppc[k3]
    }
    quietly replace basegdppc =sx1 if country==i6
}

EXAMPLE PROGRAM WITH GRAPHS; TABLES ETC

Go to this web address: http://staff.bath.ac.uk/hssjrh/pubservs.dta When it asks you what you want to do, choose ‘save’ and save the file somewhere on your computer or flash drive. Then open Stata and go to File on the menu (top left corner), then ‘Open’, and load the file in the browser. It will tell you that you do not have enough room, but it should have given you a use statement something like:

use "C:\courses\econometrics\pubservs.dta", clear

Block copy it and put it after set mem 100000 in the program below. Then run the program.

clear
set mem 100000
use "C:\courses\econometrics\pubservs.dta", clear
*
generate country = belgium+denmark*2+(wgermany+ egermany)*3+ greece*4 +spain*5 + france*6 + ireland*7 + italy*8 +luxembourg*9 + netherlands*10 +austria*11 +portugal*12 +sweden*13 + (gbritain+nireland)*14 +cyprus*15 +czech*16 +estonia*17 +hungary*18 +latvia*19 +lithuania*20 +malta*21 +poland*22 +slovakia*23 +slovenia*24 +bulgaria*25 +romania*26
generate finland=0
replace finland=1 if country==0
replace country=country+finland*27
generate str countryn="belgium"
replace countryn="denmark" if country==2
replace countryn="germany" if country==3
replace countryn="greece" if country==4
replace countryn="spain" if country==5
replace countryn="finland" if country==27
replace countryn="france" if country==6
replace countryn="ireland" if country==7
replace countryn="italy" if country==8
replace countryn="luxembourg" if country==9
replace countryn="netherlands" if country==10
replace countryn="austria" if country==11
replace countryn="portugal" if country==12
replace countryn="sweden" if country==13
replace countryn="uk" if country==14
replace countryn="cyprus" if country==15
replace countryn="Czech R" if country==16
replace countryn="estonia" if country==17
replace countryn="hungary" if country==18
replace countryn="latvia" if country==19
replace countryn="lithuania" if country==20
replace countryn="malta" if country==21
replace countryn="poland" if country==22
replace countryn="slovakia" if country==23
replace countryn="slovenia" if country==24
replace countryn="bulgaria" if country==25
replace countryn="romania" if country==26
tabstat hlthqual hlthaccess if hlthaccess<5 & hlthqual<5,stats(mean max min var sk kurtosis)
summ hlthaccess hlthqual if hlthaccess<5 & hlthqual<5
bysort male: summ hlthaccess hlthqual if hlthaccess<5 & hlthqual<5
bysort manual: summ hlthaccess hlthqual if hlthaccess<5 & hlthqual<5
bysort marrd: summ hlthaccess hlthqual if hlthaccess<5 & hlthqual<5
bysort unemp: summ hlthaccess hlthqual if hlthaccess<5 & hlthqual<5
set matsize 800
*ACCESS
*
generate missy1=educ<5 & age<98 & satfin<6 & satnghb<6 & satnghb<6 & satsocl<6
oprobit hlthaccess male age agesq village town marrd educ unemp manual satfin satnghb satsocl unemp manual trust hlpothers relimp volwimp meetfr natdev goodhlth belgium denmark wgermany egermany greece spain france ireland italy luxembourg netherlands austria portugal sweden gbritain nireland cyprus czech estonia hungary latvia lithuania malta poland slovakia slovenia bulgaria romania if hlthaccess<5 & missy1==1
est store HEACC
*
*QUALITY
oprobit hlthqual male age agesq village town marrd educ unemp manual satfin satnghb satsocl unemp manual trust hlpothers relimp volwimp meetfr natdev goodhlth belgium denmark wgermany egermany greece spain france ireland italy luxembourg netherlands austria portugal sweden gbritain nireland cyprus czech estonia hungary latvia lithuania malta poland slovakia slovenia bulgaria romania if hlthqual<5 & missy1==1
est store HEQUA

*EDUCATION SYSTEMS
generate Explanation =eduqual
histogram Explanation, discrete percent ylabel(none, labsize(minuscule)) xtitle(, box bexpand) xscale(range(.1 .1)) xlabel(none, labsize(minuscule)) legend(off) ysize(20) xsize(20) fysize(200) fxsize(200) by(countryn,iscale(*1.5) title(Figure 1: Distribution of Education Quality Perceptions, size(medsmall)) note(Histogram from approve to disapprove as we move left to right, position(6)) legend(off))
oprobit eduaccess male age agesq village town marrd educ unemp manual satfin satnghb satsocl unemp manual trust hlpothers relimp volwimp meetfr natdev goodhlth belgium denmark wgermany egermany greece spain france ireland italy luxembourg netherlands austria portugal sweden gbritain nireland cyprus czech estonia hungary latvia lithuania malta poland slovakia slovenia bulgaria romania if eduaccess<5 & missy1==1
est store EEACC
*
*QUALITY
oprobit eduqual male age agesq village town marrd educ unemp manual satfin satnghb satsocl unemp manual trust hlpothers relimp volwimp meetfr natdev goodhlth belgium denmark wgermany egermany greece spain france ireland italy luxembourg netherlands austria portugal sweden gbritain nireland cyprus czech estonia hungary latvia lithuania malta poland slovakia slovenia bulgaria romania if eduqual<5 & missy1==1
est store EEQUA
oprobit trainaccess male age agesq village town marrd educ unemp manual satfin satnghb satsocl unemp manual trust hlpothers relimp volwimp meetfr natdev goodhlth belgium denmark wgermany egermany greece spain france ireland italy luxembourg netherlands austria portugal sweden gbritain nireland cyprus czech estonia hungary latvia lithuania malta poland slovakia slovenia bulgaria romania if trainaccess<5 & missy1==1
est store TRACC
oprobit trainqual male age agesq village town marrd educ unemp manual satfin satnghb satsocl unemp manual trust hlpothers relimp volwimp meetfr natdev goodhlth belgium denmark wgermany egermany greece spain france ireland italy luxembourg netherlands austria portugal sweden gbritain nireland cyprus czech estonia hungary latvia lithuania malta poland slovakia slovenia bulgaria romania if trainqual<5 & missy1==1
est store TRQUA

*TABLES
**log using "C:\eurob\pubservs1.log",replace
est table HEACC HEQUA EEACC EEQUA TRACC TRQUA, b(%6.4g) varwidth(10) stats(N ll chi2 df_m aic) t style(noline) equations(1) stfmt(%7.4g)
*log close

NOTES ON ABOVE

generate str countryn="belgium"

generates a string variable, i.e. one containing words not numbers.

set matsize 800

sets the matrix size large enough to handle a big regression. Perhaps not necessary in this case.

oprobit hlthaccess male age agesq village town marrd educ unemp manual satfin satnghb satsocl unemp manual trust hlpothers relimp volwimp meetfr natdev goodhlth belgium denmark wgermany egermany greece spain france ireland italy luxembourg netherlands austria portugal sweden gbritain nireland cyprus czech estonia hungary latvia lithuania malta poland slovakia slovenia bulgaria romania if hlthaccess<5 & missy1==1
est store HEACC

est store HEACC stores the estimated results from the previous regression in a ‘box’ called HEACC [you can call ‘the box’ anything you like].

est table HEACC HEQUA EEACC EEQUA TRACC TRQUA, b(%6.4g) varwidth(10) stats(N ll chi2 df_m aic) t style(noline) equations(1) stfmt(%7.4g)

The above prints out the stored regressions side by side in a neat table with t statistics.
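The store-then-tabulate pattern can be tried on a much smaller scale with the auto example data. In this minimal sketch the names small and large are arbitrary labels for the two ‘boxes’.

```stata
sysuse auto, clear
* Run two regressions quietly and store each set of results under a name
quietly regress price mpg weight
estimates store small
quietly regress price mpg weight length foreign
estimates store large
* Coefficients with t statistics beneath them, plus N, log likelihood and AIC
estimates table small large, b(%9.4g) t stats(N ll aic)
```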