
Data Analysis with Stata Cheat Sheet
For more info, see Stata's reference manual (stata.com)

Results are stored as either r-class or e-class. See Programming Cheat Sheet.
Examples use auto.dta (sysuse auto, clear) unless otherwise noted.

Declare data
By declaring data type, you enable Stata to apply data munging and analysis functions specific to certain data types.

TIME SERIES (webuse sunspot, clear)
tsset time, yearly
    declare sunspot data to be yearly time series
tsreport
    report time-series aspects of a dataset
generate lag_spot = L1.spot
    create a new variable of annual lags of sunspots
tsline spot
    plot time series of sunspots (tsline plot)
arima spot, ar(1/2)
    fit an autoregressive model with 2 lags

TIME-SERIES OPERATORS
L.     lag: x_{t-1}
L2.    2-period lag: x_{t-2}
F.     lead: x_{t+1}
F2.    2-period lead: x_{t+2}
D.     difference: x_t - x_{t-1}
D2.    difference of difference: (x_t - x_{t-1}) - (x_{t-1} - x_{t-2})
S.     seasonal difference: x_t - x_{t-1}
S2.    lag-2 (seasonal difference): x_t - x_{t-2}

USEFUL ADD-INS
tscollap
    compact time series into means, sums, and end-of-period values
carryforward
    carry nonmissing values forward from one obs. to the next
tsspell
    identify spells or runs in time series

PANEL / LONGITUDINAL (webuse nlswork, clear)
xtset id year
    declare national longitudinal data to be a panel
xtdescribe
    report panel aspects of a dataset
xtsum hours
    summarize hours worked, decomposing standard deviation into between and within components
xtline ln_wage if id <= 22, tlabel(#3)
    plot panel data as a line plot (xtline plot: wage relative to inflation, one panel per id)
xtreg ln_w c.age##c.age ttl_exp, fe vce(robust)
    fit a fixed-effects model with robust standard errors

SURVEY DATA (webuse nhanes2b, clear)
svyset psuid [pweight = finalwgt], strata(stratid)
    declare survey design for a dataset
svydescribe
    report survey-data details
svy: mean age, over(sex)
    estimate a population mean for each subpopulation
svy, subpop(rural): mean age
    estimate a population mean for rural areas
svy: tabulate sex heartatk
    report two-way table with tests of independence
svy: reg zinc c.age##c.age female weight rural
    estimate a regression using survey weights

SURVIVAL ANALYSIS (webuse drugtr, clear)
stset studytime, failure(died)
    declare data to be survival-time data
stsum
    summarize survival-time data
stcox drug age
    fit a Cox proportional hazards model

Summarize data
univar price mpg, boxplot          (ssc install univar)
    calculate univariate summary with box-and-whiskers plot
stem mpg
    return stem-and-leaf display of mpg
summarize price mpg, detail
    calculate a variety of univariate summary statistics
ci mean mpg price, level(99)       (for Stata 13: ci mpg price, level(99))
    compute standard errors and confidence intervals
correlate mpg price
    return correlation or covariance matrix
pwcorr price mpg weight, star(0.05)
    return all pairwise correlation coefficients with sig. levels
mean price mpg
    estimates of means, including standard errors
proportion rep78 foreign
    estimates of proportions, including standard errors for categories identified in varlist
ratio price/mpg
    estimates of ratio, including standard errors
total price
    estimates of totals, including standard errors

Statistical tests
tabulate foreign rep78, chi2 exact expected
    tabulate foreign and repair record and return chi2 and Fisher's exact statistic alongside the expected values
ttest mpg, by(foreign)
    estimate t test on equality of means for mpg by foreign
prtest foreign == 0.5
    one-sample test of proportions
ksmirnov mpg, by(foreign) exact
    Kolmogorov–Smirnov equality-of-distributions test
ranksum mpg, by(foreign)
    equality tests on unmatched data (independent samples)
anova systolic drug                (webuse systolic, clear)
    analysis of variance and covariance
pwmean mpg, over(rep78) pveffects mcompare(tukey)
    estimate pairwise comparisons of means with equal variances; include multiple comparison adjustment

Estimation with categorical & factor variables
more details at https://www.stata.com/manuals/u26.pdf

CONTINUOUS VARIABLES    measure something
CATEGORICAL VARIABLES   identify a group to which an observation belongs
INDICATOR VARIABLES     denote whether something is true or false (T/F)

OPERATOR, DESCRIPTION, EXAMPLE
i.      specify indicators
        regress price i.rep78
        specify rep78 variable to be an indicator variable
ib.     specify base indicator
        regress price ib(#3).rep78
        set the third category of rep78 to be the base category
fvset   command to change base
        fvset base frequent rep78
        set the base to most frequently occurring category for rep78
c.      treat variable as continuous
        regress price i.foreign#c.mpg i.foreign
        treat mpg as a continuous variable and specify an interaction between foreign and mpg
o.      omit a variable or indicator
        regress price io(2).rep78
        set rep78 as an indicator; omit the indicator for rep78 == 2
#       specify interactions
        regress price mpg c.mpg#c.mpg
        create a squared mpg term to be used in regression
##      specify factorial interactions
        regress price c.mpg##c.mpg
        create all possible interactions with mpg (mpg and mpg^2)

1 Fit models (stores results as e-class)
regress price mpg weight, vce(robust)
    fit an ordinary least-squares (OLS) model of price on mpg and weight, with robust standard errors
regress price mpg weight if foreign == 0, vce(cluster rep78)
    regress price only on domestic cars, cluster standard errors
rreg price mpg weight, genwt(reg_wt)
    estimate robust regression to eliminate outliers
probit foreign turn price, vce(robust)
    estimate probit regression with robust standard errors
logit foreign headroom mpg, or
    estimate logistic regression and report odds ratios
bootstrap, reps(100): regress mpg weight gear foreign
    estimate regression with bootstrapping
jackknife r(mean): sum mpg
    jackknife standard error of sample mean

ADDITIONAL MODELS (some built into Stata, some user-written)
pca                  principal components analysis
factor               factor analysis
poisson • nbreg      count outcomes
tobit                censored data
ivregress • ivreg2   instrumental variables (ssc install ivreg2)
didregress           difference-in-difference
rd                   regression discontinuity
xtabond • xtdpdsys   dynamic panel estimator
teffects psmatch     propensity score matching
synth                synthetic control analysis
oaxaca               Blinder–Oaxaca decomposition

2 Diagnostics (some are inappropriate with robust SEs)
estat hettest
    test for heteroskedasticity
estat ovtest
    test for omitted-variable bias
estat vif
    report variance inflation factor
dfbeta length
    calculate measure of influence
rvfplot, yline(0)
    plot residuals against fitted values
avplots
    plot all partial-regression leverage plots in one graph
Type help regress postestimation plots for additional diagnostic plots.

3 Postestimation (commands that use a fitted model)
regress price headroom length
    used in all postestimation examples
display _b[length]
display _se[length]
    return coefficient estimate or standard error for length from most recent regression model
margins, dydx(length)
    return the estimated marginal effect for length (returns e-class information when post option is used)
margins, eyex(length)
    return the estimated elasticity for length
predict yhat if e(sample)
    create predictions for sample on which model was fit
predict double resid, residuals
    calculate residuals based on last fitted model
test headroom = 0
    test linear hypotheses that headroom estimate equals zero
lincom headroom - length
    estimate linear combination (headroom - length)
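As a hedged illustration of how the factor-variable operators and the postestimation commands fit together, the sketch below fits a quadratic in mpg on the auto data and then reads back the stored results. The model itself is only an example chosen for this sketch; it is not one of the sheet's own regressions.

```stata
* Sketch: quadratic term via factor-variable notation, then postestimation
sysuse auto, clear
regress price c.mpg##c.mpg

* _b[] and _se[] read the stored coefficient and its standard error
display "linear term:    " _b[mpg] " (se " _se[mpg] ")"
display "quadratic term: " _b[c.mpg#c.mpg]

* joint test that both mpg terms are zero
test mpg c.mpg#c.mpg

* average marginal effect of mpg, accounting for the squared term
margins, dydx(mpg)
```

Writing the square as c.mpg#c.mpg rather than generating a separate mpgSq variable lets margins know the two terms move together, which is why the marginal effect comes out as a single number for mpg.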

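The time-series operators (L., F., D., S.) can be checked by hand on a tiny dataset. The four-row series below is made up purely for illustration, and the variable names are assumptions of this sketch.

```stata
* Sketch: time-series operators on a made-up yearly series
clear
input year x
2001 10
2002 13
2003 19
2004 20
end
tsset year, yearly

generate lag1   = L.x     // x[t-1]
generate lead1  = F.x     // x[t+1]
generate diff1  = D.x     // x[t] - x[t-1]
generate diff2  = D2.x    // (x[t] - x[t-1]) - (x[t-1] - x[t-2])
generate sdiff2 = S2.x    // x[t] - x[t-2]
list, noobs
```

The operators only work after tsset (or xtset) has declared the time variable; rows with no earlier or later period come back as missing.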
Tim Essam ([email protected]) • Laura Hughes ([email protected]) inspired by RStudio's awesome Cheat Sheets (rstudio.com/resources/cheatsheets) geocenter.github.io/StataTraining updated May 2021 follow us @StataRGIS and @flaneuseks Disclaimer: we are not affiliated with Stata. But we like it. CC BY 4.0

Programming with Stata Cheat Sheet
For more info, see Stata's reference manual (stata.com)

Building blocks: basic components of programming

R- AND E-CLASS: Stata stores calculation results in two* main classes:
r    return results from general commands such as summarize or tabulate
e    return results from estimation commands such as regress or mean
* there's also s- and n-class

To assign values to individual variables use:
1 SCALARS (r, e)   individual numbers or strings
2 MATRICES (e)     rectangular array of quantities or expressions
3 MACROS (e)       pointers that store text (global or local)

1 Scalars (both r- and e-class results contain scalars)
scalar x1 = 3
    create a scalar x1 storing the number 3
scalar a1 = "I am a string scalar"
    create a scalar a1 storing a string
Scalars can hold numeric values or arbitrarily long strings.

2 Matrices (e-class results are stored as matrices)
matrix a = (4\ 5\ 6)
    create a 3 x 1 matrix
matrix b = (7, 8, 9)
    create a 1 x 3 matrix
matrix d = b'
    transpose matrix b; store in d
matrix ad1 = a \ d
    row bind matrices
matrix ad2 = a , d
    column bind matrices
matselrc b x, c(1 3)               (search matselrc)
    select columns 1 & 3 of matrix b & store in new matrix x
mat2txt, matrix(ad1) saving(textfile.txt) replace     (ssc install mat2txt)
    export a matrix to a text file

DISPLAYING & DELETING BUILDING BLOCKS
[scalar | matrix | macro | estimates] [list | drop] b
    list contents of object b or drop (delete) object b
[scalar | matrix | macro | estimates] dir
    list all defined objects for that class
matrix list b
    list contents of matrix b
matrix dir
    list all matrices
scalar drop x1
    delete scalar x1

3 Macros (public or private variables storing text)

GLOBALS (PUBLIC): available through Stata sessions
global pathdata "C:/Users/SantasLittleHelper/Stata"
    define a global variable called pathdata
cd $pathdata
    change working directory by calling global macro (add a $ before calling a global macro)
global myGlobal price mpg length
summarize $myGlobal
    summarize price mpg length using global

LOCALS (PRIVATE): available only in programs, loops, or do-files
local myLocal price mpg length
    create local variable called myLocal with the strings price, mpg, and length
summarize `myLocal'
    summarize contents of local myLocal (add a ` before and a ' after local macro name to call)
levelsof rep78, local(levels)
    create a sorted list of distinct values of rep78, store results in a local macro called levels
local varLab: variable label foreign
    store the variable label for foreign in the local varLab (can also do with value labels)

TEMPVARS & TEMPFILES: special locals for loops/programs
tempvar temp1
    initialize a new temporary variable called temp1
generate `temp1' = mpg^2
    save squared mpg values in temp1
summarize `temp1'
    summarize the temporary variable temp1
tempfile myAuto
save `myAuto'
    create a temporary file to be used within a program
see also tempname

4 Access & save stored r- and e-class objects
Many Stata commands store results in types of lists. To access these, use return or ereturn commands. Stored results can be scalars, macros, matrices, or functions.

r-class:
summarize price, detail
return list
    returns a list of scalars:
    r(N) = 74
    r(mean) = 6165.25...
    r(Var) = 86995225.97...
    r(sd) = 2949.49...
    ...
e-class:
mean price
ereturn list
    returns list of scalars, macros, matrices, and functions; scalars:
    e(df_r) = 73
    e(N_over) = 1
    e(N) = 74
    e(k_eq) = 1
    e(rank) = 1

Results are replaced each time an r-class / e-class command is called.

generate p_mean = r(mean)
    create a new variable equal to average of price
generate meanN = e(N)
    create a new variable equal to obs. in estimation command
preserve
    create a temporary copy of active dataframe
restore
    restore temporary copy to point last preserved
Set restore points to test code that changes data.

ACCESSING ESTIMATION RESULTS
After you run any estimation command, the results of the estimates are stored in a structure that you can save, view, compare, and export. Use estimates store to compile results.
regress price weight
estimates store est1
    store previous estimation results est1 in memory for later use
eststo est2: regress price weight mpg                 (ssc install estout)
eststo est3: regress price weight mpg foreign
    fit two regression models and store estimation results
estimates table est1 est2 est3
    print a table of the three estimation results est1, est2, and est3

EXPORTING RESULTS
The estout and outreg2 packages provide numerous flexible options for making tables after estimation commands. See also putexcel and putdocx commands.
esttab est1 est2, se star(* 0.10 ** 0.05 *** 0.01) label
    create summary table with standard errors and labels
esttab using "auto_reg.txt", replace plain se
    export summary table to a text file, include standard errors
outreg2 [est1 est2] using "auto_reg2.txt", see replace
    export summary table to a text file using outreg2 syntax

Loops: Automate repetitive tasks

ANATOMY OF A LOOP (see also while)
Stata has three options for repeating commands over lists or values: foreach, forvalues, and while. Though each has a different first line, the syntax is consistent:

    foreach x of varlist var1 var2 var3 {
        command `x', option
    }

    x                    temporary variable used only within the loop
    var1 var2 var3       objects to repeat over
    {                    open brace must appear on first line
    command `x', option  command(s) you want to repeat; can be one line or many; requires local macro notation
    }                    close brace must appear on final line by itself

FOREACH: REPEAT COMMANDS OVER STRINGS, LISTS, OR VARIABLES

    foreach x in|of [ local, global, varlist, newlist, numlist ] {
        Stata commands referring to `x'
    }

    list types: objects over which the commands will be repeated
    Loops repeat the same command over different arguments.

STRINGS
    foreach x in auto.dta auto2.dta {
        sysuse "`x'", clear
        tab rep78, missing
    }
same as...
    sysuse "auto.dta", clear
    tab rep78, missing
    sysuse "auto2.dta", clear
    tab rep78, missing

LISTS
    foreach x in "Dr. Nick" "Dr. Hibbert" {
        display length("`x'")
    }
same as...
    display length("Dr. Nick")
    display length("Dr. Hibbert")
When calling a command that takes a string, surround the macro name with quotes.

VARIABLES
    foreach x in mpg weight {
        summarize `x'
    }
    foreach in takes any list as an argument with elements separated by spaces

    foreach x of varlist mpg weight {
        summarize `x'
    }
    must define list type; foreach of requires you to state the list type, which makes it faster
same as...
    summarize mpg
    summarize weight

FORVALUES: REPEAT COMMANDS OVER LISTS OF NUMBERS
    forvalues i = 10(10)50 {
        display `i'
    }
    i is the iterator; 10(10)50 gives the numeric values over which the loop will run
    Use the display command to show the iterator value at each step in the loop.
same as...
    display 10
    display 20
    ...

ITERATORS
    i = 10/50          10, 11, 12, ...
    i = 10(10)50       10, 20, 30, ...
    i = 10 20 to 50    10, 20, 30, ...

DEBUGGING CODE
set trace on (off)
    trace the execution of programs for error checking
see also capture and scalar _rc

PUTTING IT ALL TOGETHER
    sysuse auto, clear
    generate car_make = word(make, 1)       // pull out the first word from the make variable
    levelsof car_make, local(cmake)         // calculate unique groups of car_make and store in local cmake
    local i = 1                             // define the local i to be an iterator
    local cmake_len : word count `cmake'    // store the length of local cmake in local cmake_len
    foreach x of local cmake {
        display in yellow "Make group `i' is `x'"
        if `i' == `cmake_len' {             // tests the position of the iterator, executes contents in brackets when the condition is true
            display "The total number of groups is `i'"
        }
        local i = `++i'                     // increment iterator by one
    }

Additional programming resources
bit.ly/statacode
    download all examples from this cheat sheet in a do-file
ado update
    update user-written ado-files
adolist                            (ssc install adolist)
    list/copy user-written ado-files
net install package, from(https://raw.githubusercontent.com/username/repo/master)
    install a package from a GitHub repository
https://github.com/andrewheiss/SublimeStataEnhanced
    configure Sublime text for Stata 11–15

Data Processing with Stata Cheat Sheet
For more info, see Stata's reference manual (stata.com)

Basic syntax
All Stata commands have the same format (syntax):

    [by varlist1:] command [varlist2] [=exp] [if exp] [in range] [weight] [using filename] [, options]

    by varlist1:      apply the command across each unique combination of variables in varlist1
    command           function: what are you going to do to varlists?
    varlist2          column to apply command to
    =exp              save output as a new variable
    if exp            condition: only apply the function if something is true
    in range          apply the function to specific rows
    weight            apply weights
    using filename    pull data from a file (if not loaded)
    , options         special options for command
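A minimal sketch of how the loop constructs from the Programming sheet wrap a command written in the basic syntax. The variable list and the price cutoffs are arbitrary choices for illustration, not values taken from the sheet.

```stata
sysuse auto, clear

* foreach over a varlist; `v' is the local macro holding the current variable
foreach v of varlist price mpg weight {
    * [by varlist1:] command [varlist2] [if exp] [, options]
    bysort foreign: summarize `v' if !missing(rep78), detail
}

* forvalues over a numeric range; r(N) is replaced by each count
forvalues cutoff = 4000(2000)8000 {
    quietly count if price > `cutoff'
    display "cars with price > `cutoff': " r(N)
}
```

Because r(N) is overwritten by every r-class command, it is read on the line directly after count; copying it into a local first would be the safer habit in longer loops.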
Example: a detailed summary with stats like kurtosis, plus mean and median

    bysort rep78 : summarize price if foreign == 0 & price <= 9000, detail

To find out more about any command (like what options it takes), type help command.

Useful shortcuts
F2          describe data
Ctrl + 8    open the data editor
Ctrl + 9    open a new do-file
Ctrl + D    highlight text in do-file, then ctrl + d executes it in the command line
clear       delete data in memory

AT COMMAND PROMPT
PgUp PgDn   scroll through previous commands
Tab         autocompletes variable name after typing part
cls         clear the console (where results are displayed)

Set up
pwd
    print current (working) directory
cd "C:\Program Files\Stata16"
    change working directory
dir
    display filenames in working directory
dir *.dta
    list all Stata data in working directory
capture log close
    close the log on any existing do-files (command names can be abbreviated: use "capture" or "cap")
log using "myDoFile.txt", replace
    create a new log file to record your work and results
search mdesc
    find the package mdesc to install (packages contain extra commands that expand Stata's toolkit)
ssc install mdesc
    install the package mdesc; needs to be done once

Import data
sysuse auto, clear
    load system data (auto data); for many examples, we use the auto dataset
use "yourStataFile.dta", clear
    load a dataset from the current directory
import excel "yourSpreadsheet.xlsx", sheet("Sheet1") cellrange(A2:H11) firstrow
import delimited "yourFile.csv", rowrange(2:11) colrange(1:8) varnames(2)
import sas "yourSASfile.sas7bdat", bcat("value labels file")
import spss "yourSPSSfile.sav"
    see help import for more options
webuse set "https://github.com/GeoCenter/StataTraining/raw/master/Day2/Data"
webuse "wb_indicators_long"
    set web-based directory and load data from the web

Basic data operations

Arithmetic
+    add (numbers), combine (strings)
-    subtract
*    multiply
/    divide
^    raise to a power

Logic
=            assigns a value to a variable
==           tests if something is equal
&            and
|            or
! or ~       not
!= or ~=     not equal
<            less than
<=           less than or equal to
>            greater than
>=           greater or equal to

Example rows and which of them each condition selects:
    make            foreign   price     if foreign != 1 & price >= 10000   if foreign != 1 | price >= 10000
    Chevy Colt      0         3,984     no                                 yes
    Buick Riviera   0         10,372    yes                                yes
    Honda Civic     1         4,499     no                                 no
    Volvo 260       1         11,995    no                                 yes

Explore data

VIEW DATA ORGANIZATION
describe make price
    display variable type, format, and any value/variable labels
count
count if price > 5000
    number of rows (observations); can be combined with logic
ds, has(type string)
lookfor "in."
    search for variable types, variable name, or variable label
isid mpg
    check if mpg uniquely identifies the data

SEE DATA DISTRIBUTION
codebook make price
    overview of variable type, stats, number of missing/unique values
summarize make price mpg
    print summary statistics (mean, stdev, min, max) for variables
inspect mpg
    show histogram of data and number of missing or zero observations
histogram mpg, frequency
    plot a histogram of the distribution of a variable

BROWSE OBSERVATIONS WITHIN THE DATA
browse (or Ctrl + 8)
    open the data editor
list make price if price > 10000 & !missing(price)
clist ... (compact form)
    list the make and price for observations with price > $10,000
    Missing values are treated as the largest positive number. To exclude missing values, ask whether the value is less than "."
display price[4]
    display the 4th observation in price; only works on single values
gsort price mpg        (ascending)
gsort -price -mpg      (descending)
    sort in order, first by price then miles per gallon
duplicates report
    finds all duplicate values in each variable
levelsof rep78
    display the unique values for rep78
assert price != .
    verify truth of claim

Change data types
Stata has six data types, and data can also be missing:
    missing                   no data
    byte                      true/false
    string                    words
    int, long, float, double  numbers

To convert between numbers & strings:
gen foreignString = string(foreign)            1 -> "1"
tostring foreign, gen(foreignString)           1 -> "1"
decode foreign, gen(foreignString)             1 -> "foreign"
gen foreignNumeric = real(foreignString)       "1" -> 1
destring foreignString, gen(foreignNumeric)    "1" -> 1
encode foreignString, gen(foreignNumeric)      "foreign" -> 1
recast double mpg
    generic way to convert between types

Summarize data
tabulate rep78, mi gen(repairRecord)
    one-way table: number of rows with each value of rep78; mi includes missing values; gen() creates a binary variable for every rep78 value in a new variable, repairRecord
tabulate rep78 foreign, mi
    two-way table: cross-tabulate number of observations for each combination of rep78 and foreign
bysort rep78: tabulate foreign
    for each value of rep78, apply the command tabulate foreign
tabstat price weight mpg, by(foreign) stat(mean sd n)
    create compact table of summary statistics
table foreign, statistic(mean price) nformat(%9.2f)
    create a flexible table of summary statistics; nformat() formats numbers
collapse (mean) price (max) mpg, by(foreign)
    calculate mean price & max mpg by car type (foreign); replaces data

Create new variables
generate mpgSq = mpg^2
gen byte lowPr = price < 4000
    create a new variable. Useful also for creating binary variables based on a condition (generate byte)
generate id = _n
bysort rep78: gen repairIdx = _n
    _n creates a running index of observations in a group
generate totRows = _N
bysort rep78: gen repairTot = _N
    _N creates a running count of the total observations per group
pctile mpgQuartile = mpg, nq(4)
    create quartiles of the mpg data
egen meanPrice = mean(price), by(foreign)
    calculate mean price for each group in foreign; see help egen for more options
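The note that missing values are treated as the largest positive number is easy to overlook, so here is a small sketch on the auto data (where rep78 has missing values) together with the _n and _N group counters. The local macro names are assumptions made for this example.

```stata
sysuse auto, clear

* "." sorts above every number, so a plain > comparison also picks up missings
count if rep78 > 4
local with_missing = r(N)
count if rep78 > 4 & !missing(rep78)    // same as: rep78 > 4 & rep78 < .
local without_missing = r(N)
display `with_missing' - `without_missing' " rows were counted only because rep78 is missing"

* _n = position within the by-group, _N = size of the by-group
bysort rep78: generate repairIdx = _n
bysort rep78: generate repairTot = _N
list rep78 repairIdx repairTot in 1/5
```

Either !missing(x) or x < . guards the comparison; the first reads more clearly, the second is the form the sheet describes.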

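As a rough sketch of the number/string conversions in the Change data types section, the block below converts foreign both ways. The new variable names are made up for this example; the point worth noticing is that encode assigns its own integer codes rather than restoring the original 0/1 values.

```stata
sysuse auto, clear

tostring foreign, generate(foreignString)          // digits as text: "0", "1"
decode   foreign, generate(foreignLabel)           // value labels as text

destring foreignString, generate(foreignNumeric)   // "1" -> 1
encode   foreignLabel,  generate(foreignCoded)     // label text -> labelled integer codes 1, 2, ...

list foreign foreignString foreignLabel foreignNumeric foreignCoded in 1/3, nolabel
```

So destring is the inverse of tostring, while encode followed by a comparison against the original variable needs care because the codes start at 1 in alphabetical order of the labels.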
Data Transformation with Stata Cheat Sheet
For more info, see Stata's reference manual (stata.com)

Select parts of data (subsetting)

SELECT SPECIFIC COLUMNS
drop make
    remove the 'make' variable
keep make price
    opposite of drop; keep only variables 'make' and 'price'

FILTER SPECIFIC ROWS
drop if mpg < 20
drop in 1/4
    drop observations based on a condition (first) or rows 1–4 (second)
keep in 1/30
    opposite of drop; keep only rows 1–30
keep if inrange(price, 5000, 10000)
    keep values of price between $5,000–$10,000 (inclusive)
keep if inlist(make, "Honda Accord", "Honda Civic", "Subaru")
    keep the specified values of make
sample 25
    sample 25% of the observations in the dataset (use set seed # command for reproducible sampling)

Replace parts of data

CHANGE COLUMN NAMES
rename (rep78 foreign) (repairRecord carType)
    rename one or multiple variables

CHANGE ROW VALUES
replace price = 5000 if price < 5000
    replace all values of price that are less than $5,000 with 5000
recode price (0 / 5000 = 5000)
    change all prices less than 5000 to be $5,000
recode foreign (0 = 2 "US")(1 = 1 "Not US"), gen(foreign2)
    change the values and value labels then store in a new variable, foreign2

REPLACE MISSING VALUES
mvdecode _all, mv(9999)
    replace the number 9999 with missing value in all variables (useful for cleaning survey datasets)
mvencode _all, mv(9999)
    replace missing values with the number 9999 for all variables (useful for exporting data)

Label data
Value labels map string descriptions to numbers. They allow the underlying data to be numeric (making logical tests simpler) while also connecting the values to human-understandable text.
label define myLabel 0 "US" 1 "Not US"
label values foreign myLabel
    define a label and apply it to the values in foreign
label list
    list all labels within the dataset
note: data note here
    place note in dataset

Reshape data
webuse set https://github.com/GeoCenter/StataTraining/raw/master/Day2/Data
webuse "coffeeMaize.dta"
    load demo dataset

MELT DATA (WIDE → LONG)
reshape long coffee@ maize@, i(country) j(year)
    convert a wide dataset to long
    coffee@ maize@   reshape variables starting with coffee and maize
    i(country)       unique id variable (key)
    j(year)          create new variable that captures the info in the column names

CAST DATA (LONG → WIDE)
reshape wide coffee maize, i(country) j(year)
    convert a long dataset to wide
    coffee maize     create new variables named coffee2011, maize2012...
    i(country)       what will be unique id variable (key)
    j(year)          create new variables with the year added to the column name

WIDE
    country   coffee2011   coffee2012   maize2011   maize2012
    Malawi
    Rwanda
    Uganda

LONG (TIDY)
    country   year   coffee   maize
    Malawi    2011
    Malawi    2012
    Rwanda    2011
    Rwanda    2012
    Uganda    2011
    Uganda    2012

TIDY DATASETS have each observation in its own row and each variable in its own column. When datasets are tidy, they have a consistent, standard format that is easier to manipulate and analyze.

xpose, clear varname
    transpose rows and columns of data, clearing the data and saving old column names as a new variable called "_varname"

Combine data

ADDING (APPENDING) NEW DATA (see help frames for using multiple datasets)
webuse coffeeMaize2.dta, clear
save coffeeMaize2.dta, replace
webuse coffeeMaize.dta, clear
    load demo data
append using "coffeeMaize2.dta", gen(filenum)
    add observations from "coffeeMaize2.dta" to current data and create variable "filenum" to track the origin of each observation
    The two datasets should contain the same variables (columns).

MERGING TWO DATASETS TOGETHER (must contain a common variable (id))

ONE-TO-ONE
webuse ind_age.dta, clear
save ind_age.dta, replace
webuse ind_ag.dta, clear
merge 1:1 id using "ind_age.dta"
    one-to-one merge of "ind_age.dta" into the loaded dataset and create variable "_merge" to track the origin

MANY-TO-ONE
webuse hh2.dta, clear
save hh2.dta, replace
webuse ind2.dta, clear
merge m:1 hid using "hh2.dta"
    many-to-one merge of "hh2.dta" into the loaded dataset and create variable "_merge" to track the origin

_merge code
    1 (master)   row only in ind2
    2 (using)    row only in hh2
    3 (match)    row in both

FUZZY MATCHING: COMBINING TWO DATASETS WITHOUT A COMMON ID
reclink                            (ssc install reclink)
    match records from different data sets using probabilistic matching
jarowinkler                        (ssc install jarowinkler)
    create distance measure for similarity between two strings

Manipulate strings

GET STRING PROPERTIES
display length("This string has 29 characters")
    return the length of the string
charlist make                      (user-defined package)
    display the set of unique characters within a string
display strpos("Stata", "a")
    return the position in Stata where a is first found

FIND MATCHING STRINGS
display strmatch("123.89", "1??.?9")
    return true (1) or false (0) if string matches pattern
display substr("Stata", 3, 5)
    return string of 5 characters starting with position 3
list make if regexm(make, "[0-9]")
    list observations where make matches the regular expression (here records that contain a number)
list if regexm(make, "(Cad.|Chev.|Datsun)")
    return all observations where make contains "Cad.", "Chev." or "Datsun"
list if inlist(word(make, 1), "Cad.", "Chev.", "Datsun")
    compare the given list against the first word in make; return all observations where the first word of the make variable contains the listed words

TRANSFORM STRINGS
display regexr("My string", "My", "Your")
    replace string1 ("My") with string2 ("Your")
replace make = subinstr(make, "Cad.", "Cadillac", 1)
    replace first occurrence of "Cad." with Cadillac in the make variable
display stritrim(" Too much Space")
    replace consecutive spaces with a single space
display trim(" leading / trailing spaces ")
    remove extra spaces before and after a string
display strlower("STATA should not be ALL-CAPS")
    change string case; see also strupper, strproper
display strtoname("1Var name")
    convert string to Stata-compatible variable name
display real("100")
    convert string to a numeric or missing value

Save & export data
compress
    compress data in memory
save "myData.dta", replace
saveold "myData.dta", replace version(12)      (Stata 12-compatible file)
    save data in Stata format, replacing the data if a file with same name exists
export excel "myData.xls", firstrow(variables) replace
    export data as an Excel file (.xls) with the variable names as the first row
export delimited "myData.csv", delimiter(",") replace
    export data as a comma-delimited file (.csv)
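The reshape commands can be tried without downloading the demo file. The sketch below types in a small wide dataset shaped like coffeeMaize.dta; the numbers are invented for illustration and are not the values in the real demo data.

```stata
* Sketch: wide -> long -> wide on a made-up dataset
clear
input str6 country coffee2011 coffee2012 maize2011 maize2012
"Malawi" 1 2 10 20
"Rwanda" 3 4 30 40
"Uganda" 5 6 50 60
end

* melt: one row per country-year; the year comes out of the column names
reshape long coffee@ maize@, i(country) j(year)
list, sepby(country)

* cast: back to one row per country, with the year appended to each stub
reshape wide coffee maize, i(country) j(year)
list
```

The i() variable must uniquely identify rows in the wide layout, and i() plus j() together must do so in the long layout; reshape stops with an error otherwise.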

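To see what the three _merge codes look like in practice, here is a minimal many-to-one merge on two small made-up datasets. The household and individual records, and the tempfile name households, are assumptions of this sketch rather than the contents of hh2.dta or ind2.dta.

```stata
* using data: one row per household (hid is unique here)
clear
input hid str5 region
1 "North"
2 "South"
4 "East"
end
tempfile households
save `households'

* master data: several individuals can share a household
clear
input id hid age
1 1 34
2 1 36
3 2 51
4 3 27
end

merge m:1 hid using `households'
tabulate _merge
```

Household 3 has no match in the using data (_merge == 1) and household 4 has no individuals (_merge == 2); everything else matches (_merge == 3). Checking this table before dropping or keeping rows is a cheap guard against silent mismatches.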
Tim Essam ([email protected]) • Laura Hughes ([email protected]) inspired by RStudio’s awesome Cheat Sheets (rstudio.com/resources/cheatsheets) geocenter.github.io/StataTraining updated May 2021 follow us @StataRGIS and @flaneuseks Disclaimer: we are not affiliated with Stata. But we like it. CC BY 4.0 BASIC PLOT SYNTAX: variables: y first plot-specific options facet annotations Data Visualization graph y1 y2 … yn x [in] [if], by(var) xline(xint) yline(yint) text(y x "annotation") with Stata Cheat Sheet titles axes For more info, see Stata’s reference manual (stata.com) title("title") subtitle("subtitle") xtitle("x-axis title") ytitle("y axis title") xscale(range(low high) log reverse off noline) yscale()

ONE VARIABLE sysuse auto, clear custom appearance plot size save scheme(s1mono) play(customTheme) xsize(5) ysize(4) saving("myPlot.gph", replace) CONTINUOUS histogram mpg, width(5) freq kdensity kdenopts(bwidth(5)) TWO+ CONTINUOUS VARIABLES histogram

    options: bin(#) • width(#) • density • fraction • frequency • percent • addlabels • addlabopts() • normal • normopts() • kdensity • kdenopts()

Option lists show the main plot-specific options; see help for the complete set.

kdensity mpg, bwidth(3)
    smoothed histogram
    options: bwidth • kernel() • normal • normopts()

DISCRETE

graph bar (count), over(foreign, gap(*0.5)) intensity(*0.5)
    bar plot (graph hbar draws horizontal bar charts)
    options: (asis) • (percent) • (count) • over() • cw • missing • nofill • allcategories • percentages • stack • bargap(#) • intensity(*#) • yalternate • xalternate

graph bar (percent), over(rep78) over(foreign)
    grouped bar plot (graph hbar ... for horizontal bars)
    options: (asis) • (percent) • (count) • over() • cw • missing • nofill • allcategories • percentages • stack • bargap(#) • intensity(*#) • yalternate • xalternate

DISCRETE X, CONTINUOUS Y

graph bar (median) price, over(foreign)
    bar plot (graph hbar ... for horizontal bars)
    options: (asis) • (percent) • (count) • (stat: mean median sum min max ...) • over() • cw • missing • nofill • allcategories • percentages • stack • bargap(#) • intensity(*#) • yalternate • xalternate

graph dot (mean) length headroom, over(foreign) m(1, ms(S))
    dot plot
    options: (asis) • (percent) • (count) • (stat: mean median sum min max ...) • over() • cw • missing • nofill • allcategories • percentages • linegap(#) • marker(#, ...) • linetype(dot | line | rectangle) • dots() • lines() • rectangles() • rwidth

graph hbox mpg, over(rep78, descending) by(foreign) missing
    box plot (graph box draws vertical boxplots)
    options: over() • missing • allcategories • intensity(*#) • boxgap(#) • medtype(line | cline | marker) • medline() • medmarker()

vioplot price, over(foreign)
    violin plot; requires: ssc install vioplot
    options: over() • nofill • vertical • horizontal • obs • kernel() • bwidth(#) • barwidth(#) • dscale(#) • ygap(#) • ogap(#) • density() • bar() • median() • obsopts()

TWO+ CONTINUOUS VARIABLES

graph matrix mpg price weight, half
    scatterplot of each combination of variables
    options: half • jitter(#) • jitterseed(#) • diagonal • [aweights()]

twoway scatter mpg weight, jitter(7)
    scatterplot
    options: jitter(#) • jitterseed(#) • sort • cmissing(yes | no) • connect() • [aweight()]

scatterplot with labelled values (twoway scatter with mlabel())
    options: jitter(#) • jitterseed(#) • sort • cmissing(yes | no) • connect() • [aweight()]

twoway connected mpg price, sort(price)
    scatterplot with connected lines and symbols (see also line)
    options: jitter(#) • jitterseed(#) • sort • connect() • cmissing(yes | no)

twoway area mpg price, sort(price)
    area plot
    options: sort • cmissing(yes | no) • vertical • horizontal • base(#)

bar plot (twoway bar)
    options: vertical • horizontal • base(#) • barwidth(#)

twoway dot mpg rep78
    dot plot
    options: dcolor() • dfcolor() • dlcolor() • dsize() • dsymbol() • dlwidth() • dotextend(yes | no)

twoway dropline mpg price in 1/5
    dropped line plot
    options: vertical • horizontal • base(#)

twoway pcspike wage68 ttl_exp68 wage88 ttl_exp88
    parallel coordinates plot (sysuse nlswide1)
    options: vertical • horizontal

twoway pccapsym wage68 ttl_exp68 wage88 ttl_exp88
    slope/bump plot (sysuse nlswide1)
    options: vertical • horizontal • headlabel

THREE VARIABLES

3D contour plot (twoway contour)
    options: ccuts(#s) • levels(#) • minmax • crule(hue | chue | intensity | linear) • scolor() • ecolor() • ccolors() • heatmap • interp(thinplatespline | shepard | none)

regress price mpg trunk weight length turn, nocons
matrix regmat = e(V)
plotmatrix, mat(regmat) color(green)
    heatmap; requires: ssc install plotmatrix
    options: mat() • color() • freq

SUMMARY PLOTS

twoway mband mpg weight || scatter mpg weight
    plot median of the y values
    options: bands(#)

plot a single value (mean or median) for each x value (binscatter, from SSC)
    options: medians • nquantiles(#) • discrete • controls() • linetype(lfit | qfit | connect | none) • [aweight]

FITTING RESULTS

calculate and plot linear fit to data with confidence intervals (twoway lfitci)
    options: level(#) • stdp • stdf • nofit • fitplot() • ciplot() • range(# #) • n(#) • atobs • estopts() • predopts()

twoway lowess mpg weight || scatter mpg weight
    calculate and plot lowess smoothing
    options: bwidth(#) • mean • noweight • logit • adjust
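The overlay pattern used by the fitting entries can be sketched as follows. This is a minimal example on auto.dta; the parenthesized form is equivalent to chaining plots with ||, and bwidth(0.5) is an illustrative value rather than a recommendation.

```stata
sysuse auto, clear

* linear fit with its confidence band, scatter drawn on top
twoway (lfitci mpg weight) (scatter mpg weight)

* lowess smoother over the same scatter; bwidth(0.5) is an illustrative bandwidth
twoway (lowess mpg weight, bwidth(0.5)) (scatter mpg weight)
```

Listing the fit first keeps the shaded band behind the points, since plots are drawn in the order they are given.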

FITTING RESULTS (continued)

twoway qfitci mpg weight, alwidth(none) || scatter mpg weight
    calculate and plot quadratic fit to data with confidence intervals
    options: level(#) • stdp • stdf • nofit • fitplot() • ciplot() • range(# #) • n(#) • atobs • estopts() • predopts()

REGRESSION RESULTS

regress price mpg headroom trunk length turn
coefplot, drop(_cons) xline(0)
    plot regression coefficients; requires: ssc install coefplot
    options: baselevels • b() • at() • noci • levels(#) • keep() • drop() • rename() • horizontal • vertical • generate()

regress mpg weight length turn
margins, eyex(weight) at(weight = (1800(200)4800))
marginsplot, noci
    plot marginal effects of regression
    options: horizontal • noci

TWO+ CONTINUOUS VARIABLES (range plots)

twoway rcapsym length headroom price
    range plot (y1 ÷ y2) with capped lines (see also rcap)
    options: vertical • horizontal

twoway rarea length headroom price, sort
    range plot (y1 ÷ y2) with area shading
    options: vertical • horizontal • sort • cmissing(yes | no)

twoway rbar length headroom price
    range plot (y1 ÷ y2) with bars
    options: vertical • horizontal • barwidth(#) • mwidth • msize()

Plot placement

JUXTAPOSE (FACET)

twoway scatter mpg price, by(foreign, norescale)
    options: total • missing • colfirst • rows(#) • cols(#) • holes() • compact • [no]edgelabel • [no]rescale • [no]yrescale • [no]xrescale • [no]iyaxes • [no]ixaxes • [no]iytick • [no]ixtick • [no]iylabel • [no]ixlabel • [no]iytitle • [no]ixtitle • imargin()

SUPERIMPOSE

graph combine plot1.gph plot2.gph...
    combine two or more saved graphs into a single plot

scatter y3 y2 y1 x, msymbol(i o i) mlabel(var3 var2 var1)
    plot several y values for a single x value

graph twoway scatter mpg price in 27/74 || scatter mpg price if mpg < 15 & price > 12000 in 27/74, mlabel(make) m(i)
    combine twoway plots using ||

Laura Hughes • Tim Essam
inspired by RStudio’s awesome Cheat Sheets (rstudio.com/resources/cheatsheets) • geocenter.github.io/StataTraining • updated May 2021
follow us @flaneuseks and @StataRGIS
Disclaimer: we are not affiliated with Stata. But we like it.
CC BY 4.0

Plotting in Stata: Customizing Appearance
For more info, see Stata’s reference manual (stata.com)

ANATOMY OF A PLOT

plots contain many features
[Figure: annotated scatterplot with numbered callouts for title, subtitle, annotation, marker, marker label, line, y-line, y-axis, y-axis title, y-axis labels, x-axis, x-axis title, tick marks, grid lines, legend, graph region (outer and inner), plot region (outer and inner)]

scatter price mpg, graphregion(fcolor("192 192 192") ifcolor("208 208 208"))
    specify the fill of the background in RGB or with a Stata color

scatter price mpg, plotregion(fcolor("224 224 224") ifcolor("240 240 240"))
    specify the fill of the plot background in RGB or with a Stata color

Apply themes

Schemes are sets of graphical parameters, so you don’t have to specify the look of the graphs every time.

USING A SAVED THEME

twoway scatter mpg price, scheme(customTheme)

Create custom themes by saving options in a .scheme file.

help scheme entries
    see all options for setting scheme properties

adopath ++ "~//StataThemes"
    set path of the folder (StataThemes) where custom .scheme files are saved

set scheme customTheme, permanently
    change the theme; permanently sets it as the default scheme
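A minimal sketch of switching schemes, assuming the s1color and economist schemes that ship with Stata stand in for customTheme (no custom .scheme file is created here):

```stata
sysuse auto, clear

* apply a scheme to a single graph (s1color stands in for customTheme)
twoway scatter mpg price, scheme(s1color)

* change the scheme for the rest of the session
set scheme economist
twoway scatter mpg price

* list the schemes Stata can find on the adopath
graph query, schemes
```

Without the permanently option, set scheme lasts only for the current session, which makes it a low-risk way to preview a theme before adopting it as the default.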

net inst brewscheme, from("https://wbuchanan.github.io/brewscheme/") replace
    install William Buchanan’s package to generate custom schemes and color palettes (including ColorBrewer)

USING THE GRAPH EDITOR

twoway scatter mpg price, play(graphEditorTheme)
    Select the Graph Editor
    Click Record
    Double-click on symbols and areas on plot, or regions on sidebar, to customize
    Unclick Record
    Save theme as a .grec file

SYNTAX

arguments for the plot objects go in the options portion of these commands
for example: scatter price mpg, xline(20, lwidth(vthick))

SYMBOLS
    marker          <marker options>
LINES / BORDERS
    line            <line options> • xline(...) • yline(...)
    axes            xscale(...) • yscale(...)
    tick marks      xlabel(...) • ylabel(...)
    legend          legend(region(...))
TEXT
    marker label    <marker options>
    titles          title(...) • subtitle(...) • xtitle(...) • ytitle(...)
    axis labels     xlabel(...) • ylabel(...)
    annotation      text(...)
    legend          legend(...)

COLOR

SYMBOLS
    mcolor("145 168 208") • mcolor(none)
        specify the fill and stroke of the marker in RGB or with a Stata color
    mfcolor("145 168 208") • mfcolor(none)
        specify the fill of the marker
    mcolor("145 168 208 %20")
        adjust transparency by adding %#
LINES / BORDERS
    lcolor("145 168 208") • lcolor(none)
        specify the stroke color of the line or border
    marker          mlcolor("145 168 208")
    tick marks      tlcolor("145 168 208")
    grid lines      glcolor("145 168 208")
TEXT
    color("145 168 208") • color(none)
        specify the color of the text
    marker label    mlabcolor("145 168 208")
    axis labels     labcolor("145 168 208")

SIZE / THICKNESS

SYMBOLS
    msize(medium)
        specify the marker size: ehuge • vhuge • huge • vlarge • large • medlarge • medium • medsmall • small • vsmall • tiny • vtiny
LINES / BORDERS
    lwidth(medthick)
        specify the thickness (stroke) of a line: vvvthick • vvthick • vthick • thick • medthick • medium • medthin • thin • vthin • vvthin • vvvthin • none
    marker          mlwidth(thin)
    tick marks      tlwidth(thin)
    grid lines      glwidth(thin)
TEXT
    size(medsmall)
        specify the size of the text: vhuge • huge • vlarge • large • medlarge • medium • medsmall • small • vsmall • tiny • half_tiny • third_tiny • quarter_tiny • minuscule
    marker label    mlabsize(medsmall)
    axis labels     labsize(medsmall)

APPEARANCE

SYMBOLS
    msymbol(Dh)
        specify the marker symbol: O D T S • o d t s • Oh Dh Th Sh • oh dh th sh • + X p none i
LINES / BORDERS
    line, axes      lpattern(dash)
        specify the line pattern: solid • dash • dot • longdash • shortdash • dash_dot • longdash_dot • shortdash_dot • blank
    grid lines      glpattern(dash)
    axes            noline
    axes            off             no axis/labels
    tick marks      noticks
    tick marks      tlength(2)
    grid lines      nogrid • nogmin • nogmax
TEXT
    marker label    mlabel(foreign)
        label the points with the values of the foreign variable
    axis labels     nolabels
        no axis labels
    axis labels     format(%12.2f)
        change the format of the axis labels
    legend          off
        turn off legend
    legend          label(# "label")
        change legend label text

POSITION

SYMBOLS
    jitter(#)
        randomly displace the markers
    jitterseed(#)
        set seed
LINES / BORDERS
    tick marks      xlabel(#10, tposition(crossing))
        number of tick marks, position (outside | crossing | inside)
TEXT
    marker label    mlabposition(5)
        label location relative to marker (clock position: 0 to 12)

Save plots

graph twoway scatter y x, saving("myPlot.gph", replace)
    save the graph when drawing

graph save "myPlot.gph", replace
    save current graph to disk

graph combine plot1.gph plot2.gph...
    combine two or more saved graphs into a single plot

graph export "myPlot.pdf", as(pdf)
    export the current graph as an image file; see options to set size and resolution
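The save, combine, and export commands can be chained as in the sketch below. The file names scatter1.gph, hist1.gph, and combined.png are hypothetical, and width(2000) is an illustrative pixel width for the PNG.

```stata
sysuse auto, clear

* save while drawing; replace goes inside saving()
twoway scatter mpg weight, saving("scatter1.gph", replace)

* draw a second graph, then save the current graph to disk
histogram mpg
graph save "hist1.gph", replace

* combine the two saved graphs, then export the combined graph as a PNG
graph combine scatter1.gph hist1.gph
graph export "combined.png", width(2000) replace
```

graph export infers the format from the file extension, so as() is only needed when the extension is missing or ambiguous.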