Data Processing
Data Processing with Stata 14.1 Cheat Sheet
For more info see Stata’s reference manual (stata.com)

Basic Syntax

All Stata commands have the same format (syntax):

    [by varlist1:] command [varlist2] [=exp] [if exp] [in range] [weight] [using filename] [,options]

    by varlist1:      apply the command across each unique combination of variables in varlist1
    command           function: what are you going to do to the varlists?
    varlist2          column to apply the command to
    =exp              save output as a new variable
    if exp            condition: only apply the function if something is true
    in range          apply the command to specific rows
    weight            apply weights
    using filename    pull data from a file (if not loaded)
    ,options          special options for the command

In this example, we want a detailed summary with stats like kurtosis, plus mean and median:

    bysort rep78 : summarize price if foreign == 0 & price <= 9000, detail

To find out more about any command, such as what options it takes, type help command.

Useful Shortcuts

    F2          describe data
    Ctrl + 8    open the data editor
    Ctrl + 9    open a new .do file
    Ctrl + D    highlight text in a .do file, then Ctrl + D executes it in the command line
    clear       delete data in memory

  AT COMMAND PROMPT
    PgUp PgDn   scroll through previous commands
    Tab         autocompletes variable name after typing part
    cls         clear the console (where results are displayed)

Set up

    pwd
        print current (working) directory
    cd "C:\Program Files (x86)\Stata13"
        change working directory
    dir
        display filenames in working directory
    fs *.dta
        list all Stata data in working directory
    capture log close
        close the log on any existing do files
        (commands can be abbreviated; the original sheet underlines the shortcut: use "capture" or "cap")
    log using "myDoFile.txt", replace
        create a new log file to record your work and results
    search mdesc
        find the package mdesc to install
    ssc install mdesc
        install the package mdesc; needs to be done once
        (packages contain extra commands that expand Stata’s toolkit)

Import Data

    sysuse auto, clear
        load system data (Auto data); for many examples, we use the auto dataset
    use "yourStataFile.dta", clear
        load a dataset from the current directory
    import excel "yourSpreadsheet.xlsx", /*
        */ sheet("Sheet1") cellrange(A2:H11) firstrow
        import an Excel spreadsheet
    import delimited "yourFile.csv", /*
        */ rowrange(2:11) colrange(1:8) varnames(2)
        import a .csv file
    webuse set "https://github.com/GeoCenter/StataTraining/raw/master/Day2/Data"
    webuse "wb_indicators_long"
        set web-based directory and load data from the web

    (In the original sheet, frequently used commands are highlighted in yellow.)

Basic Data Operations

  Arithmetic
    +    add (numbers)
    +    combine (strings)
    -    subtract
    *    multiply
    /    divide
    ^    raise to a power

  Logic
    &          and
    ! or ~     not
    |          or
    ==         equal
    != or ~=   not equal
    <          less than
    <=         less than or equal to
    >          greater than
    >=         greater or equal to

    ==   tests if something is equal
    =    assigns a value to a variable

  Which rows each condition selects:

    (a) if foreign != 1 & price >= 10000
    (b) if foreign != 1 | price >= 10000

    make            foreign   price     (a)   (b)
    Chevy Colt      0          3,984    no    yes
    Buick Riviera   0         10,372    yes   yes
    Honda Civic     1          4,499    no    no
    Volvo 260       1         11,995    no    yes

Change Data Types

Stata has 6 data types, and data can also be missing:

    no data       missing
    true/false    byte
    words         string
    numbers       int, long, float, double

To convert between numbers & strings:

    gen foreignString = string(foreign)            1 → "1"
    tostring foreign, gen(foreignString)           1 → "1"
    decode foreign , gen(foreignString)            1 → "foreign"
    gen foreignNumeric = real(foreignString)       "1" → 1
    destring foreignString, gen(foreignNumeric)    "1" → 1
    encode foreignString, gen(foreignNumeric)      "foreign" → 1
    recast double mpg
        generic way to convert between types

Explore Data

  VIEW DATA ORGANIZATION
    describe make price
        display variable type, format, and any value/variable labels
    count
    count if price > 5000
        number of rows (observations); can be combined with logic
    ds, has(type string)
    lookfor "in."
        search for variable types, variable name, or variable label
    isid mpg
        check if mpg uniquely identifies the data

  SEE DATA DISTRIBUTION
    codebook make price
        overview of variable type, stats, number of missing/unique values
    summarize make price mpg
        print summary statistics (mean, stdev, min, max) for variables
    inspect mpg
        show histogram of data, number of missing or zero observations
    histogram mpg, frequency
        plot a histogram of the distribution of a variable

  BROWSE OBSERVATIONS WITHIN THE DATA
    browse    or    Ctrl + 8
        open the data editor
    list make price if price > 10000 & !missing(price)
    clist ...    (compact form)
        list the make and price for observations with price > $10,000
        Missing values are treated as the largest positive number. To exclude
        missing values, use the !missing(varname) syntax.
    display price[4]
        display the 4th observation in price; only works on single values
    gsort price mpg      (ascending)
    gsort -price -mpg    (descending)
        sort in order, first by price then miles per gallon
    duplicates report
        finds all duplicate values in each variable
    levelsof rep78
        display the unique values for rep78
    assert price!=.
        verify truth of claim

Summarize Data

    tabulate rep78, mi gen(repairRecord)
        one-way table: number of rows with each value of rep78
        mi                   include missing values
        gen(repairRecord)    create a binary variable for every rep78 value in new variables prefixed repairRecord
    tabulate rep78 foreign, mi
        two-way table: cross-tabulate number of observations for each combination of rep78 and foreign
    bysort rep78: tabulate foreign
        for each value of rep78, apply the command tabulate foreign
    tabstat price weight mpg, by(foreign) stat(mean sd n)
        create compact table of summary statistics
    table foreign, contents(mean price sd price) f(%9.2fc) row
        create a flexible table of summary statistics
        f(%9.2fc)    formats numbers
        row          displays stats for all data
    collapse (mean) price (max) mpg, by(foreign)
        calculate mean price & max mpg by car type (foreign); replaces the data in memory

Create New Variables

    generate mpgSq = mpg^2
    gen byte lowPr = price < 4000
        create a new variable; also useful for creating binary variables based on a condition (generate byte)
    generate id = _n
    bysort rep78: gen repairIdx = _n
        _n creates a running index of observations in a group
    generate totRows = _N
    bysort rep78: gen repairTot = _N
        _N creates a running count of the total observations per group
    pctile mpgQuartile = mpg, nq(4)
        create quartiles of the mpg data
    egen meanPrice = mean(price), by(foreign)
        calculate mean price for each group in foreign; see help egen for more options

Tim Essam • Laura Hughes
inspired by RStudio’s awesome Cheat Sheets (rstudio.com/resources/cheatsheets)
geocenter.github.io/StataTraining
updated January 2016
follow us @StataRGIS and @flaneuseks
Disclaimer: we are not affiliated with Stata. But we like it.
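Several of the Data Processing commands above can be chained into one short do-file. The following is only a minimal sketch, assuming the auto dataset that ships with Stata. The preserve and restore commands do not appear on the sheet; they are added here as an assumption about how one might keep collapse from permanently replacing the data in memory.

```stata
* Minimal sketch: load the built-in auto dataset used throughout the sheet
sysuse auto, clear

* General syntax: [by varlist1:] command [varlist2] [if exp] [, options]
bysort rep78: summarize price if foreign == 0 & price <= 9000, detail

* Missing values sort as larger than any number, so exclude them explicitly
count if price > 5000 & !missing(price)
list make price if price > 10000 & !missing(price)

* New variables: a transformation, a binary flag, a within-group index,
* and a group-level mean
generate mpgSq = mpg^2
generate byte lowPr = price < 4000
bysort rep78: generate repairIdx = _n
egen meanPrice = mean(price), by(foreign)

* collapse replaces the data in memory; preserve/restore (not on the sheet)
* brings the full dataset back afterwards
preserve
collapse (mean) price (max) mpg, by(foreign)
list
restore
```

Running the block from a .do file (highlight it, then Ctrl + D) leaves the original 74-car dataset in memory with the four new variables added.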
CC BY 4.0

Data Transformation with Stata 14.1 Cheat Sheet

Select Parts of Data (Subsetting)

  SELECT SPECIFIC COLUMNS
    drop make
        remove the 'make' variable
    keep make price
        opposite of drop; keep only variables 'make' and 'price'

  FILTER SPECIFIC ROWS
    drop if mpg < 20
    drop in 1/4
        drop observations based on a condition (first form) or rows 1-4 (second form)
    keep in 1/30
        opposite of drop; keep only rows 1-30
    keep if inrange(price, 5000, 10000)
        keep values of price between $5,000 and $10,000 (inclusive)
    keep if inlist(make, "Honda Accord", "Honda Civic", "Subaru")
        keep the specified values of make
    sample 25
        sample 25% of the observations in the dataset
        (use set seed # command for reproducible sampling)

Reshape Data

    webuse set https://github.com/GeoCenter/StataTraining/raw/master/Day2/Data
    webuse "coffeeMaize.dta"
        load demo dataset

  MELT DATA (WIDE → LONG)
    reshape long coffee@ maize@, i(country) j(year)
        convert a wide dataset to long
        coffee@ maize@    reshape variables starting with coffee and maize
        i(country)        unique id variable (key)
        j(year)           create new variable which captures the info in the column names

  CAST DATA (LONG → WIDE)
    reshape wide coffee maize, i(country) j(year)
        convert a long dataset to wide
        coffee maize      create new variables named coffee2011, maize2012...
        i(country)        what will be the unique id variable (key)
        j(year)           create new variables with the year added to the column name

  Layout of the two forms (melt goes from WIDE to LONG, cast from LONG to WIDE):

    WIDE
    country   coffee2011   coffee2012   maize2011   maize2012
    Malawi
    Rwanda
    Uganda

    LONG (TIDY)
    country   year   coffee   maize
    Malawi    2011
    Malawi    2012
    Rwanda    2011
    Rwanda    2012
    Uganda    2011
    Uganda    2012

  TIDY DATASETS have each observation in its own row and each variable in its
  own column. When datasets are tidy, they have a consistent, standard format
  that is easier to manipulate and analyze.

    xpose, clear varname
        transpose rows and columns of data, clearing the data and saving
        old column names as a new variable called "_varname"

Manipulate Strings

  GET STRING PROPERTIES
    display length("This string has 29 characters")
        return the length of the string
    charlist make    (user-defined package)
        display the set of unique characters within a string
    display strpos("Stata", "a")
        return the position in Stata where a is first found

  FIND MATCHING STRINGS
    display strmatch("123.89", "1??.?9")
        return true (1) or false (0) if string matches pattern
    display substr("Stata", 3, 5)
        return up to 5 characters of the string, starting at character 3 (here "ata")
    list make if regexm(make, "[0-9]")
        list observations where make matches the regular expression
        (here, records that contain a number)
    list if regexm(make, "(Cad.|Chev.|Datsun)")
        return all observations where make contains "Cad.", "Chev." or "Datsun"
    list if inlist(word(make, 1), "Cad.", "Chev.", "Datsun")
        compare the given list against the first word in make; return all
        observations where the first word of make is one of the listed words

  TRANSFORM STRINGS
    display regexr("My string", "My", "Your")
        replace string1 ("My") with string2 ("Your")

Combine Data

  ADDING (APPENDING) NEW DATA

Replace Parts of Data

  CHANGE COLUMN NAMES
    rename (rep78 foreign) (repairRecord