Preparing Data for Analysis in Stata
SIDM2 – Preparing data for analysis

Before you can analyse your data, you need to get it into an appropriate format, so that Stata can work for you. To avoid rubbish results, you also need to check that your data are sensible and free of nonsense values.

Recap on preceding workshop, SIDM1
It is essential to create and save a do file of commands; otherwise you will lose your work. You learnt to run commands from this do file as you go along. You also learnt how to open and save datasets, log files and graphs, and to distinguish between different types of data. You learnt to use variable labels to give a fuller description of each variable, and value labels to define categories (when coded numerically, so that we know what the numbers represent). You learnt basic graph and table commands, how to create new variables and amend their values, and how to use the if qualifier. Attention needs to be paid to missing data: Stata treats numeric missing values as larger than any number.

Learning objectives of this session, SIDM2
This session describes how to read data into Stata in the first place. There is a recap on the different types of data, and on why some variables often need to be changed between formats before analysis can begin. It gives example code, and exercises (with solutions available), so that you can see how these commands are used in practice.

Learning objectives of further workshops, SIDM3 and SIDM4
Merging datasets in many different ways, reshaping datasets, looping in Stata, and extracting saved results into files. Efficient production of publication-quality tables.

Further resources complementary to this series
This series teaches most of the material contained in Stata Data Management.doc, referred to as SDM. The accompanying Stata commands crib sheet.xls, SCCS, acts as a quick reference guide (and also summarises some data analysis commands).
Stata manuals (accessed online and via help) and Stata help itself are both excellent resources. The manuals teach statistics as well as Stata, and provide statistics references.

Contents
1. Reading data into Stata from other files
2. Recap on types of data
3. Converting strings to numeric and categorical data as necessary
4. Dealing with Dates
5. Checking for errors and missing data
6. When your dataset erroneously has 2 or more lines of data for a few patients
7. Recoding numeric data into groups
8. Extracting information from string variables
9. Further sources of help

SDM=Stata Data Management.doc    Hilary Watt
SIDM=Stata Introduction and Data Management.doc workshops
SCCS=Stata Commands Crib Sheet.xls

1. Reading data into Stata from other files
Now suppose that the census data we looked at last week were received as an Excel file, which we need to read into Stata before we can analyse it.
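If the course Excel file is not to hand, the import step can be rehearsed with a round trip. The following is a minimal sketch, not the workshop's own code: it assumes the census example dataset shipped with Stata (loaded with sysuse) and writes a hypothetical file, census_demo.xls, to the current working directory.

```stata
* Sketch only: census_demo.xls is a hypothetical file name, written to the current directory
sysuse census, clear                       // example dataset shipped with Stata

* write the data to Excel, putting the variable names in row 1
export excel using "census_demo.xls", firstrow(variables) replace

* read it back; firstrow tells Stata that row 1 holds variable names, not data
import excel using "census_demo.xls", firstrow clear

describe                                   // check storage types: text columns come back as str##
```

Without the firstrow option, Stata names the variables A, B, C and so on, and treats the header row as data, which forces every column to be read as a string.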
See SDM 2.3 Opening data from an Excel file. See SDM chapter 2 for reading in from other sources.

cd "H:\_MPHTeaching\stata\dataman\wk2"
clear
import excel census, firstrow   /* read data in from the Excel file census.xls;
                                   firstrow indicates that the first row is treated as variable names or labels */
descr     // look for string variables, storage type = str##
browse    // look for string variables, which appear in red

Student Exercise: Read nlsw88dates2.xls into Stata. View the data, looking at the types of variables and seeing what it contains. Summarise the data. Here is some further information on what is contained in this dataset, with commands for labelling appropriately:

label var grade "Current grade completed"
label var c_city "Lives in central city"
label var wage "Hourly Wage"
label var south "Lives in South"
label var union "Union Worker"
label var hours "Usual hours worked"
label var ttl_exp "Total Work Experience"
label var tenure "Job Tenure (years)"
label var quesday "Day of month that questionnaire was filled in (started to be filled in)"
label var quesmon "Month that questionnaire was filled in (started to be filled in)"
label var quesyr "Year that questionnaire was filled in (started to be filled in)"
label var quesfinish "Date that questionnaire was completed"

2. Recap on types of data
There are 4 main types of data in Stata:
i) numeric (numerical with types int, byte, float, double; black in data browser)
ii) string (e.g. str2, str24; red in data browser)
iii) categorical (i.e. numeric with value labels; blue in data browser)
iv) dates & times (numeric with format %d or %td or similar; black in data browser).
The describe command details data types, formats, and the presence of value labels and variable labels.
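The types listed above can also be picked out with the ds command, which lists only the variables matching a given property. This is a minimal sketch, assuming the auto example dataset shipped with Stata rather than the course data:

```stata
* Sketch only: uses Stata's shipped auto dataset, not the workshop files
sysuse auto, clear

describe                 // storage type, display format and value label for every variable

ds, has(type string)     // string variables only (shown in red in the browser)
ds, has(type numeric)    // all numeric variables
ds, has(vallabel)        // numeric variables carrying value labels, i.e. categorical
ds, has(format %t*)      // variables with a date/time display format (auto has none)
```

After each ds call the matching names are left in r(varlist), so they can be passed straight on to another command.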
It is usually necessary to have data in numeric format in order to use them in Stata data analysis and most graph commands; this includes dates and times in a Stata format recognised as such (showing up in black in the data editor) and categorical data (showing as blue and looking like text). The main exception is a patient id variable (or hospital ids, regions or similar), where it is generally okay to use a string variable.

Dates and categorical data are often read into Stata as string variables; this is generally the best way to do it. Hence the need to recode into different types of variables. There may also be a desire to recode numeric data into categories, and to recode categorical variables, perhaps by combining categories.

3. Converting strings to numeric and categorical data as necessary
See SDM 5.5 and 5.6 Converting strings to numeric data and categorical data

*** converting from string variables to numeric variables (when most/all values already look like numbers)
destring medage2, replace   // converts string to numeric data, keeping the same variable name
tab medage2 medage, miss    // check new variable against another variable that looks the same
scatter medage2 medage      // they are identical
summ medage medage2         // identical also in number of missing values
list if medage2==.
// identical also in missing values, which are for the same observation
drop medage2                       // we don't need 2 identical variables
destring marriage, replace         // fails: Stata reports nonnumeric characters and leaves the variable unchanged
help destring
destring marriage, force replace   /* converts string var marriage into numeric var,
                                      with missing value where there is any non-numeric data */

*** converting from strings to categorical variables
descr
encode region, gen(region2)   // create a categorical (numeric) var (region2) from the string var, region
tab region region2            // compare the newly created and original variables
tab region region2, nolabel   // compare the newly created and original vars without value labels
codebook region2              // see correspondence of values and value labels
drop region                   // the string version is no longer needed
encode state2, gen(state_2)   // create categorical var (state_2) from string var, state2
tab state2 state_2            // compare
browse state2 state_2         // easier to compare this way
codebook state_2              // gives examples of coding
label list state_2            // gives the full correspondence between numbers and value labels
drop state2                   // string version is no longer needed

Student Exercise:
a) Look for string variables in nlsw88dates2.xls that look numeric, or as if they should be numeric. Create a numeric version of one or more of these variables.
b) Look for categorical variables and convert (some of) them to numeric variables as well. For instance, do this for industry and race. Does the encode command work well for both/all? If not, then what approach shall we take?
c) Try help string functions and decide whether it is a good strategy to use one or a few string functions to tidy up the variables before using the encode command. For instance, you could take just the first character and change it to lower case.

4. Dealing with Dates
See SDM chapter 7 on dates.

*** converting to Stata date variables from string variables
gen dateofsurvey3=date(dateofsurvey, "DMY")   /* converts from string to Stata date variable;
                                                 string ordered day month year (DMY) */
browse dateofsurvey dateofsurvey3   // dates are coded as number of days from a fixed date
format dateofsurvey3 %d             /* display the date variable in a format that we can
                                       understand as a date (not a number) */
browse dateofsurvey dateofsurvey3 //