Biostatistics and Epidemiology Using Stata: A Course Manual

Table of Contents

Section 1. Stata: Data Management, Graphics, and Programming

1.1 Installing Stata and recovering Stata windows
  installing Stata
  adding an icon to the desktop (PC Windows)
  run Stata to finish setup
  updating Stata after setup
  recovering Stata windows: load factory settings

1.2 Getting data into Stata and some other basics
  opening a Stata formatted data file: 1) clicking on file icon
  showing full directory path in Windows Explorer
  showing file extensions in Windows Explorer
  opening a Stata formatted data file: 2) File icon on menu bar
  opening a Stata formatted data file: 3)
    change directory (cd) command
    directory list (dir) command
    read in Stata data file (use) command
  scrolling in Stata’s Results window
  general syntax, or structure, of Stata commands
  Stata help facility, help command
  Stata manuals
  books on Stata
  setting file attributes in Windows (turn off Read Only)
  using do-files
  suggested do-file structure
  increasing memory size of Stata’s workspace: set memory (set mem) command
  importing Excel file into Stata
  reading in a *.csv or *.txt formatted file: insheet command
  saving a Stata formatted data file: save command
  saving a Stata formatted data file compatible with Stata version 8 or 9: saveold command

1.3 Cleaning data
  listing data: list command
  block comment: /* … */
  deleting variables: drop command
  inline comment “//”
  tabulation of values of variables using frequency table: tabulate (tab) command
  examining 4 smallest and 4 largest values: summarize, detail (sum) command
  replacing value of variable: replace command
    assignment “=” and logical equals “==”
  recoding values of a variable with generate (gen) and replace commands
  keeping a command from crashing the do-file: capture command, e.g., capture drop
  the “0 observations” result: attempting to do arithmetic on a string variable
  describing variables: describe command
  variable storage types
  missing value for string variable: the null string ""
  converting a string variable to numeric: destring command
  recoding values of a variable: recode command
  converting to all upper or lower case: upper and lower string functions
  renaming a variable: rename command

1.4 Merging files
  adding a file to bottom of file in memory: append command
  adding a file in rightmost columns of file in memory, one-to-one merge without matching on some variable: merge command
  merging files while matching on some variable such as a subject ID, match merge: merge command
  checking how well the matching worked: Stata’s _merge variable (values 1 to 3)
  non-overwrite feature of merge command (the default)
  non-overwrite of missing values feature of merge command (the default)
  updating file in memory with another file by replacing missing values only: update option
  updating file in memory with another file by replacing both missing and nonmissing values, update with replace: replace option
  checking how well the update with replace worked: Stata’s _merge variable (values 1 to 5)

1.5 Labeling variables and values
  adding label to a variable: label variable command
  adding labels to the values of a variable: label define and label values commands
  listing value labels: label list command
  suspending value labels in data browser and outputs: nolabel option
  removing variable labels
  removing value labels: label drop command
  removing value labels: capture label drop command
  displaying values and value labels

1.6 Basic graphics
  using graphs from Stata version 7: graph7 and version 7 commands
  redisplaying a graph: graph display command
  scatterplot: graph twoway scatter command
  abbreviated scatterplot commands: twoway scatter and scatter commands
  side by side graph: by option
  linear regression line graph: lfit command
  overlaying graphs: “||” operator
  overlaying linear regression line on scatterplot using || operator
  overlaying graphs using binding notation: ( ) ( )
  overlaying linear regression line on scatterplot using ( ) approach
  generating a variable with rounding: round function
  generating a mean across data rows for subgroups: by specification with egen command with mean function
  listing variables: list command
  extending a command across several lines in do-file editor: #delimit command
  line graph: line command
  requirement to sort on x variable before plotting a line graph: sort command
  table of descriptive statistics for a two-variable cross-classification: table command
  smooth line graph using fractional polynomial fit: fpfit command
  fractional polynomial fit with covariates: fracpoly command
  adding title to graph: title command
  adding subtitle to graph: subtitle command
  adding axis titles to graph: ytitle and xtitle commands
  adding footnote to graph: note command
  adding more tick marks and labels to axes: ylabel and xlabel commands
  better labels for legend: legend command
  list of choices for line graph line widths: graph query linewidthstyle command
  changing connect line width of line graph: clwidth option
  list of choices for graph scheme: graph query, schemes command
  changing default graph scheme for current session or permanently: set scheme command
  changing graph scheme just for current graph: scheme option
  basic black-and-white scheme for manuscripts: scheme(s1mono) option
  eliminating border around graph: plotregion(style(none)) option
  adding text to graph: text option
  placement options for positioning text: placement option
  adding space between x-axis title and x-axis tick labels: height(5) option in xtitle
  changing color of connect line of line graph: clcolor option
  turning off legend: legend(off) option
  reading in graph data by putting data in do-file: input and end commands
  adding error bars to graph: rcap command
  overlaying error bars on scatterplot to get symbol with error bars: twoway (rcap…) (scatter…) commands
  adding white space to left and right side of graph: xlabel command
  change tick mark labels to more descriptive labels: xlabel command
  drop tick marks from graph while retaining labels: noticks option
  adding horizontal or vertical reference lines: yline and xline options
  list of choices for colors: help colorstyle command
  list of choices for symbols: help symbolstyle command
  changing marker symbol for scatterplot: msymbol option
  changing color of marker symbol border line and inside fill: mlcolor and mfcolor options

1.7 Looping, collapsing, and reshaping

1.8 Operators, ifs, dates, and times

1.9 More graphics: popular scientific graphs

1.10 Programming Stata

1.11 Compilation of frequently used variable generation and modifying commands (a chapter for quick look up)

1.12 Homework problems

Section 2. Biostatistics

2-1 Describing variables, levels of measurement, and choice of descriptive statistics
  Describing a variable (distribution):
    with tables: frequency tables
    with graphs: histogram, boxplots
    with descriptive statistics: mean, standard deviation, etc.
  Levels of measurement (nominal, ordinal, ... categorical, continuous ...)
  How to decide what descriptive statistic to use to describe a variable in the “Table 1. Patient Characteristics” table of an article

2-2 Logic of significance tests
  What a probability distribution is
  Logic of a significance test (same logic as a laboratory reference range)
  Chance, randomness, sampling variability
  Statistical regularity (the basis of statistical theory)
  Strong Law of Large Numbers (formal statement of statistical regularity)
  Deriving the form of statistical test (significance test) intuitively
  Sampling distribution
  p value

2-3 Choice of significance test

2-4 Comparison of two independent groups
  Role of p values in a Table 1 Patient Characteristics table
  Confounding variables
  chi-square test
  Fisher’s exact test
  Asymptotic vs exact tests (parametric vs nonparametric tests)
  Minimum expected frequency rule for choosing between chi-square test and Fisher’s exact test
  Barnard’s unconditional exact test
  Fisher-Freeman-Halton test
  Wilcoxon-Mann-Whitney test
  Fisher-Pitman Permutation Test for Independent Samples
  Central Limit Theorem
  Levene’s test for equality of variances
  t test (both equal and unequal variances)
  Shapiro-Wilk test for normality
  Reporting styles
  Outliers
  Prespecification of analysis

2-5 Basics of power analysis
  definition of power
  power increases as sample size increases
  decision errors of significance tests [Type I error (alpha), Type II error (beta)]
  Type II error and sample size paragraph in journal article
  conclusions of equivalence
  power of a significance test
  effect of one- or two-sided comparison on power
  effect of choice of alpha on power
  effect of choice of minimum detectable effect size on power
  effect of size of assumed standard deviation (SD) on power – coming up with a SD estimate
  effect of sample size on power
  sample size and power calculations for an interval scaled outcome variable
  what to do if you don’t know anything (no effect size or standard deviation estimates): the standard deviation units approach, Cohen’s d
  sample size calculation when a multiple comparison adjustment is planned
  overfitting
  switching the dependent and independent variables
  sample size based on precision (desired width of confidence interval)
  excessive power (sample size very large)
  two group comparison of interval scale outcome
  sample size paragraph in study protocol

2-6 More on levels of measurement
  sums of ordinal scales produce interval scales
  dichotomous scales are actually interval scales
  can statistical tests that require interval scales be used with ordinal scales (the ordinal-interval controversy in statistics)

2-7 Comparison of two paired groups

2-8 Multiplicity and the Comparison of 3+ Groups
  multiplicity
  multiple comparison problem
  p value based multiple comparison procedures: family-wise error rate (Bonferroni, Holm, Sidak, Holm-Sidak, Hochberg, Finner, Hommel, Tukey-Ciminera-Heyse)
  p value based multiple comparison procedures: false discovery rate (Benjamini-Hochberg procedure)
  how to get away without using multiple-comparison procedures
  simultaneous comparison of 3+ groups (includes one-way analysis of variance)
  sample size when multiple comparisons are planned

2-9 Correlation

2-10 Linear regression
  how linear regression controls for covariates

2-11 Logistic regression and dummy variables
  linear regression estimates risk difference (difference between proportions), but is criticized because it can estimate predicted probabilities outside of the 0-1 range
  logistic regression is designed to constrain the predicted probability between 0 and 1
  definition of an odds ratio
  assessing linearity of effect
  dummy variables (indicator variables)

2-12 Survival analysis: Kaplan-Meier graphs, Log-rank Test, and Cox regression
  life tables
  Kaplan-Meier survival probabilities & Kaplan-Meier curves
  log-rank test
  Cox regression
  assessing goodness of fit with c-statistic (ROC area)
  interpreting the c-statistic
  testing proportional hazards assumption of Cox regression

2-13 Confidence intervals versus p values and trends toward significance

2-14 Pearson correlation coefficient with clustered data

2-15 Equivalence and noninferiority tests

2-16 Validity and reliability

2-17 Methods comparison studies

2-18 One sample tests

2-19 Homework problems

Section 3. Epidemiology

3-1 Introduction to epidemiologic thinking

3-2 Sufficient/component cause theory of disease

3-3 Hill’s causal criteria

3-4 Logic and errors

3-5 Effect measures

3-6 Study designs

3-7 Randomization using Excel

3-8 Bias and confounding

3-9 Random error and statistics

3-10 Crude analysis

3-11 Stratified analysis

3-12 Standardization

3-13 Sensitivity (bias) analysis

3-14 Case-cohort study design

3-15 Homework problems

Section 4. Power Analysis

Chapter 4-1. Sample Size Determination and Power Analysis for Specific Applications
  two independent groups comparison of means (independent groups t test)
  linear regression: comparing two groups adjusted for covariates
  two independent groups comparison of dichotomous outcome variable (chi-square test, Fisher’s exact test)
  two independent groups comparison of a nominal outcome variable (chi-square test and Fisher-Freeman-Halton test)
  two independent groups comparison of ordinal outcome variable (Wilcoxon-Mann-Whitney test)
  paired ordinal outcome variable (Wilcoxon signed ranks test)
  repeated measurements or clustered studies (GEE, mixed, multilevel, hierarchical models)
  power analysis using Monte Carlo simulation (independent samples t test)
  power analysis using Monte Carlo simulation (2 × 2 table chi-square test)
  power analysis using Monte Carlo simulation (Poisson regression with person-time)
  power analysis using Monte Carlo simulation (2-way ANOVA, both factors with 2 levels, neither of which is a repeated measurement)
  logrank test

Section 5. Regression Models

5-1 What regression is and curvilinear correlation

5-2 Holding constant

5-3 Dichotomous predictor variables

5-4 Adjusted means, Analysis of Variance (ANOVA), and interaction

5-5 Deriving logistic regression

5-6 Exact logistic regression

5-7 Introducing Cox regression and Kaplan-Meier plots

5-8 Interaction

5-9 Missing data imputation

5-10 Linear regression robust to assumptions

5-11 Linear regression diagnostics and transformations

5-12 Variable selection and collinearity

5-13 Monte Carlo Simulation and Bootstrapping

5-14 Model Validation

5-15 Response feature (summary measure) analysis

5-16 Analysis of covariance (ANCOVA) versus change analysis

5-17 Conditional logistic regression

5-18 Repeated measures analysis of variance

5-19 Generalized estimating equations (GEE)

5-20 Multilevel (mixed effects) models

5-21 Regression post tests

5-22 Modeling cost

5-23 Cox regression proportional hazards assumption

5-24 Cluster analysis

5-25 Multilevel (mixed effects) logistic regression

5-26 Trend tests

5-27 Homework problems

Appendix 1. Dataset Descriptions

births.dta
  Concerns 500 mothers who had singleton births in a large London hospital.

evans.dta
  From a cohort study in which n=609 white males were followed for 7 years, with coronary heart disease as the outcome of interest.

2.20.Framingham.dta
  The dataset comes from a long-term follow-up study of cardiovascular risk factors on 4699 patients living in the town of Framingham, Massachusetts.

LeeLife.dta
  Concerns male patients with localized cancer of the rectum diagnosed in Connecticut from 1935 to 1954. The research question is whether survival improved for the 1945-1954 cohort of patients (cohort = 1) relative to the earlier 1935-1944 cohort (cohort = 0).

mi.dta
  From a 1:2 matched case-control study in which n=117 subjects are formed into 39 matched strata.

rmr.dta
  Data published by Nawata et al. (2004) (on course CD). The data were created from the authors’ Figure 1, a scatterplot, and so only approximate the actual values used by the authors.

smoke.csv
  Concerns 234 smokers who expressed a willingness to quit smoking and were followed for one year to estimate the proportion of recidivism (quit for a time and then started again).

wright_lowbw.dta
  The dataset concerns 900 birthweight outcomes and risk factors attributable to the mother.

vaso.dta
  The data were obtained in a carefully controlled study of the effect of the RATE and VOLume of air inspired by human subjects on the occurrence (coded 1) or non-occurrence (coded 0) of a transient vasoconstriction RESPonse in the skin of the fingers.

PEPI Windows Programs

Selected programs from the software package Programs for Epidemiologists (PEPI). The software distribution CD grants permission to share the software, as long as the person sharing it does not charge for it. The manual must be purchased, however, and cannot be shared without permission.

These programs run in a DOS window of the Windows operating system.

adjustp.exe
  multiple comparison procedures (referred to in Chapter 2-8)
    Holm’s procedure
    Hommel’s procedure
    Finner’s procedure

misclass.exe
  sensitivity analysis for misclassification bias (referred to in Chapter 3-13)

powr.exe
  power analysis for comparing two groups (not referred to in manual)
    independent proportions (chi-square test)
    related proportions (McNemar test)
    ordered categories (Wilcoxon-Mann-Whitney test)
    independent means (independent groups t test)
    related sample means (paired t test)

sample.exe
  single group sample size determination (not referred to in manual)
    proportion
    mean
    precision of prevalence rate

samples.exe
  two group sample size determination (not referred to in manual)
    independent proportions (chi-square test)
    related proportions (McNemar test)
    ordered categories (Wilcoxon-Mann-Whitney test)
    independent means (independent groups t test)
    related sample means (paired t test)
