Data Analysis: Simple Statistical Tests
Total Page:16
File Type:pdf, Size:1020Kb
VOLUME 3, ISSUE 6 Data Analysis: Simple Statistical Tests It’s the middle of summer, prime should use. Remember, continuous time for swimming, and your local variables are numeric (such as age in CONTRIBUTORS hospital reports several children with years or weight) while categorical vari- Escherichia coli O157:H7 infection. ables (whether yes or no, male or fe- Authors: A preliminary investigation shows male, or something else entirely) are Meredith Anderson, MPH that many of these children recently just what the name says—categories. swam in a local lake. The lake has For continuous variables, we generally Amy Nelson, PhD, MPH been closed so the health depart- calculate measures such as the mean FOCUS Workgroup* ment can conduct an investigation. (average), median and standard de- Now you need to find out whether Reviewers: viation to describe what the variable swimming in the lake is significantly looks like. We use means and medi- FOCUS Workgroup* associated with E. coli infection, so ans in public health all the time; for Production Editors: you’ll know whether the lake is the example, we talk about the mean age real culprit in the outbreak. This Tara P. Rybka, MPH of people infected with E. coli, or the situation occurred in Washington in Lorraine Alexander, DrPH median number of household con- 1999, and it’s one example of the tacts for case-patients with chicken Rachel Wilfert, MPH importance of good data analysis, pox. Editor in chief: including statistical testing, in an outbreak investigation. (1) In a field investigation, you are often Pia D.M. MacDonald, PhD, MPH interested in dichotomous or binary The major steps in basic data analy- (2-level) categorical variables that sis are: cleaning data, coding and represent an exposure (ate potato conducting descriptive analyses; salad or did not eat potato salad) or * All members of the FOCUS calculating estimates (with confi- an outcome (had Salmonella infection dence intervals); calculating meas- Workgroup are named on the last or did not have Salmonella infection). ures of association (with confidence page of this issue. For categorical variables, we cannot intervals); and statistical testing. calculate the mean or median, but we In earlier issues of FOCUS we dis- can calculate risk. Remember, risk is cussed data cleaning, coding and the number of people who develop a descriptive analysis, as well as calcu- disease among all the people at risk lation of estimates (risk and odds) of developing the disease during a and measures of association (risk given time period. For example, in a ratios and odds ratios). In this issue, group of firefighters, we might talk we will discuss confidence intervals about the risk of developing respira- and p-values, and introduce some tory disease during the month follow- basic statistical tests, including chi ing an episode of severe smoke inha- square and ANOVA. lation. Before starting any data analysis, it is important to know what types of The North Carolina Center for Public Health Prepared- Measures of Association variables you are working with. The ness is funded by Grant/Cooperative Agreement Num- ber U90/CCU424255 from the Centers for Disease types of variables tell you which esti- We usually collect information on ex- Control and Prevention. The contents of this publication mates you can calculate, and later, posure and disease because we want are solely the responsibility of the authors and do not which types of statistical tests you to compare two or more groups of necessarily represent the views of the CDC. North Carolina Center for Public Health Preparedness—The North Carolina Institute for Public Health FOCUS ON FIELD EPIDEMIOLOGY Page 2 people. To do this, we calculate measures of association. Table 1. Sample 2x2 table for Hepatitis A at Restaurant A. Measures of association tell us the strength of the asso- ciation between two variables, such as an exposure and a Outcome disease. The two measures of association that we use No Hepatitis Hepatitis A Total most often are the relative risk, or risk ratio (RR), and the A odds ratio (OR). The decision to calculate an RR or an OR depends on the study design (see box below). We inter- Ate salsa 218 45 263 Exposure pret the RR and OR as follows: Did not 21 85 106 RR or OR = 1: exposure has no association with disease eat salsa RR or OR > 1: exposure may be positively associated Total 239 130 369 with disease authors did. They found an odds ratio of 19.6*, meaning RR or OR < 1: exposure may be negatively associated that the odds of getting Hepatitis A among people who ate with disease salsa were 19.6 times as high as among people who did This is a good time to discuss one of our favorite analysis not eat salsa. tools—the 2x2 table. You should be familiar with 2x2 ta- *OR = ad = (218)(85) = 19.6 bles from previous FOCUS issues. They are commonly bc (45)(21) used with dichotomous variables to compare groups of people. The table has one dichotomous variable along the Confidence Intervals rows and another dichotomous variable along the col- umns. This set-up is useful because we usually are inter- How do we know whether an odds ratio of 19.6 is mean- ested in determining the association between a dichoto- ingful in our investigation? We can start by calculating the mous exposure and a dichotomous outcome. For exam- confidence interval (CI) around the odds ratio. When we ple, the exposure might be eating salsa at a restaurant (or calculate an estimate (like risk or odds), or a measure of not eating salsa), and the outcome might be Hepatitis A (or association (like a risk ratio or an odds ratio), that number no Hepatitis A). Table 1 displays data from a case-control (in this case 19.6) is called a point estimate. The confi- study conducted in Pennsylvania in 2003. (2) dence interval of a point estimate describes the precision of the estimate. It represents a range of values on either So what measure of association can we get from this 2x2 side of the estimate. The narrower the confidence inter- table? Since we do not know the total population at risk val, the more precise the point estimate. (3) Your point (everyone who ate salsa), we cannot determine the risk of estimate will usually be the middle value of your confi- illness among the exposed and unexposed groups. That dence interval. means we should not use the risk ratio. Instead, we will calculate the odds ratio, which is exactly what the study An analogy can help explain the confidence interval. Say we have a large bag of 500 red, green, and blue marbles. Risk ratio or odds ratio? We are interested in knowing the percentage of green mar- bles, but we do not have the time to count every marble. The risk ratio, or relative risk, is used when we look at a So we shake up the bag and select 50 marbles to give us population and compare the outcomes of those who an idea, or an estimate, of the percentage of green mar- were exposed to something to the outcomes of those bles in the bag. In our sample of 50 marbles, we find 15 who were not exposed. When we conduct a cohort green marbles, 10 red marbles, and 25 blue marbles. study, we can calculate risk ratios. Based on this sample, we can conclude that 30% (15 out However, the case-control study design does not allow us of 50) of the marbles in the bag are green. In this exam- to calculate risk ratios, because the entire population at ple, 30% is the point estimate. Do we feel confident in risk is not included in the study. That’s why we use odds stating that 30% of the marbles are green? We might ratios for case-control studies. An odds ratio is the odds have some uncertainty about this statement, since there is of exposure among cases divided by the odds of expo- a chance that the actual percent of green marbles in the sure among controls, and it provides a rough estimate of entire bag is higher or lower than 30%. In other words, our the risk ratio. sample of 50 marbles may not accurately reflect the ac- So remember, for a cohort study, calculate a risk ratio, tual distribution of marbles in the whole bag of 500 mar- and for a case-control study, calculate an odds ratio. For bles. One way to determine the degree of our uncertainty more information about calculating risk ratios and odds is to calculate a confidence interval. ratios, take a look at FOCUS Volume 3, Issues 1 and 2. North Carolina Center for Public Health Preparedness—The North Carolina Institute for Public Health VOLUME 3, ISSUE 6 Page 3 Calculating Confidence Intervals Ideally, we would like a very narrow confidence interval, So how do you calculate a confidence interval? It’s cer- which would indicate that our estimate is very precise. tainly possible to do so by hand. Most of the time, how- One way to get a more precise estimate is to take a larger ever, we use statistical programs such as Epi Info, SAS, sample. If we had taken 100 marbles (instead of 50) from STATA, SPSS, or Episheet to do the calculations. The our bag and found 30 green marbles, the point estimate default is usually a 95% confidence interval, but this can would still be 30%, but the 95% confidence interval would be adjusted to 90%, 99%, or any other level depending be a range of 21-39% (instead of our original range of 17- on the desired level of precision.