<<

Paper 3561-2019
Thirteen Techniques Every Biostatistician Should Know
AnnMaria De Mars, The Julia Group, Santa Monica, CA

ABSTRACT
No one knows all of SAS® or all of statistics. There will always be some technique that you don't know. However, there are a few techniques that anyone in the field should know. Make your life easier by learning to calculate those with SAS. In this session you will learn how to compute and interpret 12 of these techniques, including several statistics that are frequently confused. The following statistics are covered: prevalence, incidence, sensitivity, specificity, attributable fraction, population attributable fraction, risk difference, relative risk, odds ratio, Fisher's exact test, number needed to treat, and McNemar's test. The 13th, extra bonus tool, is SAS statistical graphics. With these 13 tools in their tool chest, even non-statisticians or statisticians who are not specialists will be able to answer many common questions in biostatistics. You're in luck because each of these can be computed with a few statements in SAS.

INTRODUCTION
No one can know all of SAS or all of statistics, no, not even that annoying guy down the hall. Especially not that annoying guy down the hall. However, certain statistics are so commonly used that every biostatistician should know them. All of these can be computed easily with SAS.

Biostatistics, broadly defined, is the application of statistics to topics in biology. However, most people when discussing biostatistics are really focused on biomedical topics, and not, say, the lifespan of a particular species of ant-decapitating fly. There are five ways (at least) statistics can be applied to the study of disease:

1. How common is it? This is a question of prevalence (how likely you are to have it) and incidence (how likely you are to get it). If you think those two are the same, you should take a course in epidemiology, or just finish reading this paper.
2. What causes it? What are the factors that increase (or decrease) your risk of contracting a disease?
3. What pattern(s) does it follow? What is the prognosis? Are you likely to die of it quickly, eventually or never? To determine if a treatment is effective for cancer of the eyelashes, we need to first have an idea of what the risk of disability or death is when one is left untreated and over how long of a period of time, that is, what is the "natural progression" of the disease.
4. How effective are attempts to prevent or treat a disease?
5. How can we develop policies to minimize disease?

This paper covers basic statistics to address the first four questions. Policy should be developed based on the application of answers to those questions. While there is obviously a lot to be learned in the field of biostatistics, and a wide range of SAS procedures that can be applied, there are a basket of techniques that ought to be in everyone's hands, computable with SAS. These are prevalence, incidence, sensitivity, specificity, attributable

fraction, population attributable fraction, risk difference, relative risk, odds ratio, Fisher's exact test, number needed to treat, and McNemar's test. That's only 12. What's the extra? Graphs. I admit, I cheated by including a whole category, but you'll find that ROC curves, survival curves, maps and odds ratio plots are invaluable in explaining results to a non-technical audience.

PREVALENCE AND INCIDENCE – TWO DIFFERENT ANSWERS TO THE SAME (MORE OR LESS) QUESTION
Policy makers have very good reasons for wanting to know how common a condition or disease is. It allows them to plan and budget for treatment facilities, supplies of medication and rehabilitation personnel. There are two broad answers to the question, "How common is condition X?" and, interestingly, both of these use the exact same SAS procedures.

HOW TO COMPUTE PREVALENCE USING SAS
Prevalence is the number of persons with a condition divided by the number in the population. It is often given as a rate per thousand, or per 100,000, depending on how common the condition is. In brief, prevalence is how likely a person is to have condition X.

Assuming that your data are already cleaned and you have a variable with a binary coding for "has disease" and "doesn't have disease" (pretty big assumptions), you can simply do a PROC FREQ.

PROC FREQ DATA = yourdatasetname ;
   TABLES variable ;
RUN ;

In the example below, from the California Health Interview Survey, approximately 11% of the respondents had been told they had diabetes, giving a prevalence of 110 per 1,000.

                Diabetes-ever
                                        Cumulative    Cumulative
  AB22    Frequency    Percent          Frequency       Percent
  1           4701       10.95               4701         10.95
  2          38234       89.05              42935        100.00

Table 1. Output from PROC FREQ for Prevalence of Diabetes
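If you would rather have the prevalence expressed per 1,000 rather than as a percent, a minimal sketch is shown below. The data set name (yourdatasetname) and variable (AB22) come from the example above; the output data set names (prev_out, prev_rate) are invented for illustration.

PROC FREQ DATA = yourdatasetname ;
   TABLES AB22 / OUT = prev_out ;   /* OUT= saves the counts and percents to a data set */
RUN ;

DATA prev_rate ;
   SET prev_out ;
   per_1000 = (percent / 100) * 1000 ;   /* e.g., 10.95 percent -> about 110 per 1,000 */
RUN ;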

HOW TO COMPUTE INCIDENCE USING SAS

INCIDENCE RATE is the rate at which new cases are occurring. Incidence is computed by dividing the number of new cases that occur in a specified period by the number of people in the population at risk.

The population at risk in the next example was defined as all infants born in 2014. If you are interested in birth statistics, the National Center for Health Statistics is highly recommended as a source. This example is from a public use data set of all 40,002 births in U.S. territories in the year 2014. (Winning Trivial Pursuit fact: The United States has 16 territories.)

Both incidence and prevalence estimates assume an accurate definition of cases, which requires understanding the data and the diagnosis. This data set required a very slight amount of coding because the Down syndrome variable at birth is coded as C for Confirmed, N for No and P for Pending (Center for Disease Control, 2014). "Pending" means that the medical personnel suspect Down syndrome but they are waiting for the results of a chromosomal analysis. "Confirmed" means the analysis has confirmed a diagnosis of Down syndrome. Based on the presumption that most experienced medical personnel recognize Down syndrome, these two categories were combined, using IF/THEN and ELSE statements.

LIBNAME out "C:\Users\me\mydir\" ;

DATA incidence ;
   SET out.birth2014 ;
   IF ca_down IN ("C", "P") THEN down = "Y" ;
   ELSE down = "N" ;
RUN ;

Again, a PROC FREQ is used. The difference between incidence and prevalence is not in the computation but in the selection of the population and definition of the numerator. Because Down syndrome is present at birth and never acquired afterward, the new cases are going to be those children born in the year 2014 who have a diagnosis of Down syndrome and the denominator will be all births during the year.

PROC FREQ DATA = incidence ;
   TABLES down ;
RUN ;

Results are shown in Table 2 below.

                Down Syndrome
                                        Cumulative    Cumulative
  down    Frequency    Percent          Frequency       Percent
  N           39963      99.90              39963         99.90
  Y              39       0.10              40002        100.00

Table 2. Output from PROC FREQ for Incidence of Down syndrome

The incidence rate is .10 percent, or 1 per 1,000. Since this estimate falls perfectly in line with the World Health Organization (2016) estimate of between 1 in 1,000 and 1 in 1,100 live births, it appears that the case definition was appropriate.
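If you also want a confidence interval around that incidence proportion, a minimal sketch is below, assuming the incidence data set created above. The BINOMIAL option requests the binomial proportion and its confidence limits; the LEVEL= suboption tells SAS which level of the variable to treat as the event.

PROC FREQ DATA = incidence ;
   TABLES down / BINOMIAL (LEVEL = "Y") ;   /* proportion and confidence limits for down = "Y" */
RUN ;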

SENSITIVITY AND SPECIFICITY – TWO ANSWERS TO A SECOND QUESTION: DO YOU HAVE A DISEASE?
Both sensitivity and specificity address the same question – how accurate is a test for disease – but from opposite perspectives. Sensitivity is defined as the proportion of those who have the disease who are correctly identified as positive. Specificity is the proportion of those who do not have the disease who are correctly identified as negative. Specificity and sensitivity can be computed simultaneously, as shown in the example below using a hypothetical disease test. The results are in and the following table has been obtained:

                      Disease    No Disease
  Test Positive           240            40
  Test Negative            60           160

Table 3. Results from Hypothetical Screening Test


COMPUTING SENSITIVITY AND SPECIFICITY USING SAS
Step 1 (optional): Reading the data into SAS. If you already have the data in a SAS data set, this step is unnecessary. The example below demonstrates several SAS statements for reading data into a SAS data set when only aggregate results are available.

The ATTRIB statement sets the length of the variable to be 10, rather than accepting the SAS default of 8 characters.

The INPUT statement uses list input, with a $ signifying character variables.

The DATALINES statement, on a line by itself, precedes the data. (Trivial Pursuit fact #2: CARDS; will also work, dating back to the days when this statement was followed by cards with the data punched on them.) A semicolon on a line by itself denotes the end of the data.

DATA diseasetest ;
   ATTRIB result LENGTH = $10 ;
   INPUT result $ disease $ weight ;
   DATALINES ;
positive present 240
positive absent 40
negative present 60
negative absent 160
;

Step 2: PROC FREQ

PROC FREQ DATA = diseasetest ORDER = FREQ ;
   TABLES result*disease ;
   WEIGHT weight ;
RUN ;

Yes, another PROC FREQ. The ORDER = FREQ option is not required but it makes the data more readable, in my opinion, because with these data the first cell will now be those who had a positive result and did, in fact, have the disease, which is the format in which sensitivity and specificity data are typically presented. The total for column 1 is the denominator for the formula for sensitivity, which is:

Sensitivity = (Number with the disease who tested positive) / (Total with the disease).

TABLES variable1*variable2 will produce a cross-tabulation with variable1 as the row variable and variable2 as the column variable.

Weight weightvariable will weight each record by the value of the weight variable. The variable was named ‘weight’ in the example above but any valid SAS name is acceptable. Leaving off this statement will result in a table that only has 4 subjects, 1 subject for each combination of result and disease, corresponding to the data lines above.

Results of the PROC FREQ are shown below. The bottom value in each box is the column percent. Because the first category happens to be the “tested positive” and the first column is “disease present”, the column percent for the first box in the cross-tabulation – positive test result, disease is present – is the sensitivity, 80%. This is the proportion of those who have the disease (the disease present column) who had a positive test result.


Table of result by disease
(cell entries: Frequency / Percent / Row Pct / Col Pct)

                      disease = present   disease = absent     Total
  result = positive          240                 40              280
                           48.00               8.00            56.00
                           85.71              14.29
                           80.00              20.00
  result = negative           60                160              220
                           12.00              32.00            44.00
                           27.27              72.73
                           20.00              80.00
  Total                      300                200              500
                           60.00              40.00           100.00

Table 3. Output from PROC FREQ for Sensitivity and Specificity

The column percentage for the box corresponding to a negative test result and absence of disease is the value for specificity. In this example, the two values, coincidentally, are both 80%. Three points are worthy of emphasis here:

1. While the location of specificity and sensitivity in the table may vary based on how the data and PROC FREQ are coded, the values for sensitivity and specificity will always be diagonal to one another.
2. This exact table produces four additional values of interest in evaluating screening and diagnostic tests: positive predictive value, negative predictive value, false positive probability and false negative probability. Further details on each of these, along with how to compute the confidence intervals for each, can be found in Usage Note 24170 (SAS Institute, 2015).
3. The same exact procedure produces six different statistics used in evaluating the usefulness of a test. Yes, that is pretty much the same as point number 2, but it bears repeating.
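As a cross-check, here is a minimal DATA step sketch that computes sensitivity and specificity (plus the positive and negative predictive values just mentioned) directly from the four cell counts in Table 3. The data set and variable names are invented for illustration; the formulas for PPV and NPV are the standard definitions, not part of the PROC FREQ output shown above.

DATA screen_stats ;
   tp = 240 ; fp = 40 ;    /* tested positive: disease present, disease absent */
   fn = 60 ;  tn = 160 ;   /* tested negative: disease present, disease absent */
   sensitivity = tp / (tp + fn) ;   /* 240/300 = .80 */
   specificity = tn / (tn + fp) ;   /* 160/200 = .80 */
   ppv = tp / (tp + fp) ;           /* 240/280 = .857, the row percent in Table 3 */
   npv = tn / (tn + fn) ;           /* 160/220 = .727 */
RUN ;

PROC PRINT DATA = screen_stats ;
RUN ;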

RELATIVE RISK, RISK DIFFERENCE AND ODDS RATIO – THREE ANSWERS TO A THIRD QUESTION: DOES THIS CAUSE A DISEASE?
Using the sashelp.heart data set, let's look at the difference in risk of cancer between smokers and non-smokers. If you'd like to try this at home, the code below creates a data set with variables for age group, smoking status and cancer death.

DATA attributable ;
   SET sashelp.heart ;
   IF ageatstart < 36 THEN agegroup = "28-35" ;
   ELSE IF ageatstart < 46 THEN agegroup = "36-45" ;
   ELSE IF ageatstart < 56 THEN agegroup = "46-55" ;
   ELSE IF ageatstart > 55 THEN agegroup = "56-62" ;
   IF MISSING(smoking_status) = 0 AND smoking_status NE "Non-smoker" THEN smoker = "Yes" ;
   ELSE IF MISSING(smoking_status) = 0 THEN smoker = "_No" ;
   IF deathcause = "Cancer" THEN cancer = "Yes" ;
   ELSE cancer = "_No" ;
   IF cancer = "" OR smoker = "" OR agegroup = "" THEN DELETE ;
RUN ;

The risk difference is the difference in risk between the exposed and non-exposed group and is an option in PROC FREQ. The relative risk is the ratio of the risk of the exposed group and non-exposed group and can also be requested from (you guessed it) PROC FREQ. The code is simply:

PROC FREQ DATA = attributable ;
   TABLES smoker*cancer / RELRISK RISKDIFF ;
RUN ;

and produces the following output.

Table of smoker by cancer
(cell entries: Frequency / Percent / Row Pct / Col Pct)

                    cancer = Yes    cancer = _No      Total
  smoker = Yes           302            2370           2672
                        5.84           45.81          51.65
                       11.30           88.70
                       56.34           51.11
  smoker = _No           234            2267           2501
                        4.52           43.82          48.35
                        9.36           90.64
                       43.66           48.89
  Total                  536            4637           5173
                       10.36           89.64         100.00

Table 4. Output from PROC FREQ showing cross-tabulation

Notice that smoker and cancer were coded "_No" with an underscore to ensure that the first column is prevalence in the exposed group (Yes for smoking and Yes for cancer). Note the row percentages for the Yes column (11.30 and 9.36) because these will look familiar in the next table.

As you can see from Table 5 below, the risk of the disease for Row 1, the exposed group, of .1130 is simply the row percentage of the exposed group (in this case, smokers) with the disease. In the second row, the risk for the non-exposed is the percentage of non-smokers with the disease. The risk difference is obtained by subtracting the second row from the first. So, yes, I use the RISKDIFF option to avoid the need for subtraction (don't judge me!). The RISKDIFF option also gives 95% confidence intervals for the risk of each group and for the risk difference.

Column 1 Risk Estimates

                                        (Asymptotic) 95%      (Exact) 95%
                 Risk        ASE       Confidence Limits    Confidence Limits
  Row 1         0.1130     0.0061      0.1010    0.1250     0.1013    0.1256
  Row 2         0.0936     0.0058      0.0821    0.1050     0.0824    0.1057
  Total         0.1036     0.0042      0.0953    0.1119     0.0954    0.1122

  Difference    0.0195     0.0085      0.0029    0.0360

  Difference is (Row 1 - Row 2)

Table 5. Output from PROC FREQ showing risks and risk difference

As you can also see above, that interval is somewhat wide, but does not include zero. The third table produced by this analysis, not shown, simply gives the risk estimates for Column 2, in this case not having the disease, and its risk difference is the same value with the opposite sign. The fourth table, below, gives the odds ratio and relative risks.

Odds Ratio and Relative Risks

                                 Value     95% Confidence Limits
  Odds Ratio                    1.2345       1.0310      1.4782
  Relative Risk (Column 1)      1.2080       1.0276      1.4201
  Relative Risk (Column 2)      0.9785       0.9606      0.9968

Table 6. Output from PROC FREQ showing relative risk and odds ratio

The relative risk is the risk in the exposed group divided by the risk in the non-exposed group, that is, the .1130 from Row 1 above divided by the .0936 from Row 2. Table 6 gives that value of 1.208 as well as 95% confidence limits for the relative risk. When the RELRISK option is specified for PROC FREQ, the odds ratio is also produced and, as can be seen above, it is quite close to the relative risk, with a value of 1.2345.

PROC FREQ is not the only SAS procedure that will compute odds ratios. For example:

PROC LOGISTIC DATA = attributable ;
   CLASS smoker (REF = "_No") / PARAM = REF ;
   MODEL cancer (EVENT = "Yes") = smoker ;
RUN ;

will produce the identical odds ratio. (The CLASS statement is needed because smoker is a character variable; reference-cell coding with "_No" as the reference level reproduces the odds ratio shown above.)
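If you would like to verify the PROC FREQ output by hand, the sketch below computes the risks, risk difference, relative risk and odds ratio directly from the cell counts in Table 4. The data set and variable names are invented for illustration; the odds ratio formula itself is explained in the next section.

DATA rr_check ;
   a = 302 ;  b = 2370 ;   /* smokers: cancer yes, cancer no      */
   c = 234 ;  d = 2267 ;   /* non-smokers: cancer yes, cancer no  */
   risk_exposed   = a / (a + b) ;                      /* .1130, Row 1  */
   risk_unexposed = c / (c + d) ;                      /* .0936, Row 2  */
   risk_diff      = risk_exposed - risk_unexposed ;    /* .0195         */
   relative_risk  = risk_exposed / risk_unexposed ;    /* 1.208         */
   odds_ratio     = (a / b) / (c / d) ;                /* 1.2345        */
RUN ;

PROC PRINT DATA = rr_check ;
RUN ;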

ODDS RATIO VS RELATIVE RISK
Think of a cross-tabulation of risk factor and disease like the table below.

                       DISEASE
  RISK FACTOR      YES      NO
  YES                A       B
  NO                 C       D

Table 7. Generic cross-tabulation of risk factor by disease

Risk of exposed population = A / (A + B)
Risk of non-exposed population = C / (C + D)

The relative risk is, then:

(A / (A + B)) / (C / (C + D))

The odds ratio, in turn, is, as the name implies, the ratio of the odds for the exposed group and the odds of the non-exposed group. The formula for the odds ratio is:

(A / B) / (C / D)

Why are these usually so close? Because most diseases, whether a person is exposed (A) or not exposed (C), have a pretty low risk relative to the total population. Let's say the risk of a disease is 1% in the exposed population and .5% in the unexposed. In a sample of 200 people in each group, exposed and unexposed, whether dividing 2/200 by 1/200 or 2/198 by 1/199, the result is going to be pretty similar. NOTE: The above analysis is for unmatched pairs. SAS can also compute matched-pairs odds ratios; see SAS Institute (2005).

ATTRIBUTABLE FRACTION AND POPULATION ATTRIBUTABLE FRACTION – TWO ANSWERS TO A FOURTH QUESTION: WHAT IS THE IMPACT ON PUBLIC HEALTH?
While relative risk and odds ratio are useful statistics for assessing the strength of a relationship, an important consideration in determining causality, two other statistics are equally or more important for public health issues (Gordis, 2014). The attributable fraction statistic answers the question, "What fraction of the disease cases in the exposed population is attributable to risk factor X?" The formula for computation is:

(Prevalence of exposed group – Prevalence of unexposed group) / (Prevalence of exposed group)

If the prevalence of the exposed group is equal to the prevalence of the unexposed group, then the attributable fraction is zero. In other words, none of the disease cases can be attributed to the risk factor. Conversely, if the prevalence in the unexposed group is zero, then the attributable fraction is 1.0, i.e., 100% of the prevalence in the exposed group is due to the risk factor. Of course, the obtained fraction almost always falls between these two extremes.

The population attributable fraction statistic answers the question, "What fraction of the disease cases in the total population is attributable to risk factor X?" The computation formula is:

(Prevalence overall – Prevalence of unexposed group) / (Prevalence overall)

The attributable fraction can be high while the population attributable fraction is low, and the population attributable fraction should always be lower than the attributable fraction. Why is this? Because not everyone will be exposed to the risk factor.

SAS can be useful for both statistics on two fronts. First, directly, via the use of the STDRATE procedure, which computes both the attributable fraction and the population attributable fraction. Second, indirectly, by creating a data set that can be used as input to the procedure, in the likely case that you don't have a data set lying around with cases in

the exposed population, the total count of the exposed population, cases in the non-exposed population and the total count of the non-exposed population.
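Before turning to PROC STDRATE, here is a minimal sketch of what those two formulas work out to by hand, using the cell counts from Table 4 above. The data set and variable names are invented for illustration; the results should match (to rounding) the PROC STDRATE output shown later.

DATA af_check ;
   prev_exposed   = 302 / 2672 ;    /* prevalence of cancer deaths among smokers */
   prev_unexposed = 234 / 2501 ;    /* prevalence among non-smokers              */
   prev_overall   = 536 / 5173 ;    /* prevalence in the full sample             */
   af  = (prev_exposed - prev_unexposed) / prev_exposed ;   /* about .172 */
   paf = (prev_overall - prev_unexposed) / prev_overall ;   /* about .097 */
RUN ;

PROC PRINT DATA = af_check ;
RUN ;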

COMPUTING ATTRIBUTABLE FRACTION AND POPULATION ATTRIBUTABLE FRACTION USING SAS: PROC STDRATE
The code for computing the example attributable fraction and population attributable fraction is shown below. First, I created a data set, and if those numbers look familiar, it is because they are from Table 4 above. The events (cancer) in the exposed (smoking) group = 302 and the total count in that group = 2,672. The events in the non-exposed group = 234 and the total count for the non-exposed group = 2,501.

DATA std3 ;
   INPUT event_e count_e event_ne count_ne ;
   DATALINES ;
302 2672 234 2501
;

Now that we have the data, let's code the analysis. Surprisingly, it is not another PROC FREQ, but PROC STDRATE, which computes standardized rates and risks of various types. The code below is an example of one of the simplest analyses.

PROC STDRATE DATA = std3 REFDATA = std3 METHOD = INDIRECT(AF) STAT = RISK ;
   POPULATION EVENT = event_e TOTAL = count_e ;
   REFERENCE EVENT = event_ne TOTAL = count_ne ;
RUN ;

This example uses the same data set for the exposed data and the reference data, but it should be noted that separate data sets can be specified. The method used for standardization is indirect. If you're interested in the different types of standardization, I highly recommend the 2013 SAS Global Forum paper by Yuan (2013).

The POPULATION and REFERENCE statements are required to compute attributable fractions. The POPULATION statement requires two parameters: EVENT=, followed by a variable with the number of cases in the exposed group, and TOTAL=, followed by the total number of subjects in the exposed population. The REFERENCE statement also requires two parameters: EVENT=, followed by a variable with the number of cases in the non-exposed group, and TOTAL=, followed by the total number of subjects in the non-exposed population. Relevant results from the PROC STDRATE are shown in Tables 8 and 9 below.

Indirectly Standardized Risk Estimates

  Study Population    Observed Events                 302
                      Number of Observations          2672
                      Crude Risk                      0.1130
  Reference           Crude Risk                      0.0936
                      Expected Events                 249.999
  SMR                                                 1.2080
  Standardized Risk   Estimate                        0.1130
                      Standard Error                  0.00613
                      95% Normal Confidence Limits    0.1010   0.1250

Table 8: Partial Output from PROC STDRATE, Reference and Crude Risk

Note, once again, the risk of the study population is .1130 and of the reference population, .0936. The attributable fraction, as shown in Table 9, is .172 – in other words, 17.2% of the risk of cancer among smokers is attributable to smoking. If you recall from our first cross-tabulation, slightly over half of the population were smokers. Thus, as expected, the

population attributable risk is slightly over half the attributable risk for the exposed population – 9.7% of the risk in the population is attributable to smoking.

Attributable Fraction Estimates

                                     Estimate     95% Confidence Limits
  Attributable Risk                   0.17219       0.07381     0.25167
  Population Attributable Risk        0.09702       0.03446     0.15552

Table 9: Partial Output from PROC STDRATE, Attributable Risk and Population Attributable Risk

From a public health perspective, e.g., in examining potential benefits of an intervention, the distinction between attributable fraction and population attributable fraction is important. If a risk factor is uncommon, say, working in a coal mine, even if the attributable fraction is large, reducing the risk substantially may have little impact on the population risk.

STRATIFIED RISK ESTIMATES: CREATING A DATA SET FOR USE BY PROC STDRATE
If it seems as if that attributable risk is a little low, it might be because another analysis (not shown) found a substantial relationship between age group and smoking status in this sample. The younger subjects were much more likely to be smokers, but they were also less likely to die, because (Captain Obvious alert) older people are more likely to die. The solution, then, is to re-compute the attributable fraction and population attributable fraction stratified by age.

CREATING THE DATA SET OF FREQUENCIES STRATIFIED BY AGE
PROC STDRATE is great if you happen to have a data set with the number of cases in the exposed group, the total in the exposed group, the number of cases in the non-exposed group and the total in the non-exposed group, conveniently sorted by strata. Seriously, though, what is the probability you just have that lying around? You could do a PROC FREQ and then type in the data in a DATA step. Another alternative is to follow the steps below. There are actually more statements than strictly necessary, but I like things to be neat and tidy.

PROC FREQ DATA = attributable ;
   TABLES agegroup*smoker*cancer / OUT = freqcount ;
RUN ;

This creates an output data set with counts and percentages of agegroup (strata) by smoking status (risk) by cancer death (event) and outputs it to a data set named 'freqcount'. The records will be in ascending order by value, the default for PROC FREQ.

DATA freqcount ;
   SET freqcount ;
   DROP percent ;
RUN ;

The step above simply drops percent as a variable in the data set, since we don't need it.

PROC TRANSPOSE DATA = freqcount OUT = transf ;
   BY agegroup ;
RUN ;

The TRANSPOSE step above transposes the data set by age group so that each age group has four variables and outputs the results to a data set named 'transf'. The data set created is shown below.

Output Data 1: Data set created by TRANSPOSE procedure
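If you want to look at the transposed data set yourself, a quick PROC PRINT (my addition, not part of the original steps) will do; expect one record per age group, with the four cell counts in the columns COL1 through COL4.

PROC PRINT DATA = transf ;
RUN ;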

The default variable names are COL1 through COL4. Since "Yes" comes before "_No" in SAS sort order, the first two columns are for Smoker (Yes) when Cancer = "Yes" and when Cancer = "_No".

DATA std4 ;
   SET transf ;
   count_e = col1 + col2 ;
   count_ne = col3 + col4 ;
   RENAME col1 = event_e col3 = event_ne ;
RUN ;

The final step above creates the count variables needed for PROC STDRATE. Since col1 and col2 are the number of smokers who did die of cancer (cancer = "Yes") and the number of smokers who did not die of cancer (cancer = "_No"), the sum of these is the count of smokers, i.e., the count of exposed. Similarly, the sum of col3 and col4 is the count of non-exposed. The PROC STDRATE below has two additional statements compared with the PROC STDRATE above, one required and one optional.

PROC STDRATE DATA = std4 REFDATA = std4 METHOD = INDIRECT(AF) STAT = RISK
   PLOTS(STRATUM = HORIZONTAL) ;
   POPULATION EVENT = event_e TOTAL = count_e ;
   REFERENCE EVENT = event_ne TOTAL = count_ne ;
   STRATA agegroup / STATS ;
RUN ;

The STRATA statement stratifies by the designated variable, in this case, age group.


Indirectly Standardized Risk Estimates

  Study Population    Observed Events                 302
                      Number of Observations          2672
                      Crude Risk                      0.1130
  Reference           Crude Risk                      0.0936
                      Expected Events                 201.632
  SMR                                                 1.4978
  Standardized Risk   Estimate                        0.1401
                      Standard Error                  0.00755
                      95% Normal Confidence Limits    0.1253   0.1549

Table 10: Partial Output from PROC STDRATE, Reference and Crude Risk with STRATA statement

A close comparison of Table 10 and Table 8 finds some similarities and some differences. The observed events, number of observations and crude risk in the study population are all the same, as is the crude risk for the reference population. However, the expected events and the standardized mortality ratio (SMR) have changed. Given the stratification by age, if smokers and non-smokers were distributed equally across age groups, about 202 deaths would have been expected, not the 302 observed in the study population.

BONUS STATISTIC: STANDARDIZED MORTALITY RATIO
The standardized mortality ratio is:

SMR = (Observed Number of Deaths Per Year) / (Expected Number of Deaths Per Year)
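As a quick check of that formula, the SMR reported in Table 10 can be reproduced from the observed and expected events shown there; the data set and variable names below are invented for illustration.

DATA smr_check ;
   observed = 302 ;       /* observed cancer deaths among smokers               */
   expected = 201.632 ;   /* age-adjusted expected deaths, from Table 10 above  */
   smr = observed / expected ;   /* = 1.4978 */
RUN ;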

Attributable Fraction Estimates

                                     Estimate     95% Confidence Limits
  Attributable Risk                   0.33235       0.25356     0.39609
  Population Attributable Risk        0.18725       0.12531     0.24481

Table 11: Partial Output from PROC STDRATE, Attributable Risk and Population Attributable Risk with STRATA statement

Taking into account the stratification by age, that is, controlling for the fact that smokers in this sample were younger than non-smokers, the attributable risk estimate is now 33.2% and the population attributable risk 18.7%.

The PLOTS option produces plots of the crude estimate of the risk by strata, with the reference group risk as a single line. If you look at the graph below you can see several useful measures. First, the blue dots are the risk estimates for the exposed group at each age group and the vertical blue bars represent the 95% confidence limits for that risk. The red crosses are the risk for the reference group at each age group. The horizontal, solid blue

line is the crude estimate for the study group, i.e., smokers, and the dashed, red line is the crude estimate of risk for the reference group, in this case, the non-smokers.

[Figure: "Strata Risks with 95% Normal Confidence Limits" – crude risk by agegroup (28-35, 36-45, 46-55, 56-62) for the study and reference groups, with horizontal lines for the overall crude estimates.]

Figure 1: Graph from PROC STDRATE, PLOTS option, of Risks by Strata

Several observations can be made at a glance:
1. The crude risk for non-smokers is lower than for smokers.
2. As expected, the younger age groups are below the overall risk of mortality from cancer.
3. At every age group, the risk is lower for the non-exposed group.
4. The differences between exposed and non-exposed are significant for the two younger age groups only; for the other two groups, the non-smokers, although having a lower risk, fall within the 95% confidence limits for the exposed group.

NUMBER NEEDED TO TREAT
Let's move away from risk factors to protective factors. While it is natural to feel that no expense is too much to save a life, the fact is that resources are finite and policy makers may need to evaluate the relative benefits of treatments. How many people must be treated with a medication or other intervention to save a single life or prevent one adverse event? If there is a PROC NNT, I've never found it. Fortunately, computing the number needed to treat is simple. The formula is:

Number needed to treat = 1 / (Rate of untreated group – Rate of treated group)
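For example, here is a minimal sketch with made-up rates (not taken from any data set in this paper): if 15% of untreated patients and 10% of treated patients experience the adverse event, the number needed to treat is 1/.05 = 20.

DATA nnt_example ;
   rate_untreated = 0.15 ;   /* hypothetical adverse event rate, untreated group */
   rate_treated   = 0.10 ;   /* hypothetical adverse event rate, treated group   */
   nnt = 1 / (rate_untreated - rate_treated) ;   /* = 20 */
RUN ;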

One simple way to compute the number needed to treat using SAS is to use a FREQ procedure to create an output data set containing the rates for the untreated group and the treated group from a cross-tabulation, and then use a calculator to plug the rate of the untreated group and the rate of the treated group into the equation above, but what fun would that be? The following code yields both the cross-tabulation and the number needed to treat, nicely formatted.

Tip: If you code the untreated as "_N" or "_No" and treated as "Y" or "Yes", the rate for untreated will always be COL2 and the rate for treated will be COL4.

In the code below, the LABEL, VAR and ID statements are unnecessary, as is the SPLIT option in the PRINT statement. These all simply make the final result look nice. The OUT= and OUTPCT options are required. OUT= creates a data set and OUTPCT specifies that the data set will include percentages.

PROC FREQ DATA = nnt ;
   TABLES treated*disease / OUT = nnt2 OUTPCT ;
RUN ;

Output Data 2: Data set created by PROC FREQ with OUTPCT option

The following code transposes the data set, computes the number needed to treat in the DATA step, and prints it nicely formatted in the PRINT step.

PROC TRANSPOSE DATA = nnt2 OUT = nnt3 ;
RUN ;

DATA ans ;
   SET nnt3 ;
   IF _name_ = "PCT_ROW" THEN
      numn = 1 / (col2/(col1 + col2) - col4/(col3 + col4)) ;
   LABEL numn = "Number Needed/To Treat" ;
RUN ;

PROC PRINT DATA = ans SPLIT = "/" ;
   ID numn ;
   VAR ;
   WHERE numn NE . ;
RUN ;

  Number Needed
       To Treat
             20

Table 12: Number Needed to Treat from FREQ, TRANSPOSE, DATA and PRINT steps

FISHER'S EXACT TEST – FOR SMALL SAMPLE SIZES
These next two statistics apply to some special cases where the typical chi-square just won't do. You have collected data and have a simple design, a control group and a treatment group. Your variable of interest, though, is low incidence and your SAS printout cautions you:

"WARNING: 50% of the cells have expected counts less than 5. Chi-Square may not be a valid test."

What do you do now? Answer: Use Fisher's exact test.

How do you get it? If your analysis involves a 2 x 2 table, simply perform a chi-square analysis and look at the third table in your output. For 2 x 2 chi-square analyses, SAS automatically produces a Fisher's exact test. In the example below, the code requests a chi-square analysis for very low birthweight – less than 1 pound – by marital status of the mother.

PROC FREQ DATA = vlow ;
   TABLES pnd*married / CHISQ ;
RUN ;

Table of pnd by married (Married Mother)
(cell entries: Frequency / Percent / Row Pct / Col Pct)

                married = 0    married = 1      Total
  pnd = 0          1452           3545           4997
                  29.04          70.90          99.94
                  29.06          70.94
                  99.86          99.97
  pnd = 1             2              1              3
                   0.04           0.02           0.06
                  66.67          33.33
                   0.14           0.03
  Total            1454           3546           5000
                  29.08          70.92         100.00

Table 13: Result of PROC FREQ with 50% of cells having expected counts less than 5

As can be seen in the table above, 50% of the cells have an expected count less than 5. Rather than interpret the chi-square, skip over that table and look at the next table SAS produced.

Fisher's Exact Test

  Cell (1,1) Frequency (F)      1452
  Left-sided Pr <= F          0.2045
  Right-sided Pr >= F         0.9754

  Table Probability (P)       0.1799
  Two-sided Pr <= P           0.2045

Table 14: Fisher's Exact Test produced by PROC FREQ with CHISQ option

Although twice as many extremely low birthweight babies were born to unmarried mothers, you can see from the Fisher's exact test result that this difference is not statistically significant.

How to get a Fisher's Exact Test with more than two categories
It's great that SAS automatically produces a Fisher's exact test for 2 x 2 tables, but what if you have more than two categories? What if your birthweight variable is "very low", "low" and "normal"? Simple. Just replace the CHISQ in your TABLES statement with FISHER. You'll still get the same cross-tabulation and chi-square, plus a table that gives the exact probability of this table.

PROC FREQ DATA = vlow ;
   TABLES bwt*married / FISHER ;
RUN ;

Fisher's Exact Test

  Table Probability (P)    0.0004
  Pr <= P                  0.0051

Table 14: Fisher's Exact Test produced by PROC FREQ with FISHER option with > 2 categories
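If you prefer, the EXACT statement is another way to request the same test; the sketch below is my addition, not part of the original example. For large tables the exact computation can take a while, and the MAXTIME= option on the EXACT statement will cap the run time.

PROC FREQ DATA = vlow ;
   TABLES bwt*married ;
   EXACT FISHER ;   /* requests Fisher's exact test via the EXACT statement */
RUN ;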

MCNEMAR (AND, FOR THAT MATTER, KAPPA) FOR MATCHED PAIRS
Chi-square is an extremely useful test for a wide variety of situations (and so generic that it is not reviewed here), but it has a few assumptions. One, addressed above, is that there is an expected count of at least five per cell. A second assumption is that the data represent independent observations. In many biomedical applications, the latter assumption is violated. The same subjects are measured pre- and post-treatment for symptoms. The same test results are read by two raters. The same statement in PROC FREQ will address both of these situations.

McNemar Test of Marginal Homogeneity
Suppose you have a drug that has a side effect of nausea for some patients. You hypothesize that eating before taking the drug will reduce these effects. For a sample of 200 patients, you record whether your patients experience nausea when taking the drug. Then, for the next administration, you require them to eat a small meal within five minutes of taking the medication. You record any indications of nausea again. Your hypothesis is that the distribution of side effects (nausea present or absent) will differ based on whether or not they ate food with the medication. That is, the marginal probabilities will be different.

PROC FREQ DATA = med_eff ;
   TABLES drug1*drug2 ;
   TEST AGREE ;
RUN ;

Table of drug1 by drug2
(cell entries: Frequency / Percent / Row Pct / Col Pct)

                   drug2 = nausea   drug2 = normal     Total
  drug1 = nausea          15               6              21
                        7.50            3.00           10.50
                       71.43           28.57
                       68.18            3.37
  drug1 = normal           7             172             179
                        3.50           86.00           89.50
                        3.91           96.09
                       31.82           96.63
  Total                   22             178             200
                       11.00           89.00          100.00

Table 15: Result of PROC FREQ

Clearly, the above table is no different than the typical PROC FREQ output. However, the TEST statement with the AGREE keyword produces an additional table, which gives the probability of obtaining these results if the null hypothesis is true, that is, if the marginal probabilities are in fact equal. As can be seen below, this null hypothesis is not rejected.

McNemar's Test

  Statistic (S)    0.0769
  DF                    1
  Pr > S           0.7815

Table 16: McNemar Test produced by PROC FREQ with TEST statement

In case you were wondering, the equation for the test is very simple:

S = (B - C)^2 / (B + C)

where B and C are the two off-diagonal cells. In the example above, that is (6 - 7)^2 / (6 + 7) = 1/13.

Why Kappa results are completely different from McNemar
Because both are produced by using the same TEST AGREE statement, and both are used with matched pairs, it can be confusing when McNemar and Kappa give dramatically different results. This confusion can be cleared up, though, when one realizes that these are two different tests with different hypotheses that are computed completely differently. While McNemar tests the hypothesis that the marginal probabilities are the same, Kappa tests the hypothesis that the agreement observed is greater than the agreement expected by chance, and it is not nearly as simple a computation (although not really all that difficult, either).

Suppose that, rather than the same subjects being assessed under two conditions, the design involved two raters assessing the same subjects – say, the exact same reports of symptoms were rated by two nurse practitioners as either indicating nausea or not. The same PROC FREQ would be used with the same TEST statement. The only addition is an optional ODS GRAPHICS ON statement and a required KAPPA option on the TABLES statement. Actually, leaving off the TEST statement has no effect; the same results will be produced.

ODS GRAPHICS ON ;

PROC FREQ DATA = med_eff ;
   TABLES drug1*drug2 / KAPPA ;
   TEST AGREE ;
RUN ;

Results from the Kappa test are shown in Table 17, with a Kappa coefficient of .66 indicating a moderate degree of agreement.

Simple Kappa Coefficient

  Kappa                    0.6613
  ASE                      0.0873
  95% Lower Conf Limit     0.4901
  95% Upper Conf Limit     0.8324

Table 17: Kappa coefficient produced by PROC FREQ with KAPPA option
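To see where that .6613 comes from, here is a minimal sketch computing the simple kappa by hand from the counts in Table 15; the data set and variable names are invented for illustration. Observed agreement is the proportion on the diagonal, and expected agreement comes from the row and column margins.

DATA kappa_check ;
   n = 200 ;
   observed_agree = (15 + 172) / n ;               /* proportion on the diagonal = .935 */
   expected_agree = (21*22 + 179*178) / (n*n) ;    /* chance agreement from the margins */
   kappa = (observed_agree - expected_agree) / (1 - expected_agree) ;   /* = .6613 */
RUN ;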

As can be seen from the final table produced by this procedure, the null hypothesis is rejected.

Test of H0: Kappa = 0

  ASE under H0          0.0707
  Z                     9.3551
  One-sided Pr > Z      <.0001
  Two-sided Pr > |Z|    <.0001

Table 18: Test of Kappa = 0, produced by PROC FREQ with KAPPA option

ODS Statistical Graphics
Kappa plots can be produced specifically by a PLOTS=KAPPAPLOT option on a TABLES statement, or, as in this example, by the use of an ODS GRAPHICS ON statement. ODS GRAPHICS ON will not only produce the Kappa plot, shown below, for this particular procedure, but will also continue to produce graphics as well as tabular output for all of the statistical procedures in your program until you turn it off by specifying ODS GRAPHICS OFF. Although discussing all of the statistical graphics is far beyond the scope of this paper, the reader is strongly encouraged to examine some of the vast number of graphs produced either by default when ODS GRAPHICS ON is specified or by specific request, as in the STDRATE procedure above.

As with the risk plots above, the Kappa plot illustrates multiple statistics simultaneously. The size of each square indicates frequency, so it can be seen at a glance that normal ratings far outnumber ratings of nausea. The darker areas indicate agreement, and it is also clear that there is more agreement on normal ratings than on abnormal (nausea) ratings. While relatively old in software terms, the paper by Rodriguez (2004) gives an overview of a wide range of statistical graphics commonly used and is highly recommended.

[Figure: "Agreement of drug1 and drug2" – agreement plot of drug1 by drug2 (nausea, normal); square size shows frequency, with shading distinguishing exact and partial agreement.]

Figure 2: Graph from PROC FREQ with Kappa option, produced with ODS GRAPHICS ON

CONCLUSION
It is impossible to know "all of SAS" and anyone who claims to possess such all-encompassing knowledge is either severely misguided or a pathological liar of such proportion that a career change is recommended. However, by identifying the broad questions of interest in public health and biomedical research, the biostatistician can quickly become adept in using procedures that address common issues. To further expand his or her tool kit and become an even more invaluable member of the team, two additional recommendations are:
1. Become familiar with SAS procedures for data manipulation to create data sets for analysis. As we have seen, PROC TRANSPOSE is your friend.
2. Take advantage of SAS statistical graphics. In presenting statistical results and conclusions to non-technical audiences, a picture is truly worth a thousand words.

REFERENCES

Center for Disease Control (2014). User Guide to the 2014 Public Use Natality Data. ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/DVS/natality/UserGuide2014.pdf

Gordis, L. (2014). Epidemiology (5th ed.). Philadelphia, PA: Saunders, Elsevier.

Rodriguez, R. (2004). An Introduction to ODS for Statistical Graphics in SAS 9.1. http://www2.sas.com/proceedings/sugi29/204-29.pdf

SAS Institute (2005). Usage Note 23127: Estimating the odds ratio for matched pairs data with binary response. http://support.sas.com/kb/23/127.html

SAS Institute (2015). Usage Note 24170: Estimating sensitivity, specificity, positive and negative predictive values, and other statistics. http://support.sas.com/kb/24/170.html

World Health Organization (2016). Genes and human disease. http://www.who.int/genomics/public/geneticdiseases/en/index1.html

Yuan, Y. (2013). Computing direct and indirect standardized rates and risks with the STDRATE procedure. Proceedings of the SAS Global Forum 2013. https://support.sas.com/resources/papers/proceedings13/423-2013.pdf

CONTACT
Your comments and questions are valued and encouraged. Contact the author at:

AnnMaria De Mars
The Julia Group
1502 Broadway, Ste. 303
Santa Monica, CA 90404
(310) 717-9089
[email protected]
www.thejuliagroup.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
