Statistics and Epidemiology
Slide 1
Statistics and Epidemiology
Review of Book 7
SEER Program Self Instructional Manual for Cancer Registrars
Cancer Information Management Program
Outcomes, Data Quality and Data Utilization Course
Statistics, Epidemiology and Data Utilization Module

When or where do you use statistics in your everyday operations? Annual reports, data requests? How about evaluating quality or productivity, cancer conference documentation, survey documentation, budgets…just to name a few? If you stop and think about it, you may see that you are already doing many of the statistical techniques discussed in Book 7. You may just not think about yourself in terms of being a “statistician”.

This presentation will provide a brief overview of the statistical and epidemiological methods introduced in Book 7: Statistics and Epidemiology for Cancer Registries. You may find it helpful to follow along in your book.

Slide 2
Statistics
Branch of mathematics
Collection, summarization, analysis, interpretation, and presentation of masses of numerical data

We will begin with statistics. Statistics is the science of gathering and interpreting facts and figures. Statistics can be further described as descriptive or inferential. Descriptive statistics will be discussed in more detail in the following slides.

Inferential statistics refer to the procedures used to make a statement about a population based upon the results of a sample. Inferential statistics is related to methods applied in hypothesis testing. Cancer registrars would usually seek assistance from a statistician or researcher before attempting to conduct a study using hypothesis testing.

Slide 3
Statistical Analysis
Summarize the essential features and relationships of the data
Reveal the major characteristics of the patient group
Determine broad patterns of behavior or tendencies

Statistical Analysis is the process of:
-summarizing data
-identifying the major characteristics of the patient group based on the data
-then using the data and those major characteristics to determine broad patterns of behavior of a group of people.

Slide 4
Descriptive Statistics
SEER Book 7, Section B
Descriptive statistics uses numerical summaries to describe an observed frequency distribution.

Study Hint: Have you ever read a definition of a term or phrase and had to look up the words contained in that definition? Try to think of the principles of statistics as stepping stones. For example, you need to know how to add and subtract before you can do algebra. It is important to have a clear understanding of the basic terminology. You will see these terms used again and again as we get into the more complicated side of statistics and epidemiology.

For example: Once you have a clear understanding of the term “mean”, the same principles apply to any other definition or phrase that includes the word “mean”. A mean tumor size or mean survival time is still based on calculating averages.

Slide 5
Shorthand notation
X = value of an observed measurement (X=…)
∑ = sum of the values of X
n = number (count) of observations in our group
n – 1 = degrees of freedom (the number of observations that are free to vary)
$\bar{X}$ = mean, average value
SD = standard deviation
√ = square root

This slide contains a reference key of symbols that will be used in the formulas on the upcoming slides. It is not necessary to memorize these symbols. The key is for your own use during your independent study time and for understanding the shorthand used in the workbook.

The calculation of variance introduces the term “Degrees of freedom”.
If you have 10 observations and the sum of the observations is 100, the first 9 (10-1) can be any number. The last number cannot. The last number has to be the number that, when added to the first nine, equals 100. Therefore, 9 of the numbers are free to vary.

Slide 6
How do we summarize a set of data?
Characterize a set of data in terms of:
1) Central values about which the data tend to cluster
2) The amount of spread or the dispersion of the observations

If you will remember, statistics is the collection and summation of masses of numerical data, so… If measurable characteristics among individuals did not vary, describing a set of data would be completed after the first observation.

Example: If everyone’s blood pressure were 110/70, there would be no need to take BP at each doctor visit. We will see this concept again when we look at the normal curve.

Slide 7
Measures of Central Tendency
Central values about which the data tend to cluster
“Typical” values
Example: Average tumor size

Measures of central tendency are important because measurable characteristics (such as age and stage) vary from individual to individual. Therefore, we need to summarize the data in order to analyze the results. Typical values = what you see most often.

The graphic at the bottom is a visual reminder. When you think of measures of central tendency, think about how the data cluster in the middle of the observations.

Slide 8
Measures of Central Tendency
Widely used measures of central tendency
Mean
Median
Mode

The tools we use to calculate the typical values are the 3 M’s…mean, median, and mode. To help describe mean, median, and mode, we will use an example of five tumor sizes.

Slide 9
Mean
Average ($\bar{X}$)
Influenced more by extreme values

$$\bar{X} = \frac{\text{Sum of all values}}{\text{\# of values}} = \frac{\sum X}{n} = \frac{8+5+3+6+3}{5} = \frac{25}{5} = 5$$

Mean = the statistical term for average.

Q: What is the mean in our example of tumor sizes?
A: Add all of the values together (25) and divide by the number of values (5). The mean is 5.

Q: What would happen if we added an extreme value (20) to our set of tumor sizes?
A: Add 20 to the list of tumor sizes. Mean = 45 / 6 = 7.5. As you can see, the mean is more influenced by extreme values than are the median and mode.

Note: It is not necessary to memorize the formulas in this presentation. They are provided only to help with demonstrating and understanding the use of the terms and definitions.

Slide 10
Median
Middle value
Sort the observations in order from smallest to largest
Stable measure
8 5 3 6 3 → 3 3 5 6 8

Median is the 50th percentile: ½ of the values are smaller and ½ are larger.

Q: What is the median in our example of tumor sizes?
A: First we have to sort them in order from smallest to largest (click), then take the middle value (click). Median is 5 (click).

Q: What would happen to our median if we added an extreme value? Let’s use 20 again.
A: Add 20 to the list of tumor sizes: 3, 3, 5, 6, 8, 20. The middle falls between 5 and 6. If the middle falls between two values, then average the two middle values. Median is now 5.5.

Stable measure: adding extreme values to a series of observations tends to cause only a limited change in the value of the median.

Slide 11
Mode
Most frequently seen value
3 3 5 6 8
There may be no modal value (3, 5, 6, 8)
There may be more than one modal value (3, 3, 5, 6, 6, 8)

The mode is the value that occurs most frequently. A distribution with two most-common values is called a bimodal distribution.

Slide 12
Measures of Variation
Amount of spread or dispersion of the observations
Example: Fluctuation of tumor sizes

Again, the graphic at the bottom is a visual reminder. When you think of measures of variation, think about how the data spread out along the observations.

Slide 13
Measures of Variation
Widely used measures of variation
Range
Standard Deviation (SD)

The tools we use to measure variation are range and standard deviation.
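
For independent study, the mean, median, and mode answers from Slides 9 to 11 can be checked with a few lines of Python. This is only a sketch using Python's standard `statistics` module; the variable names are illustrative and are not part of Book 7. One difference to keep in mind: `statistics.multimode` returns every value tied for most frequent, so for a list such as 3, 5, 6, 8 it returns all four values, whereas the slide describes that case as having no modal value.

```python
import statistics

# The five tumor sizes used throughout the slides.
tumor_sizes = [8, 5, 3, 6, 3]
# The same list with the extreme value (20) added.
with_extreme = tumor_sizes + [20]

# Mean: sum of all values divided by the number of values.
mean_base = statistics.mean(tumor_sizes)
mean_extreme = statistics.mean(with_extreme)

# Median: middle value of the sorted list; for an even count,
# the average of the two middle values.
median_base = statistics.median(tumor_sizes)
median_extreme = statistics.median(with_extreme)

# Mode: most frequent value(s). multimode returns a list of all ties.
modes_base = statistics.multimode(tumor_sizes)
modes_bimodal = statistics.multimode([3, 3, 5, 6, 6, 8])

print("mean:", mean_base, "->", mean_extreme)
print("median:", median_base, "->", median_extreme)
print("mode(s):", modes_base, "bimodal example:", modes_bimodal)
```

Adding 20 moves the mean from 5 to 7.5 but the median only from 5 to 5.5, which is the "stable measure" point made on Slide 10.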
The standard deviation is a companion to the mean.

Q: What type of measure is the mean?
A: Central Tendency (typical values, clustering in the middle)

Q: And, what type of measure is the standard deviation?
A: Variation (spread, dispersion)

So, the standard deviation expresses the spread of data about the mean. Make a mental note of this. You will see this concept again when we talk about the normal curve.

Slide 14
Range
Difference between the highest and lowest values
Easiest measure of variation
Greatly influenced by extreme values
Highest # - Lowest # = 8 – 3 = 5

Easiest measure…it’s just simple subtraction.

Q: What is the range in our example of tumor sizes?
A: The highest number was 8. The lowest number was 3. 8 – 3 = 5.

Q: What would happen to the range if we added an extreme value? Let’s use 20 again.
A: 20 – 3 = 17. As you can see, range is greatly influenced by extreme values.

Study Hint: Exam questions may not be worded exactly as they appear on the slides. A question may be asked with a slight twist, so to speak. We’ve talked mostly in terms of measures that are most influenced by extreme values. A good exam question that comes to mind is:

Q: Given a set of values, which is least likely to be influenced by an extreme value? Mean, Median, Mode, Range
A: Median

Slide 15
Standard Deviation
How far the observations tend to vary from the mean
Square root of variance

$$SD = \sqrt{\frac{\sum (X - \bar{X})^2}{n-1}} = \sqrt{\frac{18}{5-1}} = \sqrt{\frac{18}{4}} = \sqrt{4.5} = 2.12$$

Standard Deviation….sounds and looks complicated, but the calculation is based on fairly simple mathematics.

Q: First of all, how did we get 18 in the example above?
A: Remember, we are still using our example of 5 tumor sizes (8,5,3,6,3) and the symbols used here are from our reference key.
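
As a study aid, the range and standard deviation arithmetic from Slides 14 and 15 can be traced step by step in Python. This is a minimal sketch that applies the Slide 15 formula directly to the five tumor sizes; the variable names are illustrative only and do not come from Book 7.

```python
import math
import statistics

# The five tumor sizes used throughout the slides.
tumor_sizes = [8, 5, 3, 6, 3]
n = len(tumor_sizes)

# Range: highest value minus lowest value.
value_range = max(tumor_sizes) - min(tumor_sizes)
range_with_extreme = max(tumor_sizes + [20]) - min(tumor_sizes + [20])

# Mean (X-bar): sum of all values divided by n.
mean = sum(tumor_sizes) / n

# Squared deviation of each observation from the mean: (X - X-bar)^2.
squared_deviations = [(x - mean) ** 2 for x in tumor_sizes]

# Numerator of the SD formula: the sum of the squared deviations.
sum_sq = sum(squared_deviations)

# Variance divides by the degrees of freedom (n - 1); SD is its square root.
variance = sum_sq / (n - 1)
sd = math.sqrt(variance)

print("range:", value_range, "with 20 added:", range_with_extreme)
print("squared deviations:", squared_deviations)
print("sum of squared deviations:", sum_sq)
print("variance:", variance)
print("SD:", round(sd, 2))

# Cross-check against the standard library's sample standard deviation,
# which also divides by n - 1.
assert math.isclose(sd, statistics.stdev(tumor_sizes))
```

The squared deviations for 8, 5, 3, 6, 3 about the mean of 5 are 9, 0, 4, 1, and 4, which add up to the 18 shown in the numerator on Slide 15.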