5 Statistical and Epidemiological Issues 2
Statistical and Epidemiological issues with the Comet assay
Riga, 29 Jan - 1 Feb 2019

DESCRIPTIVE STATISTICS
Summarize, organize or reduce large numbers of observations. Sometimes called summary statistics.

What is a measure of Central Tendency?
• Numbers that describe what is average or typical of the distribution
• You can think of this value as where the middle of a distribution lies.

The Mode
• The category or score with the largest frequency (or percentage) in the distribution.
• The mode can be calculated for variables with levels of measurement that are: nominal, ordinal, or interval-ratio.

The Median
• The score that divides the distribution into two equal parts, so that half the cases are above it and half below it.
• The median is the middle score, or the average of the two middle scores, in a distribution.

Median Exercise #1 (N is odd)
Calculate the median for this hypothetical distribution:

Job Satisfaction    Frequency
1) Very High        2
2) High             3
3) Moderate         5
4) Low              7
5) Very Low         4
TOTAL               21

Ordered scores: 1 1 2 2 2 3 3 3 3 3 [4] 4 4 4 4 4 4 5 5 5 5
With N = 21 the middle score is the 11th one, which is 4, so the median category is "Low".

The Mean
• The arithmetic average obtained by adding up all the scores and dividing by the total number of scores.

Formula for the Mean
\[ \bar{Y} = \frac{\sum Y}{N} \]
"Y bar" equals the sum of all the scores, Y, divided by the number of scores, N.

[Figure: frequency distribution of variable V1 (values 1 to 10), with Mean = 4.6, Median = 5.0, Mode = 4.0]

Shape of the Distribution
• Symmetrical (mean is about equal to median)
• Skewed
  – Positively (example: income): mean > median
• Bimodal (two distinct modes)
• Multi-modal (more than 2 distinct modes)

Considerations for Choosing a Measure of Central Tendency
• For a nominal variable, the mode is the only measure that can be used.
• For ordinal variables, the mode and the median may be used. The median provides more information (taking into account the ranking of categories).
• For interval-ratio variables, the mode, median, and mean may all be calculated.
The mean provides the most information about the distribution, but the median is preferred if the distribution is skewed.

Measures of Dispersion
Measures of central tendency alone are not adequate to describe data: two data sets can have the same mean and still be entirely different. To describe data, one also needs to know the extent of variability, which is given by the measures of dispersion. Range, interquartile range, and standard deviation are the three commonly used measures of dispersion.

Range
The range is the difference between the largest and the smallest observation in the data. The prime advantage of this measure of dispersion is that it is easy to calculate. On the other hand, it has a lot of disadvantages: it is very sensitive to outliers and does not use all the observations in a data set.[1] It is more informative to report the minimum and the maximum values than to report the range alone.
\[ \text{Range} = x_{\max} - x_{\min} \]

INTERQUARTILE RANGE
The interquartile range is defined as the difference between the 75th and the 25th percentile (also called the third and first quartile). Hence the interquartile range describes the middle 50% of observations. A large interquartile range means that the middle 50% of observations are spaced wide apart. An important advantage of the interquartile range is that it can be used as a measure of variability even if the extreme values are not recorded exactly (as in the case of open-ended class intervals in a frequency distribution). Another advantage is that it is not affected by extreme values. Its main disadvantage as a measure of dispersion is that it is not amenable to mathematical manipulation.

STANDARD DEVIATION
Standard deviation (SD) is the most commonly used measure of dispersion. It is a measure of the spread of the data about the mean. For a sample, SD is the square root of the sum of squared deviations from the mean divided by the number of observations minus one (n - 1).
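As a minimal sketch, the measures of central tendency and dispersion described so far can be computed with Python's standard `statistics` module. The data are the 21 scores of the job-satisfaction exercise; the variable names are illustrative assumptions. Because job satisfaction is an ordinal variable, the mean and SD are shown only to illustrate the arithmetic, and the quartiles depend on the interpolation convention used (here the module's default).

```python
import statistics

# Scores from the job-satisfaction exercise (1 = Very High ... 5 = Very Low),
# expanded from the frequency table: 2, 3, 5, 7 and 4 cases, N = 21.
frequencies = {1: 2, 2: 3, 3: 5, 4: 7, 5: 4}
scores = [score for score, count in frequencies.items() for _ in range(count)]

# Measures of central tendency
mode = statistics.mode(scores)      # most frequent score
median = statistics.median(scores)  # middle (11th) score of the ordered data
mean = statistics.mean(scores)      # illustrative only: the variable is ordinal

# Measures of dispersion
data_range = max(scores) - min(scores)
# Quartile conventions differ between packages; this uses the module default.
q1, _q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1
sd = statistics.stdev(scores)       # sample SD: divides by n - 1

print(f"mode={mode}, median={median}, mean={mean:.2f}")
print(f"range={data_range}, IQR={iqr}, SD={sd:.2f}")
```

The mode and median both fall in category 4 ("Low"), while the mean is pulled slightly lower by the few cases in categories 1 and 2.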
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]

[Figure: observations x_i plotted around their mean \bar{x}, built up over four panels. Panel labels: \sum (x_i - \bar{x}) = 0; \sum (x_i - \bar{x})^2 = sum of squared deviations; \sum (x_i - \bar{x})^2 / N = variance]

Dividing by N, as in the figure, gives the variance of a whole population; for a sample, the formulas below divide by n - 1.

Variance
• A measure of dispersion among individual observations about their average value
• Computed before the standard deviation
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]

Standard deviation
• Another measure of dispersion
• For normally distributed data, 68% of observations should be within ± 1 standard deviation of the mean
• 95% will be within ± 1.96 standard deviations
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]

Standard error of the mean
The standard error of the mean (SEM) is the standard deviation of the sample mean. Because the sample mean is an unbiased estimator, the SEM can equivalently be understood as the standard deviation of the error in the sample mean with respect to the true population mean (or an estimate of that statistic).
\[ SE = \frac{s}{\sqrt{n}} \]

Statistical inference
Statistical inference is the process through which inferences about a population are made based on certain statistics calculated from a sample of data drawn from that population.
From: Principles and Practice of Clinical Research (Third Edition), 2012

Statistical inference is a data analysis technique used to deduce properties of an underlying probability distribution. Inferential statistical analysis uses hypothesis testing and estimation to infer properties of a population. It is assumed that the observed data set is sampled from a larger population.

What is "statistical significance" (p-value)?
The p-value represents a decreasing index of the reliability of a result.
The higher the p-value, the less we can believe that the observed relation between variables in the sample is a reliable indicator of the relation between the respective variables in the population.

Hypotheses
• H0 (null hypothesis) claims "no difference"
• Ha (alternative hypothesis) contradicts the null
• Example: we test whether an exposed population shows an increased level of DNA damage
  H0: no average increase in DNA damage in the population
  Ha: H0 is wrong (i.e., there was an increased level of DNA damage in the exposed population)

                                        Real world
Conclusion of the significance test     Null is true              Null is false
Null is true (H0 not rejected)          Correct decision          Type II error, β (fn)
Null is false (H0 rejected)             Type I error, α (fp)      Correct decision

Type I error
A type I error occurs when the null hypothesis (H0) is true but is rejected. It asserts something that is absent: a false hit. A type I error may be likened to a so-called false positive (a result that indicates that a given condition is present when it actually is not). The type I error rate, or significance level, is the probability of rejecting the null hypothesis given that it is true. It is denoted by the Greek letter α (alpha) and is also called the alpha level. Often the significance level is set to 0.05 (5%), implying that it is acceptable to have a 5% probability of incorrectly rejecting the null hypothesis.
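The meaning of the type I error rate can be illustrated with a small simulation. This is a sketch under stated assumptions, not part of the original slides: every sample is drawn from a population in which H0 is true (mean 0, known SD 1), a two-sided one-sample z test is applied, and the share of rejections at α = 0.05 is counted. The function name, sample size, number of experiments and seed are all hypothetical choices.

```python
import random
from statistics import NormalDist, fmean

ALPHA = 0.05          # significance level
N_PER_SAMPLE = 30     # observations per simulated experiment (assumption)
N_EXPERIMENTS = 5000  # number of simulated experiments (assumption)


def two_sided_p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value of a one-sample z test with known sigma."""
    standard_error = sigma / len(sample) ** 0.5
    z = (fmean(sample) - mu0) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


rng = random.Random(2019)  # fixed seed so the run is reproducible
false_positives = 0
for _ in range(N_EXPERIMENTS):
    # H0 is true by construction: the population mean really is 0.
    sample = [rng.gauss(0.0, 1.0) for _ in range(N_PER_SAMPLE)]
    if two_sided_p_value(sample) <= ALPHA:
        false_positives += 1  # H0 rejected although it is true

type_i_rate = false_positives / N_EXPERIMENTS
print(f"Share of true null hypotheses rejected: {type_i_rate:.3f}")
```

The observed share should lie close to α, which is what "a 5% probability of incorrectly rejecting the null hypothesis" means in practice.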
Significance Testing
• Also called "hypothesis testing"
• Objective: to test a claim about parameter μ
• Procedure:
  A. State hypotheses H0 and Ha
  B. Calculate test statistic
  C. Convert test statistic to P-value and interpret
  D. Consider significance level (optional)

Significance Level
• α ≡ threshold for "significance"
• We set α
• For example, if we choose α = 0.05, we require evidence so strong that it would occur no more than 5% of the time when H0 is true
• Decision rule
  P ≤ α ⇒ statistically significant evidence
  P > α ⇒ nonsignificant evidence
• For example, if we set α = 0.01, a P-value of 0.0006 is considered significant

In a two-tailed test, the rejection region for a significance level of α = 0.05 is split between both ends of the sampling distribution (2.5% in each tail) and makes up 5% of the area under the curve (white areas).
[Figure: sampling distribution with the two-tailed rejection region shown as white areas]

Hypothesis testing evaluates at which level of confidence H0 can be rejected
1. Choose the study hypothesis to be tested
   Example: 50 COPD patients with a new rehabilitation strategy have a much higher 6MWT (six-minute walk test) as compared to 50 patients on SoC (standard of care)
2. Select the proper statistics
3. Determine the distribution of the selected statistics
   Example: 6MWT has a normal distribution and there are 2 study groups = Student's t test
4. Perform statistical test in the study groups
5. Assess decision rules
6. Decide according to the test results
   • Reject H0 because H1 is (most likely) true
   • Do not reject H0: it may be true
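The COPD example can be sketched as a pooled-variance (Student's) two-sample t test using only the standard library. The 6MWT values below are simulated, hypothetical data, not study results; the group means, SD, seed and the function name `pooled_t_statistic` are assumptions made for illustration. The decision step uses a two-sided test at α = 0.05 with an approximate critical value of 1.984 for 98 degrees of freedom.

```python
import math
import random
from statistics import fmean, variance


def pooled_t_statistic(group_a, group_b):
    """Student's t statistic (equal variances assumed) and its degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled variance: weighted average of the two sample variances (n - 1 divisors).
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (fmean(group_a) - fmean(group_b)) / standard_error, df


# Hypothetical 6MWT distances in metres, simulated as normally distributed.
rng = random.Random(42)
rehab = [rng.gauss(420.0, 60.0) for _ in range(50)]  # new rehabilitation strategy
soc = [rng.gauss(350.0, 60.0) for _ in range(50)]    # standard of care

t_stat, df = pooled_t_statistic(rehab, soc)

# Approximate two-sided 5% critical value of the t distribution with 98 df.
T_CRITICAL = 1.984
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"t = {t_stat:.2f} with {df} degrees of freedom")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

With real data, the normality and equal-variance assumptions should be checked before relying on this test; a statistics package would also report the exact p-value instead of a comparison with a tabulated critical value.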