Statistical Methods I

Tamekia L. Jones, Ph.D. ([email protected])
Research Assistant Professor
Children's Oncology Group Statistics & Data Center
Department of Biostatistics
Colleges of Medicine and Public Health & Health Professions

Outline of Topics
I. Descriptive Statistics
II. Hypothesis Testing
III. Parametric Statistical Tests
IV. Nonparametric Statistical Tests
V. Correlation and Regression

Types of Data
• Nominal Data
  – Gender: Male, Female
• Ordinal Data
  – Strongly disagree, Disagree, Slightly disagree, Neutral, Slightly agree, Agree, Strongly agree
• Interval Data
  – Numeric data: Birth weight

Descriptive Statistics
• Descriptive statistical measurements are used in medical literature to summarize data or describe the attributes of a set of data
• Nominal data – summarize using rates/proportions
  – e.g., % males, % females on a clinical study
  – Rates/proportions can also be used for ordinal data

Descriptive Statistics (contd)
• Two parameters are used most frequently in clinical medicine:
  – Measures of Central Tendency
  – Measures of Dispersion

Measures of Central Tendency
• Summary statistics that describe the location of the center of a distribution of numerical or ordinal measurements, where a distribution consists of the values of a characteristic and the frequency of their occurrence
  – Example: Serum cholesterol levels (mmol/L): 6.8 5.1 6.1 4.4 5.0 7.1 5.5 3.8 4.4

Measures of Central Tendency (contd)
• Mean – used for numerical data and for symmetric distributions
• Median – used for ordinal data or for numerical data where the distribution is skewed
• Mode – used primarily for multimodal distributions

Measures of Central Tendency (contd)
Mean (Arithmetic Average)
• Sensitive to extreme observations
  – Replace 5.5 with, say, 12.0: the new mean = 54.7 / 9 = 6.08

Measures of Central Tendency (contd)
Median (Positional Average)
• Middle observation: half the values are less than and half the values are greater than this observation
• Order the observations from smallest to largest: 3.8 4.4 4.4 5.0 5.1 5.5 6.1 6.8 7.1
• Median = middle observation = 5.1
• Less sensitive to extreme observations
  – Replace 5.5 with, say, 12.0: the new median is still 5.1

Measures of Central Tendency (contd)
Mode
• The observation that occurs most frequently in the data
  – Example: 3.8 4.4 4.4 5.0 5.1 5.5 6.1 6.8 7.1  →  Mode = 4.4
  – Example: 3.8 4.4 4.4 5.0 5.1 5.5 6.1 6.1 7.1  →  Mode = 4.4 and 6.1
• Two modes – bimodal distribution

Measures of Central Tendency (contd)
Shape of the distribution
• Symmetric
• Skewed to the Left (Negative)
• Skewed to the Right (Positive)

Measures of Central Tendency (contd)
Which measure do I use? It depends on two factors:
1. Scale of measurement (ordinal or numerical), and
2. Shape of the distribution of observations
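The serum cholesterol example can be reproduced in a few lines of Python. This is a minimal sketch, not part of the original slides, using the standard-library statistics module (multimode requires Python 3.8+); the variable names are illustrative.

```python
import statistics

# Serum cholesterol levels (mmol/L) from the example above
cholesterol = [6.8, 5.1, 6.1, 4.4, 5.0, 7.1, 5.5, 3.8, 4.4]

print(statistics.mean(cholesterol))       # about 5.36 (reported as 5.35 above)
print(statistics.median(cholesterol))     # 5.1
print(statistics.multimode(cholesterol))  # [4.4]

# Replace 5.5 with an extreme value: the mean shifts, the median does not
extreme = [12.0 if x == 5.5 else x for x in cholesterol]
print(statistics.mean(extreme))           # 54.7 / 9, about 6.08
print(statistics.median(extreme))         # still 5.1
```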
Measures of Dispersion
• Measures that describe the spread or variation in the observations
• Common measures of dispersion:
  – Range
  – Standard Deviation
  – Coefficient of Variation
  – Percentiles
  – Interquartile Range

Measures of Dispersion (contd)
Range
• Difference between the largest and the smallest observation
• Used with numerical data to emphasize extreme values
  – Serum cholesterol example: Minimum = 3.8, Maximum = 7.1, Range = 7.1 – 3.8 = 3.3

Measures of Dispersion (contd)
Standard Deviation
• Measure of the spread of the observations about the mean (serum cholesterol example: mean = 5.35, n = 9)
• Used as a measure of dispersion when the mean is used to measure central tendency for symmetric numerical data
• Like the mean, the standard deviation requires numerical data
• Essential part of many statistical tests
• Variance = s²

Measures of Dispersion (contd)
The Normal (Gaussian) Distribution
If the observations have a bell-shaped distribution, then the following is always true:
• 67% of the observations lie between x̄ – 1s and x̄ + 1s
• 95% of the observations lie between x̄ – 2s and x̄ + 2s
• 99.7% of the observations lie between x̄ – 3s and x̄ + 3s

Measures of Dispersion (contd)
Coefficient of Variation
• Measure of the relative spread in data
• Used to compare variability between two numerical data sets measured on different scales
• Coefficient of Variation (C of V) = (s / mean) x 100%
• Example:

                                    Mean    Std Dev (s)    C of V
  Serum cholesterol (mmol/L)        5.35    1.126          21%
  Change in vessel diameter (mm)    0.12    0.29           241.7%

• The relative variation in change in vessel diameter is more than 10 times greater than that for serum cholesterol

e.g., DiMaio et al. evaluated the use of the test measuring maternal serum alpha-fetoprotein (for screening neural tube defects) in a prospective study of 34,000 women. Reproducibility of the test procedure was determined by repeating the assay 10 times in each of four pools of serum. The mean and s of the 10 assays were calculated for each of the 4 pools, and coefficients of variation were computed for each pool: 7.4%, 5.8%, 2.7%, and 2.4%. These values indicate relatively good reproducibility of the assay, because the variation, as measured by the standard deviation, is small relative to the mean. Hence readers of their article can be confident that the assay results were consistent.

Measures of Dispersion (contd)
Percentile
• A number that indicates the percentage of the distribution of data that is equal to or below that number
• Used to compare an individual value with a set of norms (independent of the shape of the distribution)
• Example – Standard physical growth chart for girls from birth to 36 months of age:
  – For girls 21 months of age, the 95th percentile of weight is 13.4 kg. That is, among 21-month-old girls, 95% weigh 13.4 kg or less, and only 5% weigh more than 13.4 kg.
• The 50th percentile is the median

Measures of Dispersion (contd)
Interquartile Range (IQR)
• Measure of variation that makes use of percentiles
• Difference between the 25th and 75th percentiles
• Contains the middle 50% of the observations
• Example – The IQR for weights of 12-month-old girls is the difference between 10.2 kg (75th percentile) and 8.8 kg (25th percentile); i.e., 50% of infant girls at 12 months weigh between 8.8 and 10.2 kg.
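The dispersion measures above can be computed with NumPy. This is an illustrative sketch, not part of the original slides; the serum cholesterol values come from the example, while the variable names and NumPy's default percentile interpolation are assumptions of the sketch.

```python
import numpy as np

cholesterol = np.array([6.8, 5.1, 6.1, 4.4, 5.0, 7.1, 5.5, 3.8, 4.4])

data_range = cholesterol.max() - cholesterol.min()   # 7.1 - 3.8 = 3.3

# Sample standard deviation: ddof=1 divides by n - 1
s = cholesterol.std(ddof=1)                          # about 1.126

# Coefficient of variation, expressed as a percentage
cv = s / cholesterol.mean() * 100                    # about 21%

# Percentiles and the interquartile range
q25, median, q75 = np.percentile(cholesterol, [25, 50, 75])
iqr = q75 - q25                                      # spans the middle 50% of the observations

print(data_range, round(s, 3), round(cv, 1), median, iqr)
```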
Hypothesis Testing
• Permits medical researchers to make generalizations about a population based on results obtained from a study
• Confirms (or refutes) the assertion that the observed findings did not occur by chance alone but are due to a true association between the dependent and independent variables
• The aim of the researcher is to demonstrate that the observed findings from a study are statistically significant

Hypothesis Testing (contd)
• Statistical Hypothesis – a statement about the value of a population parameter
• Null Hypothesis (Ho)
  – Usually the hypothesis that the researcher wants to gather evidence against
• Alternative (or Research) Hypothesis (Ha)
  – Usually the hypothesis for which the researcher wants to gather supporting evidence

Hypothesis Testing (contd)
Example: A researcher studied the relationship between smoking and lung cancer.

                   Lung Cancer
                   Present    Absent
  Smoker              A          B
  Non-Smoker          C          D

Ho: There is no difference between smokers and nonsmokers with respect to the risk of developing lung cancer. That is, the observed difference (in the sample), if any, is by chance alone.

Ha: There is a difference between smokers and nonsmokers with respect to the risk of developing lung cancer, and the observed difference (in the sample) is not by chance alone.

Conclusion: If the findings of the study are statistically significant, then reject Ho in favor of the alternative hypothesis Ha.

Hypothesis Testing (contd)
Test Statistic
• Statistics whose primary use is in testing hypotheses are called test statistics
• Hypothesis testing, thus, involves determining the value the test statistic must attain in order for the test to be declared significant
• The test statistic is computed from the data of the sample
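To make the test-statistic idea concrete, the sketch below computes a chi-square statistic and Fisher's exact p-value for a 2x2 table like the smoking example, using SciPy. It is an illustration, not part of the slides: the slides do not name a specific test, and the cell counts standing in for A, B, C, and D are made up.

```python
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical cell counts standing in for A, B, C, D (not given on the slides)
table = [[30, 70],   # smokers:     lung cancer present, absent
         [10, 90]]   # non-smokers: lung cancer present, absent

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.4f}")

# Fisher's exact test is often preferred when expected cell counts are small
_, p_exact = fisher_exact(table)
print(f"Fisher's exact p-value = {p_exact:.4f}")

# If the p-value is below 0.05, reject Ho at the 5% significance level
```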
Hypothesis Testing (contd)
Types of Errors

                       Truth
  Decision        Ho True         Ho False
  Accept Ho       Correct         Type II error
  Reject Ho       Type I error    Correct

• Type I Error
  – Rejecting the null hypothesis when it is true
  – If Ho is true in reality and the observed finding of a study is statistically significant, the decision to reject Ho is incorrect and an error has been made
• Type II Error
  – Failing to reject the null hypothesis when it is false
  – If Ho is false in reality and the observed finding of a study is not statistically significant, the decision to accept Ho is incorrect and an error has been made

Hypothesis Testing (contd)
• Alpha (α) = probability of Type I error; the significance level of the test
• Beta (β) = probability of Type II error
• Power of a test = 1 – β; the probability that a test detects differences that actually exist; 80% is typically used
• Level of Significance (p-value) in a study:
  – Probability of obtaining a result as extreme as or more extreme than the one observed, if the null hypothesis is true
  – Probability that the observed result is due to chance alone
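The interplay of α, β, and power can be illustrated by simulation. The sketch below is not part of the original lecture: it repeatedly draws two samples, runs a two-sample t-test at the 5% level, and estimates how often Ho is rejected when it is true (the Type I error rate) and when it is false (the power). The sample size, effect size, and number of simulations are arbitrary choices for the illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)          # seed for reproducibility
alpha, n_per_group, n_sims = 0.05, 30, 5000

def rejection_rate(true_difference):
    """Fraction of simulated studies whose p-value falls below alpha."""
    rejections = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(true_difference, 1.0, n_per_group)
        if ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    return rejections / n_sims

# When Ho is true, the rejection rate estimates alpha (about 0.05)
print("Type I error rate:", rejection_rate(0.0))

# When Ho is false (here a 0.75-SD difference), the rate estimates power (1 - beta)
print("Power:", rejection_rate(0.75))   # roughly 0.8 for n = 30 per group
```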