Introduction to Biostatistics
Jie Yang, Ph.D.
Associate Professor, Department of Family, Population and Preventive Medicine
Director, Biostatistical Consulting Core
Director, Biostatistics and Bioinformatics Shared Resource, Stony Brook Cancer Center

In collaboration with the Clinical Translational Science Center (CTSC) and the Biostatistics and Bioinformatics Shared Resource (BB-SR), Stony Brook Cancer Center (SBCC).

OUTLINE
What is biostatistics
What does a biostatistician do
• Experiment design, clinical trial design
• Descriptive and inferential analysis
• Result interpretation
What you should bring when consulting with a biostatistician

WHAT IS BIOSTATISTICS
• The science of (bio)statistics encompasses
  – the design of biological/clinical experiments
  – the collection, summarization, and analysis of data from those experiments
  – the interpretation of, and inference from, the results
How to Lie with Statistics (1954) by Darrell Huff. http://www.youtube.com/watch?v=PbODigCZqL8

GOAL OF STATISTICS
[Diagram: POPULATION and SAMPLE, linked by sampling and probability theory in one direction and by inference (inferential statistics) in the other; descriptive statistics apply to each.]
Population parameters: μ, σ, π, …
Sample statistics: X̄, s, p̂, …

PROPERTIES OF A “GOOD” SAMPLE
• Adequate sample size (statistical power)
• Random selection (representative)
Sampling techniques:
1. Simple random sampling
2. Stratified sampling
3. Systematic sampling
4. Cluster sampling
5. Convenience sampling

STUDY DESIGN

EXPERIMENT DESIGN
Completely Randomized Design (CRD): randomly assign the experimental units to the treatments.
Design with blocking: deals with a nuisance factor that has some effect on the response but is of no interest to the experimenter. Without blocking, large unexplained error reduces the power to detect treatment effects.
1. Randomized Complete Block Design (RCBD): one single blocking factor
2. Latin Square Design (two blocking factors)
3. Crossover Design (each subject = blocking factor)
4. Balanced Incomplete Block Design

EXPERIMENT DESIGN
Factorial design: similar to a randomized block design, but allows testing the interaction between two treatment effects. A significant interaction between A and B means:
• the effect of A differs across the levels of B (equivalently, the effect of B differs across the levels of A);
• it is not very sensible to talk about the main effects of A and B on their own.
Experiment with random factors: randomly select n of the possible levels of the factor of interest. Random factors are typically categorical.
Split-plot design: confounding a main effect with blocks.

EVIDENCE PYRAMID
[Figure: evidence pyramid.] Source: IMPACT Observatory: tracking the evolution of clinical trial data sharing and research integrity - Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/Evidence-pyramid_fig1_309019368 [accessed 14 Jan, 2019]

WHAT A STATISTICIAN CAN HELP WITH DURING THE STUDY DESIGN PHASE
• Blinding/masking and randomization
• The number and combination of experimental interventions
• The timing of measurements or visits
• Whether to collect information on a larger sample or on the same sample over time
• Ways to maximize the efficient use of the available resources
• Even data management: how measures are coded and what is computerized directly affect the ease, and even the feasibility, of subsequent analysis

How a biostatistician analyzes data

TYPE OF DATA
1. Nominal data: unordered categories or classes, e.g. gender, blood type, transplant type
2. Ordinal data: order among categories is important, e.g. disease severity, AE level
3. Discrete data: both ordering and magnitude are important; often integers or counts, with no intermediate values possible, e.g. number of accidents within a month, number of kids in a family
4. Continuous data: the difference between two possible data values can be arbitrarily small, e.g. height, weight, body temperature, serum level, BP
5. Time-to-event data: censoring is present, e.g.
overall survival

DESCRIPTIVE STATISTICS
• The general goal is to describe the distribution of a single variable (center, spread, shape, functional form)
• Helpful for checking data and assumptions
• Stratified (by group) analysis can be done for groups of interest
• Values and comparisons can be visualized and “estimated”, but descriptive statistics alone provide no information about our level of confidence in conclusions

DESCRIPTIVE STATISTICS
1. Measures of central tendency
Mean: the average
Median: the 50th percentile point
Mode: the value that occurs most frequently; a distribution can be unimodal or multimodal

DESCRIPTIVE STATISTICS
• Reporting a measure of center gives only partial information about a data set.
  – Example: Consider the following three datasets:
    Dataset 1: 4 5 5 5 6
    Dataset 2: 1 3 5 7 9
    Dataset 3: 1 5 5 5 9
    All three datasets have identical means and medians (both equal 5), but Datasets 2 and 3 are more variable than Dataset 1.
• It is also important to describe the spread of values about the center.

DESCRIPTIVE STATISTICS
2. Measures of variability
• Range = Max - Min
• Interquartile range (IQR) = Q3 - Q1
• Variance, sample variance
• Standard deviation, sample standard deviation

IDENTIFYING POTENTIAL OUTLIERS
• An outlying value is a value X such that X > Q3 + 1.5(IQR) or X < Q1 - 1.5(IQR)
• An extreme outlying value is a value X such that X > Q3 + 3(IQR) or X < Q1 - 3(IQR)

EFFECTS OF OUTLIERS
• The median and IQR are generally unaffected by the removal of outliers, but minor changes are possible.
• The mean and standard deviation will be affected by the outlying values.
• The apparent shape of the distribution can also be affected by outlying values.
• One should never simply remove data values from a dataset. In practice, if the outliers are not errors, a sensitivity analysis will often be conducted or robust statistical methods will be used.

WAYS OF PRESENTING DATA
• Summary table
• Bar/Pie chart
• Histogram
• Scatter plot
• Boxplot
  1. Outlier
  2. Extreme outlier
  3. Modified boxplot

SUMMARY TABLE
1. By one variable

side | N | Mean | SD | Median | Min | Max
left | 14 | 18.83 | 6.04 | 18.25 | 8.00 | 30.10
right | 14 | 18.61 | 5.48 | 17.75 | 8.80 | 28.21

Variable | N_missing | Level | Total (N=25) | Case (N=12) | Control (N=13)
Cancer | 0 | No | 24 (96.00%) | 11 (91.67%) | 13 (100.00%)
Cancer | 0 | Yes | 1 (4.00%) | 1 (8.33%) | 0 (0.00%)

2. By multiple categorical variables

Location of Tumor | Radiation Sequence with Surgery | Before 2002 | After 2002
Lower (n = 972) | Preoperative | 107 (19.21%) | 65 (15.66%)
Lower (n = 972) | Postoperative | 450 (80.79%) | 350 (84.34%)
Upper (n = 283) | Preoperative | 20 (13.16%) | 21 (16.03%)
Upper (n = 283) | Postoperative | 132 (86.84%) | 110 (83.97%)

BAR CHART AND PIE CHART
[Bar chart: Bariatric surgeries, 2010-2013. Percentages by procedure (AGB, LSG, RYGB) for three series: ED revisit, Admitted from ED, Discharged from ED.]
[Pie chart: Diagnosis for Cholecystectomy patients, 2006-2013.]

HISTOGRAM AND SCATTER PLOT
[Scatter plot: DISC measurement by group. Axes: Right (horizontal) and Left (vertical), both 0 to 70. Groups: Control, No Surgery, Gamma-knife, Resection.]

BOX-PLOT
One continuous variable and one categorical variable

OTHER THEORETICAL DISTRIBUTIONS

Type of Outcome Variable | Theoretical Distribution
Continuous numeric | Normal, Log-normal, Exponential, …
Discrete numeric | Poisson, Negative Binomial, …
Binary | Bernoulli, Binomial, …
Categorical with multiple categories | Multinomial, Hypergeometric, …

CONFIDENCE INTERVALS
Population parameters: μ, σ, π, …
Sample statistics: X̄, s, p̂, …
• A point estimate alone is not enough: it gives us no way to judge how accurate it is as an estimator.
• A confidence interval provides a better estimate by combining the point estimate with its standard error to define a range of values that is likely to cover the true value of the parameter.
• A confidence interval starts with the point estimate and adds a “margin of error.”
• A confidence interval is defined as: point estimate ± margin of error.
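As a minimal sketch of "point estimate ± margin of error", the snippet below computes an interval for a mean. It assumes the normal-approximation margin of error z · s / √n with z = 1.96, and it uses the sample standard deviation s in place of the unknown σ. The function name `mean_confidence_interval` is illustrative and not part of the original slides. The data are Dataset 2 from the descriptive statistics example. With only five values a t quantile would usually be preferred over 1.96, so the result is for illustration only.

```python
import math
import statistics


def mean_confidence_interval(values, z=1.96):
    """Return (point_estimate, lower, upper) for the population mean.

    Assumption: margin of error = z * s / sqrt(n), where the sample
    standard deviation s stands in for the unknown sigma.
    """
    n = len(values)
    point_estimate = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    margin_of_error = z * standard_error
    return (point_estimate,
            point_estimate - margin_of_error,
            point_estimate + margin_of_error)


# Dataset 2 from the descriptive statistics example.
dataset_2 = [1, 3, 5, 7, 9]
estimate, lower, upper = mean_confidence_interval(dataset_2)
print(f"point estimate = {estimate:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

A wider interval signals a less precise estimate: the margin of error grows with the spread of the data and shrinks as the sample size increases.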
CONFIDENCE INTERVALS
95% CI for μ: P(-?? < μ < ??) = 0.95
Since, by the central limit theorem,
$\bar{x} \sim N(\mu_{\bar{x}}, \sigma_{\bar{x}}^2)$, with $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \sigma/\sqrt{n}$,
$P\left(-1.96 \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95$
$P\left(\bar{x} - 1.96 \cdot \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96 \cdot \frac{\sigma}{\sqrt{n}}\right) = 0.95$

CONFIDENCE INTERVALS
95% Confidence Interval (CI) for μ: $\bar{x} \pm 1.96 \cdot \frac{\sigma}{\sqrt{n}}$
Interpretation 1: You can be 95% sure that the true mean (μ) will fall within the upper and lower bounds.
Interpretation 2: 95% of the intervals constructed using sample means (x̄) will contain the true population mean (μ).
100(1-α)% CI: $\bar{x} \pm Z_{1-\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$
A good link for simulation of CI: http://www.ruf.rice.edu/~lane/stat_sim/conf_interval/index.html
The Cartoon Guide to Statistics by Gonick and Smith

HYPOTHESIS TESTING
• Using data to test specific hypotheses
• Making decisions based on probability (instead of subjective impressions)
• A distribution is usually assumed
• Methods that require no distributional assumptions are called non-parametric or distribution-free

WHAT IS A HYPOTHESIS
• A well-formulated hypothesis will be both quantifiable and testable; that is, it will involve measurable quantities or refer to items that may be assigned to mutually exclusive categories.
• It takes one of two forms:
  “Some measurable characteristic of a population takes one of a specific set of values”
  “Some measurable characteristic takes different values in different populations; the difference has a specific pattern or a specific set of values”

EXAMPLES
This new drug will lower diastolic blood pressure.
For males over 40 suffering from chronic hypertension, a 100mg daily dose of this new drug will lower diastolic blood pressure an average of 10 mm Hg.
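The second example is testable because it names a population, a dose, and a size of effect. The sketch below shows one way such a statement could be turned into a number: it measures how far an observed mean reduction lies from the hypothesized 10 mm Hg, in standard-error units. The reductions are hypothetical values made up for illustration, not trial data, and the function name `one_sample_t_statistic` is an assumption of this sketch.

```python
import math
import statistics


def one_sample_t_statistic(values, hypothesized_mean):
    """Return (sample mean - hypothesized mean) / (s / sqrt(n))."""
    n = len(values)
    sample_mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error


# Hypothetical reductions in diastolic blood pressure (mm Hg); not real data.
reductions = [8.0, 12.0, 9.0, 11.0, 6.0, 13.0, 7.0, 10.0]
t_stat = one_sample_t_statistic(reductions, hypothesized_mean=10.0)
print(f"mean reduction = {statistics.mean(reductions):.2f} mm Hg, t = {t_stat:.3f}")
```

A statistic near zero means the data agree with the stated 10 mm Hg; a large positive or negative value means they do not. The first example ("will lower diastolic blood pressure") gives no magnitude, so it offers no hypothesized value to compare against in this way.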
BASIC DEFINITIONS AND NOTATION
• The null hypothesis describes some aspect of the statistical behavior of a set of data and is denoted H0
• This description is treated as valid unless the actual behavior of the data contradicts this assumption
• The alternative hypothesis is generally the “opposite” of the null hypothesis and is denoted H1

HYPOTHESIS
• The null hypothesis is usually of the form
• The alternative will take on one of the following forms

ERRORS IN HYPOTHESIS TESTING

The facts | Decision: No Difference | Decision: Drug is Better
No Difference | Correct | Type I error: manufacturer wastes money developing an ineffective drug
Drug is Better | Type II error: manufacturer misses opportunity for profit; public denied access to effective treatment | Correct

There is a trade-off between the type I error rate, α, and the type II error rate, β.

P-VALUE
Definition: the probability of obtaining a test statistic as extreme as or more extreme than the one actually observed, given that the null hypothesis is true.
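To make the definition concrete, here is a minimal sketch that converts a test statistic into a p-value. It assumes the statistic follows a standard normal distribution when the null hypothesis is true, which the slides do not specify. The function names are illustrative only.

```python
from statistics import NormalDist

# Assumed null distribution of the test statistic: standard normal.
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def two_sided_p_value(z):
    """P(|Z| >= |z|) under the null: extreme in either direction."""
    return 2.0 * (1.0 - STANDARD_NORMAL.cdf(abs(z)))


def upper_tail_p_value(z):
    """P(Z >= z) under the null: extreme in the positive direction only."""
    return 1.0 - STANDARD_NORMAL.cdf(z)


for z in (0.5, 1.96, 3.0):
    print(f"z = {z:4.2f}: two-sided p = {two_sided_p_value(z):.4f}, "
          f"upper-tail p = {upper_tail_p_value(z):.4f}")
```

Which tail counts as "more extreme" depends on the form of the alternative hypothesis, which is why the sketch offers both a two-sided and a one-sided version.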