Introduction to Biostatistics

Jie Yang, Ph.D.

Associate Professor, Department of Family, Population and Preventive Medicine
Director, Biostatistical Consulting Core
Director, Biostatistics and Bioinformatics Shared Resource, Stony Brook Cancer Center

In collaboration with the Clinical Translational Science Center (CTSC) and the Biostatistics and Bioinformatics Shared Resource (BB-SR), Stony Brook Cancer Center (SBCC).

OUTLINE

What is Biostatistics
What does a biostatistician do
• Experiment design, clinical trial design
• Descriptive and Inferential analysis
• Result interpretation
What you should bring while consulting with a biostatistician

WHAT IS BIOSTATISTICS

The science of (bio)statistics encompasses
• the design of biological/clinical experiments
• the collection, summarization, and analysis of data from those experiments
• the interpretation of, and inference from, the results

How to Lie with Statistics (1954) by Darrell Huff. http://www.youtube.com/watch?v=PbODigCZqL8

GOAL OF STATISTICS

[Diagram: a SAMPLE is drawn from the POPULATION by sampling, and probability theory links the two. Descriptive statistics summarize each of them; inferential statistics use the sample statistics (x̄, s, p̂, …) to draw inference about the population parameters (μ, σ, π, …).]

PROPERTIES OF A “GOOD” SAMPLE

• Adequate sample size (statistical power)
• Random selection (representative)

Sampling Techniques:
1. Simple random sampling
2. Stratified sampling
3. Systematic sampling
4. Cluster sampling
5. Convenience sampling

STUDY DESIGN

EXPERIMENT DESIGN

Completely Randomized Design (CRD): randomly assign the experimental units to the treatments.

Design with Blocking: deals with a nuisance factor that has some effect on the response but is of no interest to the experimenter. Without blocking, the large unexplained error reduces the power to detect treatment effects.
1. Randomized Complete Block Design (RCBD): one single blocking factor
2. Latin Square Design: two blocking factors
3. Crossover Design: each subject is a blocking factor
4. Balanced Incomplete Block Design

EXPERIMENT DESIGN

Factorial Design: similar to a randomized block design, but it also allows testing the interaction between two treatment effects.

A significant interaction between A and B tells us that:
• the effect of A depends on the level of B (equivalently, the effect of B depends on the level of A);
• it is not very sensible to talk about the main effects of A and B on their own.

Experiment with random factors: randomly select n of the possible levels of the factor of interest. Typically random factors are categorical.

Split-plot Design: confounding a main effect with blocks.

EVIDENCE PYRAMID

[Figure: evidence pyramid.] Source: IMPACT Observatory: tracking the evolution of clinical trial data sharing and research integrity - Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/Evidence-pyramid_fig1_309019368 [accessed 14 Jan, 2019]

WHAT CAN A STATISTICIAN HELP WITH DURING THE STUDY DESIGN PHASE

• Blinding/masking and randomization
• The number and combination of experimental interventions
• The timing of measurements or visits
• Whether to collect information on a larger sample or on the same sample over time
• Ways to maximize the efficient use of the available resources
• Data management: how measures are coded and what is computerized directly affect the ease, and even the feasibility, of subsequent analysis

How a biostatistician analyzes data

TYPE OF DATA

1. Nominal data: unordered categories or classes, e.g. gender, blood type, transplant type
2. Ordinal data: order among categories is important, e.g. disease severity, AE level
3. Discrete data: both ordering and magnitude are important; often integers or counts, with no intermediate values possible, e.g. # of accidents within a month, # of kids in a family
4. Continuous data: the difference between two possible data values can be arbitrarily small, e.g. height, weight, body temperature, serum level, BP
5. Time-to-event data: censoring is present, e.g. overall survival

DESCRIPTIVE STATISTICS

General goal is to describe the distribution of a single variable (center, spread, shape, functional form)

Helpful for checking data and assumptions

Stratified (by group) analysis can be done for groups of interest

Values and comparisons can be visualized and “estimated”, but descriptive statistics alone will provide no information about our level of confidence in conclusions.

DESCRIPTIVE STATISTICS

1. Measure of central tendency
• Mean: average
• Median: the 50th percentile point (median value)
• Mode: value that occurs most frequently; a distribution can be unimodal or multimodal

DESCRIPTIVE STATISTICS

• Reporting a measure of center gives only partial information about a data set.
  – Example: Consider the following three datasets:
    Dataset 1: 4 5 5 5 6
    Dataset 2: 1 3 5 7 9
    Dataset 3: 1 5 5 5 9
    All three datasets have identical means and medians (both equal to 5). Datasets 2 and 3 are more variable than the first one.
• It is also important to describe the spread of values about the center.

DESCRIPTIVE STATISTICS
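The three example datasets (4 5 5 5 6; 1 3 5 7 9; 1 5 5 5 9) can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `summarize` is made up for illustration, and the standard deviation shown is the sample version (n - 1 denominator).

```python
import statistics

# The three datasets from the center-versus-spread example.
datasets = {
    "Dataset 1": [4, 5, 5, 5, 6],
    "Dataset 2": [1, 3, 5, 7, 9],
    "Dataset 3": [1, 5, 5, 5, 9],
}


def summarize(values):
    """Return measures of center and spread for one dataset (illustrative helper)."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "sd": statistics.stdev(values),  # sample SD, n - 1 in the denominator
    }


summaries = {name: summarize(values) for name, values in datasets.items()}

for name, s in summaries.items():
    print(f"{name}: mean={s['mean']}, median={s['median']}, "
          f"range={s['range']}, sd={s['sd']:.2f}")
```

All three datasets share a mean and median of 5, while the sample standard deviations are roughly 0.71, 3.16, and 2.83, so the center alone does not distinguish them.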

2. Measure of variability
• Range = Max - Min
• Inter-Quartile Range (IQR) = Q3 - Q1
• Variance, Sample Variance
• Standard Deviation, Sample Standard Deviation

IDENTIFYING POTENTIAL OUTLIERS

• An Outlying Value is a value, X, such that X > Q3 + 1.5(IQR) or X < Q1 - 1.5(IQR)
• An Extreme Outlying Value is a value, X, such that X > Q3 + 3(IQR) or X < Q1 - 3(IQR)

EFFECTS OF OUTLIERS
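As an illustration, the sketch below applies the 1.5(IQR) and 3(IQR) rules to a small made-up dataset and then compares the mean and median with and without the flagged values. The data and the helper names `iqr_fences` and `classify` are hypothetical. Quartiles are computed with `statistics.quantiles(..., method="inclusive")`; other software may use a different quartile convention and give slightly different fences.

```python
import statistics


def iqr_fences(values):
    """Return Q1, Q3, IQR and the 1.5*IQR and 3*IQR fences."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "mild": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "extreme": (q1 - 3 * iqr, q3 + 3 * iqr),
    }


def classify(x, fences):
    """Label one value as typical, outlier, or extreme outlier."""
    low_extreme, high_extreme = fences["extreme"]
    low_mild, high_mild = fences["mild"]
    if x < low_extreme or x > high_extreme:
        return "extreme outlier"
    if x < low_mild or x > high_mild:
        return "outlier"
    return "typical"


# Made-up measurements, for illustration only.
data = [10, 12, 13, 14, 15, 16, 18, 30, 60]
fences = iqr_fences(data)
labels = {x: classify(x, fences) for x in data}
kept = [x for x in data if labels[x] == "typical"]

print("fences:", fences)
print("flagged:", {x: lab for x, lab in labels.items() if lab != "typical"})
print("mean   all/kept:", statistics.mean(data), statistics.mean(kept))
print("median all/kept:", statistics.median(data), statistics.median(kept))
```

In this example Q1 = 13 and Q3 = 18, so 30 is flagged as an outlier and 60 as an extreme outlier. Dropping both moves the mean from about 20.9 to 14 but the median only from 15 to 14. The comparison is shown as a sensitivity check, not as a recommendation to delete the values.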

• Median and IQR are generally unaffected by the removal of outliers, but minor changes are possible.
• Mean and Standard Deviation will be affected by the outlying values.
• The apparent shape of the distribution can also be affected by outlying values.
• One should never simply remove data values from a dataset.
• In practice, if the outliers are not errors, a sensitivity analysis will often be conducted or robust statistical methods will be used.

WAYS OF PRESENTING DATA

• Summary table
• Bar/Pie chart
• Histogram
• Scatter plot
• Boxplot
  1. Outlier
  2. Extreme Outlier
  3. Modified Boxplot

SUMMARY TABLE

1. By one variable

side    N    Mean    SD     Median   Min    Max
left    14   18.83   6.04   18.25    8.00   30.10
right   14   18.61   5.48   17.75    8.80   28.21

Variable   N_missing   Level   Total (N=25)   Case (N=12)   Control (N=13)
Cancer     0           No      24 (96.00%)    11 (91.67%)   13 (100.00%)
                       Yes     1 (4.00%)      1 (8.33%)     0 (0.00%)

2. By multiple categorical variables

Location of Tumor   Sequence of Radiation with Surgery   Before 2002     After 2002
Lower (n = 972)     Preoperative                         107 (19.21%)    65 (15.66%)
                    Postoperative                        450 (80.79%)    350 (84.34%)
Upper (n = 283)     Preoperative                         20 (13.16%)     21 (16.03%)
                    Postoperative                        132 (86.84%)    110 (83.97%)

BAR CHART AND PIE CHART

[Figure: bar chart "Bariatric surgeries, 2010-2013", showing percentages (axis from 0% to 14%) for AGB, LSG, and RYGB; legend: ED revisit, Admitted from ED, Discharged from ED.]

[Figure: pie chart "Diagnosis for Cholecystectomy patients, 2006-2013".]

HISTOGRAM AND SCATTER PLOT

[Figure: scatter plot "DISC measurement by group", Left (vertical axis, 0 to 70) against Right (horizontal axis, 0 to 70); groups: Control, No Surgery, Gamma-knife, Resection.]

BOX-PLOT

One continuous variable and one categorical variable

OTHER THEORETICAL DISTRIBUTIONS

Type of Outcome Variable                 Theoretical Distribution
Continuous numeric                       Normal, Log-normal, Exponential, …
Discrete numeric                         Poisson, Negative Binomial, …
Binary                                   Bernoulli, Binomial, …
Categorical with multiple categories     Multinomial, Hypergeometric, …

CONFIDENCE INTERVALS

Population parameters: μ, σ, π, …
Sample statistics: x̄, s, p̂, …

A point estimate alone is not enough: it gives us no way to judge how accurate it is as an estimator. A confidence interval provides a better estimate by combining the point estimate with its standard error to define a range of values that is likely to cover the true value of the parameter. A confidence interval starts with the point estimate and adds a “margin of error.” A confidence interval is defined as: point estimate +/- margin of error.

CONFIDENCE INTERVALS

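95% CI for μ: find lower and upper bounds such that $P(\text{lower} < \mu < \text{upper}) = 0.95$.

Since, by the central limit theorem, $\bar{x} \sim N(\mu, \sigma^2/n)$,

$$P\left(-1.96 \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95$$

$$P\left(\bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}}\right) = 0.95$$

CONFIDENCE INTERVALS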

95% Confidence Interval (CI) for µ: $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$

Interpretation 1: You can be 95% confident that the true mean (μ) lies between the lower and upper bounds.
Interpretation 2: If the study were repeated many times, 95% of the intervals constructed from the sample means ($\bar{x}$) would contain the true population mean (μ).

100(1-α)% CI: $\bar{x} \pm Z_{1-\alpha/2}\,\sigma/\sqrt{n}$
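The interval formula and the repeated-sampling interpretation can be sketched in a few lines of standard-library Python. This sketch assumes σ is known (so the z-based interval applies); the sample values, the choice σ = 6, and the function names `z_confidence_interval` and `coverage` are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def z_confidence_interval(sample, sigma, confidence=0.95):
    """CI for the mean as x-bar +/- z * sigma / sqrt(n), with sigma assumed known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for a 95% interval
    margin = z * sigma / sqrt(len(sample))
    center = mean(sample)
    return center - margin, center + margin


def coverage(mu, sigma, n, reps, seed=0):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = z_confidence_interval(sample, sigma)
        hits += lower <= mu <= upper
    return hits / reps


# Made-up sample of 9 measurements, with sigma = 6 assumed known.
sample = [118, 122, 125, 130, 121, 127, 124, 129, 120]
low, high = z_confidence_interval(sample, sigma=6)
print(f"95% CI: ({low:.2f}, {high:.2f})")

rate = coverage(mu=120, sigma=6, n=9, reps=2000)
print(f"share of simulated intervals covering mu: {rate:.3f}")
```

For this sample the mean is 124 and the margin of error is about 3.92. The simulated coverage should land close to 0.95, which is the sense of Interpretation 2. When σ is unknown and estimated by s, a t-based interval would normally be used instead.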

A good link for simulation of CI: http://www.ruf.rice.edu/~lane/stat_sim/conf_interval/index.html

The cartoon guide to Statistics by Gonick and Smith

HYPOTHESIS TESTING

• Using data to test specific hypotheses
• Making decisions based on probability (instead of subjective impressions)
• A distribution is usually assumed
• Methods that require no distributional assumptions are called non-parametric or distribution-free

WHAT IS A HYPOTHESIS

• A well-formulated hypothesis will be both quantifiable and testable, that is, it will involve measurable quantities or refer to items that may be assigned to mutually exclusive categories.
• It takes one of two forms:
  – “Some measurable characteristic of a population takes one of a specific set of values”
  – “Some measurable characteristic takes different values in different populations, and the difference has a specific pattern or a specific set of values”

EXAMPLES

This new drug will lower diastolic blood pressure.

For males over 40 suffering from chronic hypertension, a 100 mg daily dose of this new drug will lower diastolic blood pressure an average of 10 mm Hg.

BASIC DEFINITIONS AND NOTATION

• The Null Hypothesis describes some aspect of the statistical behavior of a set of data and is denoted H0
• This description is treated as valid unless the actual behavior of the data contradicts this assumption
• The Alternative Hypothesis is generally the “opposite” of the null hypothesis and is denoted H1

HYPOTHESIS

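• The null hypothesis is usually of the form (for a population mean μ, for example) H0: μ = μ0, where μ0 is the hypothesized value

• The alternative will take on one of the following forms: H1: μ ≠ μ0 (two-sided), H1: μ > μ0, or H1: μ < μ0 (one-sided)

ERRORS IN HYPOTHESIS TESTING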

The facts versus the decisions:

• Fact: No Difference
  – Decision "No Difference": Correct
  – Decision "Drug is Better": Type I error. The manufacturer wastes money developing an ineffective drug.
• Fact: Drug is Better
  – Decision "No Difference": Type II error. The manufacturer misses an opportunity for profit; the public is denied access to an effective treatment.
  – Decision "Drug is Better": Correct

There is a trade-off between type I error, α, and type II error, β.

P-VALUE

Definition: the probability of obtaining a test statistic as extreme as or more extreme than the actual test statistic obtained, given that the null hypothesis is true.

Other explanation: the α level at which we would be indifferent between accepting and rejecting H0 given the sample data at hand, or equivalently the α level at which the observed value of the test statistic lies on the borderline between the acceptance and rejection regions.

BASIC STEPS IN HYPOTHESIS TESTING

1. State the null (H0) and alternative (H1) hypotheses
2. Choose a significance level, α (usually 0.05 or 0.01)
3. Based on the sample, calculate the test statistic and calculate the p-value based on the theoretical distribution
4. Compare the p-value with the significance level
5. Make a decision, and state the conclusion

RELATIONSHIP BETWEEN HT AND CI

• If we wish to conduct a two-sided test of a hypothesis regarding a population parameter with significance level α, we can do this by constructing a 100(1-α)% confidence interval and checking to see if the hypothesized value is in the interval
• In this manner, CIs can be used to conduct Two-Sided Hypothesis Tests
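The five steps and the link between tests and confidence intervals can be sketched with a two-sided one-sample z-test. This is a minimal illustration that assumes a known σ and approximate normality of the sample mean; the numbers and the function name `two_sided_z_test` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_z_test(xbar, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test of H0: mu = mu0, plus the matching 100(1-alpha)% CI."""
    std_normal = NormalDist()
    se = sigma / sqrt(n)                              # standard error of the mean
    z = (xbar - mu0) / se                             # step 3: test statistic
    p_value = 2 * (1 - std_normal.cdf(abs(z)))        # step 3: two-sided p-value
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    ci = (xbar - z_crit * se, xbar + z_crit * se)
    return {
        "z": z,
        "p_value": p_value,
        "ci": ci,
        "reject_by_p": p_value < alpha,               # steps 4-5: compare and decide
        "reject_by_ci": not (ci[0] <= mu0 <= ci[1]),  # same decision via the CI
    }


# Made-up summary data: sample mean 124 from n = 9, sigma = 6 assumed known.
result_120 = two_sided_z_test(xbar=124, mu0=120, sigma=6, n=9)
result_121 = two_sided_z_test(xbar=124, mu0=121, sigma=6, n=9)

for mu0, r in ((120, result_120), (121, result_121)):
    print(f"H0: mu = {mu0}: z = {r['z']:.2f}, p = {r['p_value']:.4f}, "
          f"CI = ({r['ci'][0]:.2f}, {r['ci'][1]:.2f}), "
          f"reject by p: {r['reject_by_p']}, reject by CI: {r['reject_by_ci']}")
```

With these numbers, H0: μ = 120 is rejected at α = 0.05 (z = 2, p ≈ 0.046, and 120 lies outside the 95% CI), whereas H0: μ = 121 is not (z = 1.5, p ≈ 0.134, and 121 lies inside the interval). Both routes give the same decision.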

Variable              Group                       Estimate   95% Confidence Interval   P-value
absolute difference   Control vs. No surgery      0.421      0.239-0.741               0.0036
                      Control vs. Surgery         0.276      0.158-0.482               <0.0001
                      No surgery vs. Surgery      0.655      -0.372-1.156              0.1400
                      Gamma-knife vs. Resection   0.420      0.190-0.925               0.0322

NONPARAMETRIC TESTS

• Usually we assume an underlying distribution, and the methods used are based on these assumptions
• Such methods are parametric statistical methods, since the parametric form of the distribution is assumed to be known
• If these assumptions are not reasonable and/or the central limit theorem cannot be applied, nonparametric procedures should be used

NONPARAMETRIC TESTS

• These tests can be used in situations where the data are ordinal or even binary (Yes/No)
• For quantitative data with extreme values, ranking moderates the influence of these values
• In these situations, the assumptions of parametric tests such as the t-test may be violated, especially for small samples, and non-parametric methods are more appropriate

SUMMARY OF COMMON NONPARAMETRIC TESTS

Purpose of test: normal theory-based test / corresponding nonparametric test
• To study the central tendency of a single sample: one-sample t-test / sign test; Wilcoxon signed-rank test
• To compare central tendencies of two independent samples: t-test for two independent samples / Wilcoxon rank-sum test (Mann-Whitney U test)
• To examine a set of differences: paired t-test / Wilcoxon signed-rank test
• To assess the linear association between two variables: Pearson correlation coefficient / Spearman rank correlation coefficient
• To compare three or more groups: one-way analysis of variance (ANOVA) / Kruskal-Wallis test

The cartoon guide to Statistics by Gonick and Smith
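To illustrate why rank-based methods are less sensitive to extreme values, the sketch below computes the Mann-Whitney U statistic by direct pair counting and compares it with the difference in means before and after one value is replaced by an extreme one. The data and the function name `mann_whitney_u` are made up for illustration, and no p-value is computed here; in practice a statistical package would be used for the full test.

```python
from statistics import mean


def mann_whitney_u(x, y):
    """U statistic for x: count of (xi, yj) pairs with xi > yj; ties count 1/2."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Made-up measurements for two independent groups.
control = [12, 15, 11, 17, 13]
treated = [16, 18, 14, 19, 20]
treated_extreme = [16, 18, 14, 19, 200]  # largest value replaced by an extreme one

u_original = mann_whitney_u(treated, control)
u_extreme = mann_whitney_u(treated_extreme, control)
diff_original = mean(treated) - mean(control)
diff_extreme = mean(treated_extreme) - mean(control)

print(f"U statistic:        {u_original} -> {u_extreme}")
print(f"difference in means: {diff_original:.1f} -> {diff_extreme:.1f}")
```

Because U depends only on the ordering of the observations, it stays at 22 out of a possible 25 in both cases, while the difference in means jumps from 3.8 to 39.8.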

POWER IS AFFECTED BY

• Sample size: N
  – ↑ N → power ↑
• Significance level: α
  – ↑ α → power ↑
• Effect size: δ
  – ↑ δ → power ↑
• Variation in the continuous outcome: σ²
  – ↓ σ² → power ↑
• One-tailed vs. two-tailed tests
  – Power is greater in one-tailed tests than in comparable two-tailed tests

SAMPLE SIZE FORMULA BASICS

• Variables of interest
  – type of data, e.g. continuous, categorical
• Desired power
• Desired significance level
• Effect/difference of clinical importance
• Standard deviations of continuous outcome variables
• One- or two-sided tests
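As one concrete instance of how these ingredients combine, the sketch below uses the common normal-approximation formula for comparing two means with equal group sizes and a known common standard deviation: n per group = 2σ²(z₁₋α/₂ + z₁₋β)²/δ². The design, the numbers, and the function names `n_per_group` and `approx_power` are assumptions for illustration; other designs and outcome types call for other formulas.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def _z_alpha(alpha, two_sided):
    """Critical value of the standard normal for the chosen sidedness."""
    tail = alpha / 2 if two_sided else alpha
    return _STD_NORMAL.inv_cdf(1 - tail)


def n_per_group(delta, sigma, alpha=0.05, power=0.80, two_sided=True):
    """Sample size per group for a two-sample comparison of means (normal approx.)."""
    z_beta = _STD_NORMAL.inv_cdf(power)
    z_total = _z_alpha(alpha, two_sided) + z_beta
    return ceil(2 * (sigma * z_total / delta) ** 2)


def approx_power(n, delta, sigma, alpha=0.05, two_sided=True):
    """Approximate power with n subjects per group (ignores the far rejection tail)."""
    se = sigma * sqrt(2 / n)  # standard error of the difference in means
    return _STD_NORMAL.cdf(abs(delta) / se - _z_alpha(alpha, two_sided))


# Hypothetical inputs: detect a difference of 5 units when the SD is 10.
n = n_per_group(delta=5, sigma=10)
print("n per group:", n)
print(f"power at that n: {approx_power(n, delta=5, sigma=10):.3f}")
```

With these hypothetical inputs the formula gives 63 subjects per group for 80% power at a two-sided α of 0.05. Re-running `approx_power` with a larger n, a larger α, a larger δ, a smaller σ, or `two_sided=False` shows power rising, in line with the factors that affect power.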

Depends on study design

Not hard, but can be VERY algebra intensive

WHILE CONSULTING WITH A BIOSTATISTICIAN

Written summary materials are preferred: readable & organized
o Background information about the problem
o A proposal, protocol, or statement of work
o Schematics, such as a diagram or flow-chart
o Information about any existing database
o Prior studies, pilot data, tests, published and in-house reports relating to the problem

Make sure the biostatistician understands your needs, to avoid having a good solution to the wrong question.

Report any issues in data collection (e.g. missing data) or deviations from the study protocol (e.g. randomization before baseline tests verified eligibility).

The biostatistician should be a coauthor when he or she has provided substantive input, but always get explicit consent first.

WHEN TO CONSULT A BIOSTATISTICIAN

Most effective way: include a biostatistician from the very beginning of a research project.

After data have been collected:
• Bring a complete, detailed description of the study design and conduct
• Bring a clear exposition of the questions to be addressed
• Note: research questions may not be answerable by the collected data

Once the data are analyzed:
• Checking if conclusions fit the analysis results
• Suggesting better ways to describe and display the data
• Assuring no erroneous or incomplete statements about the findings
• Sometimes data re-analysis may be needed

When You Consult a Statistician... What to Expect (2007)
Berman N, Gansky S, Guillon C, Loughin T, Sanchez M (2003)

Please check our website for future lectures: https://osa.stonybrookmedicine.edu/research-core-facilities/bcc/education

THANK YOU!