Introduction to Biostatistics
Jie Yang, Ph.D.
Associate Professor
Department of Family, Population and Preventive Medicine
Director
Biostatistical Consulting Core
Director
Biostatistics and Bioinformatics Shared Resource, Stony Brook Cancer Center
In collaboration with the Clinical Translational Science Center (CTSC) and the Biostatistics and Bioinformatics Shared Resource (BB-SR), Stony Brook Cancer Center (SBCC).
OUTLINE
What is Biostatistics
What does a biostatistician do
• Experiment design, clinical trial design
• Descriptive and Inferential analysis
• Result interpretation
What you should bring when consulting with a biostatistician
WHAT IS BIOSTATISTICS
• The science of (bio)statistics encompasses
- the design of biological/clinical experiments
- the collection, summarization, and analysis of data from those experiments
- the interpretation of, and inference from, the results
How to Lie with Statistics (1954) by Darrell Huff.
http://www.youtube.com/watch?v=PbODigCZqL8
GOAL OF STATISTICS
[Diagram: POPULATION and SAMPLE, linked by sampling (probability theory) in one direction and by inference (inferential statistics) in the other; descriptive statistics apply to each.]
Population parameters: μ, σ, π, …
Sample statistics: X̄, s, p̂, …
PROPERTIES OF A “GOOD” SAMPLE
• Adequate sample size (statistical power)
• Random selection (representative)
Sampling Techniques:
1. Simple random sampling
2. Stratified sampling
3. Systematic sampling
4. Cluster sampling
5. Convenience sampling
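The first three techniques in the list can be sketched in a few lines of Python. This is only an illustration: the sampling frame of 100 patient IDs, the two sites used as strata, the sample sizes, and the function names are all assumptions made for the example. Cluster and convenience sampling are not shown.

```python
import random

def simple_random_sample(units, n, rng):
    """Draw n units without replacement; every subset of size n is equally likely."""
    return rng.sample(units, n)

def stratified_sample(units, stratum_of, n_per_stratum, rng):
    """Draw a simple random sample of fixed size within each stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for key in sorted(strata):
        sample.extend(rng.sample(strata[key], n_per_stratum))
    return sample

def systematic_sample(units, step, rng):
    """Pick a random start among the first `step` units, then take every step-th unit."""
    start = rng.randrange(step)
    return units[start::step]

rng = random.Random(2019)  # fixed seed only so the sketch is reproducible

# Hypothetical sampling frame: 100 patients, half from each of two sites.
patients = [{"id": i, "site": "A" if i < 50 else "B"} for i in range(100)]

srs = simple_random_sample(patients, 10, rng)
strat = stratified_sample(patients, lambda p: p["site"], 5, rng)
syst = systematic_sample(patients, 10, rng)

print([p["id"] for p in srs])
print([(p["id"], p["site"]) for p in strat])
print([p["id"] for p in syst])
```

Stratifying by site guarantees five patients from each site, which a simple random sample of ten does not.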
STUDY DESIGN
EXPERIMENT DESIGN
Completely Randomized Design (CRD)
- Randomly assign the experimental units to the treatments
Design with Blocking: deals with a nuisance factor that has some effect on the response but is of no interest to the experimenter. Without blocking, the large unexplained error reduces the power to detect treatment effects.
1. Randomized Complete Block Design (RCBD): one single blocking factor
2. Latin Square Design (two blocking factors)
3. Crossover Design (each subject = blocking factor)
4. Balanced Incomplete Block Design
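The difference between randomizing over all units (CRD) and randomizing within blocks (RCBD) can be sketched as follows. The unit names, the three treatments, and the two "litter" blocks are hypothetical, and the helper functions are illustrative rather than a standard API.

```python
import random

def completely_randomized(units, treatments, rng):
    """CRD: shuffle all units, then deal them to the treatments in equal-sized groups."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {unit: treatments[i % len(treatments)] for i, unit in enumerate(shuffled)}

def randomized_complete_block(blocks, treatments, rng):
    """RCBD: within each block, every treatment appears exactly once, in random order."""
    assignment = {}
    for block_units in blocks.values():
        order = list(treatments)
        rng.shuffle(order)
        for unit, treatment in zip(block_units, order):
            assignment[unit] = treatment
    return assignment

rng = random.Random(7)  # fixed seed only so the sketch is reproducible
treatments = ["A", "B", "C"]

# Hypothetical experiment 1: 12 units, no blocking.
units = [f"u{i}" for i in range(12)]
crd = completely_randomized(units, treatments, rng)

# Hypothetical experiment 2: litter is the nuisance (blocking) factor.
blocks = {"litter1": ["m1", "m2", "m3"], "litter2": ["m4", "m5", "m6"]}
rcbd = randomized_complete_block(blocks, treatments, rng)

print(crd)
print(rcbd)
```

In the blocked version each litter receives all three treatments, so litter-to-litter differences cannot be mistaken for treatment differences.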
EXPERIMENT DESIGN
Factorial Design: similar to a randomized block design, but it allows testing the interaction between two treatment factors. A significant interaction between A and B tells us that:
• the effect of A differs across the levels of B, or equivalently the effect of B differs across the levels of A.
• it is not very sensible to talk about the main effects of A and B on their own.
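A small numeric sketch may help show what an interaction looks like. The 2x2 cell means below are invented for illustration. Testing whether the interaction is significant would need replicate observations and an ANOVA, which are not shown here.

```python
# Hypothetical cell means for a 2x2 factorial: factor A (a1, a2) by factor B (b1, b2).
cell_means = {
    ("a1", "b1"): 10.0, ("a2", "b1"): 14.0,
    ("a1", "b2"): 11.0, ("a2", "b2"): 11.0,
}

def simple_effect_of_a(means, b_level):
    """Effect of moving from a1 to a2 while holding B fixed at b_level."""
    return means[("a2", b_level)] - means[("a1", b_level)]

effect_at_b1 = simple_effect_of_a(cell_means, "b1")
effect_at_b2 = simple_effect_of_a(cell_means, "b2")

# The interaction contrast is the difference between the two simple effects.
interaction_contrast = effect_at_b1 - effect_at_b2

# The main effect of A averages the simple effects over the levels of B.
main_effect_a = (effect_at_b1 + effect_at_b2) / 2

print(effect_at_b1, effect_at_b2, interaction_contrast, main_effect_a)
```

Here A raises the response by 4 at b1 and by 0 at b2, so the averaged main effect of 2 describes neither level of B.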
Experiment with random factors: the n levels studied are randomly selected from all possible levels of the factor of interest. Random factors are typically categorical.
Split-plot Design: confounding a main effect with blocks
EVIDENCE PYRAMID
IMPACT Observatory: tracking the evolution of clinical trial data sharing and research integrity - Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/Evidence-pyramid_fig1_309019368 [accessed 14 Jan, 2019]
WHAT A STATISTICIAN CAN HELP WITH
DURING THE STUDY DESIGN PHASE
Blinding/masking and randomization
The number and combination of experimental interventions
The timing of measurements or visits
Whether to collect information on a larger sample or on the same sample over time
Ways to maximize the efficient use of the available resources
Even data management: how measures are coded and what is computerized directly affect the ease, and even the feasibility, of subsequent analysis
How a biostatistician analyzes data
TYPES OF DATA
1. Nominal data: unordered categories or classes
e.g. gender, blood type, transplant type
2. Ordinal data: order among categories is important
e.g. disease severity, AE level
3. Discrete data: both ordering and magnitude are important;
often integers or counts, no intermediate values are possible
e.g. # of accidents within a month, # of kids in a family
4. Continuous data: difference between two possible data
values can be arbitrarily small
e.g. height, weight, body temperature, serum level, BP
5. Time-to-event data: censoring is present
e.g. overall survival
DESCRIPTIVE STATISTICS
General goal is to describe the distribution of a single variable (center, spread, shape, functional form)
Helpful for checking data and assumptions
Stratified (by group) analysis can be done for groups of interest
Values and comparisons can be visualized and “estimated”, but descriptive statistics alone provide no information about our level of confidence in conclusions
DESCRIPTIVE STATISTICS
1. Measure of central tendency
Mean: the arithmetic average
Median: the 50th percentile (the middle value of the ordered data)
Mode: the value that occurs most frequently; a distribution can be unimodal or multimodal
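The three measures of center can be computed with Python's standard `statistics` module. The values below are hypothetical and chosen only to make the differences visible.

```python
import statistics

# Hypothetical observations.
values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)        # sum of the values divided by their count
median = statistics.median(values)    # middle of the ordered data (average of 3 and 5 here)
modes = statistics.multimode(values)  # list of the most frequent value(s)

# A dataset with two equally frequent values is bimodal.
bimodal_modes = statistics.multimode([1, 1, 2, 4, 4])

print(mean, median, modes, bimodal_modes)
```

`multimode` returns a list, so unimodal and multimodal data are handled the same way.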
DESCRIPTIVE STATISTICS
• Reporting a measure of center gives only partial
information about a data set.
– Example: Consider the following three datasets:
Dataset 1: 4 5 5 5 6
Dataset 2: 1 3 5 7 9
Dataset 3: 1 5 5 5 9
All three datasets have identical means and medians.
Datasets 2 and 3 are more variable than Dataset 1.
• It is also important to describe the spread of values
about the center.
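The claim about the three datasets can be checked directly. The sketch below uses the datasets from the example and adds two spread measures, the range and the sample standard deviation, as one way to quantify "more variable".

```python
import statistics

datasets = {
    "Dataset 1": [4, 5, 5, 5, 6],
    "Dataset 2": [1, 3, 5, 7, 9],
    "Dataset 3": [1, 5, 5, 5, 9],
}

summary = {
    name: {
        "mean": statistics.mean(x),
        "median": statistics.median(x),
        "range": max(x) - min(x),
        "sd": statistics.stdev(x),  # sample standard deviation (n - 1 denominator)
    }
    for name, x in datasets.items()
}

for name, s in summary.items():
    print(f"{name}: mean={s['mean']}, median={s['median']}, "
          f"range={s['range']}, sd={s['sd']:.2f}")
```

All three share a mean and median of 5, while the range and standard deviation separate Dataset 1 from the other two.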
DESCRIPTIVE STATISTICS
2. Measure of variability
• Range = Max - Min
• Interquartile Range (IQR) = Q3 - Q1
• Variance, Sample Variance
• Standard Deviation, Sample Standard Deviation
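A short sketch of these measures using Python's `statistics` module follows. The data are hypothetical. Quartile conventions differ between software packages; `method="inclusive"` is one choice made here for illustration, so other tools may report slightly different quartiles.

```python
import statistics

# Hypothetical observations.
x = [1, 3, 5, 7, 9]

value_range = max(x) - min(x)

# Quartiles under the "inclusive" convention (an assumption of this sketch).
q1, q2, q3 = statistics.quantiles(x, n=4, method="inclusive")
iqr = q3 - q1

# Variance treats the data as the whole population (divide by N);
# sample variance treats it as a sample (divide by n - 1).
population_variance = statistics.pvariance(x)
sample_variance = statistics.variance(x)

population_sd = statistics.pstdev(x)
sample_sd = statistics.stdev(x)

print(value_range, iqr, population_variance, sample_variance)
print(round(population_sd, 3), round(sample_sd, 3))
```

The sample versions are slightly larger because dividing by n - 1 compensates for estimating the mean from the same data.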
IDENTIFYING POTENTIAL OUTLIERS
• An Outlying Value is a value, X, such that
X > Q3 + 1.5(IQR)
or
X < Q1 - 1.5(IQR)
• An Extreme Outlying Value is a value, X, such that
X > Q3 + 3(IQR)
or
X < Q1 - 3(IQR)
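The two rules above translate directly into code. In this sketch the measurements are hypothetical, the function name is illustrative, and the quartiles use the "inclusive" convention of Python's `statistics.quantiles`, which is an assumption; other quartile conventions can move the fences slightly.

```python
import statistics

def classify_outliers(values):
    """Label each value using the 1.5*IQR and 3*IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    labels = []
    for x in values:
        if x > q3 + 3 * iqr or x < q1 - 3 * iqr:
            labels.append((x, "extreme outlier"))
        elif x > q3 + 1.5 * iqr or x < q1 - 1.5 * iqr:
            labels.append((x, "outlier"))
        else:
            labels.append((x, "within fences"))
    return labels

# Hypothetical measurements.
data = [10, 12, 13, 14, 15, 16, 18, 26, 40]
labels = dict(classify_outliers(data))

for value, label in labels.items():
    print(value, label)
```

An extreme outlying value also satisfies the 1.5(IQR) rule; the sketch reports only the stronger label.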
EFFECTS OF OUTLIERS
Median and IQR are generally unaffected by the removal
of outliers but minor changes are possible.
Mean and Standard Deviation will be affected by the outlying values.
Apparent shape of the distribution can also be affected by
outlying values.
One should never simply remove data values from a
dataset.
In practice, if the outliers are not errors, sensitivity analysis will often be conducted or robust statistical
methods will be used.
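A very simple form of sensitivity analysis is to compute the summaries with and without the suspect value and compare them. The data and the outlying value of 60 below are hypothetical, and the quartiles again use the "inclusive" convention as an assumption.

```python
import statistics

def summarize(values):
    """Mean and SD alongside their robust counterparts, median and IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }

# Hypothetical measurements, then the same data with one outlying value added.
clean = [10, 12, 13, 14, 15, 16, 18]
with_outlier = clean + [60]

without = summarize(clean)
including = summarize(with_outlier)

for key in ("mean", "sd", "median", "iqr"):
    print(f"{key}: without={without[key]:.2f}, with outlier={including[key]:.2f}")
```

The mean and standard deviation shift substantially, while the median and IQR change only slightly.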
WAYS OF PRESENTING DATA
• Summary table
• Bar/Pie chart
• Histogram
• Scatter plot
• Boxplot
1. Outlier
2. Extreme Outlier
3. Modified Boxplot
SUMMARY TABLE
1. By one variable
Side      Left     Right
N         14       14
Mean      18.83    18.61
SD        6.04     5.48
Median    18.25    17.75
Min       8.00     8.80
Max       30.10    28.21
Variable   N_missing   Level   Total (N=25)   Case (N=12)   Control (N=13)
Cancer     0           No      24 (96.00%)    11 (91.67%)   13 (100.00%)
                       Yes     1 (4.00%)      1 (8.33%)     0 (0.00%)
2. By multiple categorical variables
Radiation Sequence with Surgery
Location of Tumor    Before 2002                      After 2002
                     Preoperative    Postoperative    Preoperative    Postoperative
Lower (n = 972)      107 (19.21%)    450 (80.79%)     65 (15.66%)     350 (84.34%)
Upper (n = 283)      20 (13.16%)     132 (86.84%)     21 (16.03%)     110 (83.97%)
BAR CHART AND PIE CHART
[Figures: "Bariatric surgeries, 2010-2013" and "Diagnosis for Cholecystectomy patients, 2006-2013". Labels: AGB, LSG, RYGB; ED revisit, Discharged from ED, Admitted from ED. Percentage axis from 0.00% to 14.00%.]
HISTOGRAM AND SCATTER PLOT
[Figure: DISC measurement by group. Axes run from 0 to 70; axis label: Right. Groups: Control, No Surgery, Gamma-knife, Resection.]
BOX-PLOT
One continuous variable and one categorical variable
OTHER
THEORETICAL DISTRIBUTIONS
Type of Outcome Variable                Theoretical Distribution
Continuous numeric                      Normal, Log-normal, Exponential, …
Discrete numeric                        Poisson, Negative Binomial, …
Binary                                  Bernoulli, Binomial, …
Categorical with multiple categories    Multinomial, Hypergeometric, …
CONFIDENCE INTERVALS
Population parameters: μ, σ, π, …
Sample statistics: X̄, s, p̂, …
A point estimate alone is not enough: it gives us no way to judge how accurate it is as an estimator. A confidence interval provides a better estimate by combining the point estimate with its standard error to define a range of values that is likely to cover the true value of the parameter.
A confidence interval starts with the point estimate and adds a “margin of error.” A confidence interval is defined as: point estimate +/- margin of error.
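As a minimal sketch of "point estimate +/- margin of error" for a mean, the margin can be taken as a normal quantile times the standard error. Two assumptions are made here: the sample standard deviation stands in for the unknown σ, and a normal quantile is used instead of a t quantile, which gives intervals that are somewhat too narrow for small samples. The blood pressure readings and the function name are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation interval: mean +/- z * s / sqrt(n)."""
    n = len(sample)
    point_estimate = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided quantile, about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin_of_error = z * standard_error
    return point_estimate - margin_of_error, point_estimate + margin_of_error

# Hypothetical systolic blood pressure readings (mmHg).
bp = [118, 125, 130, 112, 121, 135, 128, 119, 124, 131, 116, 127]

low95, high95 = mean_confidence_interval(bp, 0.95)
low99, high99 = mean_confidence_interval(bp, 0.99)

print(f"mean = {statistics.mean(bp):.2f}")
print(f"95% CI: ({low95:.2f}, {high95:.2f})")
print(f"99% CI: ({low99:.2f}, {high99:.2f})")
```

Asking for more confidence widens the interval, because the margin of error grows with the quantile.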
CONFIDENCE INTERVALS
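95% CI for μ: $P(?? < \mu < ??) = 0.95$
Since, by the central limit theorem, $\bar{x} \sim N(\mu, \sigma^2/n)$,
$P\left(-1.96 < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95$
$P\left(\bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}}\right) = 0.95$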
CONFIDENCE INTERVALS
95% Confidence Interval (CI) for μ:
$\bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}}$
Interpretation 1: You can be 95% sure that the true mean
(μ) will fall within the upper and lower bounds.
Interpretation 2: 95% of the intervals constructed in this way from sample means (x̄) will contain the true population mean (μ).
100(1-α)% CI: $\bar{x} \pm Z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}$
A good link for simulation of CI: http://www.ruf.rice.edu/~lane/stat_sim/conf_interval/index.html
The Cartoon Guide to Statistics by Gonick and Smith
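The repeated-sampling interpretation of a confidence interval can also be checked with a small simulation. The sketch below assumes a normal population with a known σ, and the values of μ, σ, n, and the number of repetitions are arbitrary choices for illustration.

```python
import math
import random
import statistics

def coverage(mu, sigma, n, n_intervals, rng, z=1.96):
    """Fraction of intervals xbar +/- z*sigma/sqrt(n) that contain the true mean mu."""
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(n_intervals):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        if xbar - half_width < mu < xbar + half_width:
            hits += 1
    return hits / n_intervals

rng = random.Random(42)  # fixed seed only so the sketch is reproducible

# Hypothetical population: mean 120, standard deviation 15; samples of size 25.
rate = coverage(mu=120.0, sigma=15.0, n=25, n_intervals=2000, rng=rng)
print(f"Proportion of intervals covering mu: {rate:.3f}")
```

The proportion should land close to 0.95; any single interval either contains μ or does not.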
HYPOTHESIS TESTING
• Using data to test specific hypotheses
• Making decisions based on probability
(instead of subjective impressions)
• A distribution for the data is usually assumed
• Methods that require no distributional assumptions are called non-parametric or distribution-free
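One example of a distribution-free method is a permutation test, sketched below for a difference in group means. It is offered only as an illustration of the idea, not as the method used in any particular analysis. The two groups of measurements are hypothetical, and the p-value is approximated by random shuffles rather than full enumeration.

```python
import random
import statistics

def permutation_test(group_a, group_b, n_permutations, rng):
    """Approximate two-sided permutation p-value for a difference in group means."""
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add 1 to numerator and denominator so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)

rng = random.Random(1)  # fixed seed only so the sketch is reproducible

# Hypothetical measurements in two groups.
treated = [5.1, 4.8, 6.0, 5.7, 6.3, 5.9]
control = [4.2, 4.5, 3.9, 4.8, 4.1, 4.4]

p_value = permutation_test(treated, control, 5000, rng)
p_same = permutation_test(treated, treated, 200, rng)

print(f"p-value (treated vs control): {p_value:.4f}")
print(f"p-value (group vs itself): {p_same:.4f}")
```

The decision is based on how often a random relabeling produces a difference at least as large as the observed one, so no normal or other distribution is assumed for the data.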