Introduction to Biostatistics

Jie Yang, Ph.D.

Associate Professor
Department of Family, Population and Preventive Medicine

Director, Biostatistical Consulting Core

Director, Biostatistics and Bioinformatics Shared Resource, Stony Brook Cancer Center

In collaboration with the Clinical Translational Science Center (CTSC) and the Biostatistics and Bioinformatics Shared Resource (BB-SR), Stony Brook Cancer Center (SBCC).

OUTLINE

• What is Biostatistics

• What does a biostatistician do
  - Experiment design, clinical trial design
  - Descriptive and inferential analysis
  - Result interpretation

• What you should bring while consulting with a biostatistician

WHAT IS BIOSTATISTICS

• The science of (bio)statistics encompasses
  - the design of biological/clinical experiments
  - the collection, summarization, and analysis of data from those experiments
  - the interpretation of, and inference from, the results

How to Lie with Statistics (1954) by Darrell Huff.

http://www.youtube.com/watch?v=PbODigCZqL8

GOAL OF STATISTICS

[Diagram linking POPULATION and SAMPLE. Labels: Sampling, Probability Theory, Descriptive Statistics, Inference, Inferential Statistics.]

Population parameters: μ, σ, π, …

Sample statistics: X̄, s, p̂, …

PROPERTIES OF A “GOOD” SAMPLE

• Adequate sample size (statistical power)

• Random selection (representative)

Sampling Techniques:

1. Simple random sampling
2. Stratified sampling
3. Systematic sampling
4. Cluster sampling
5. Convenience sampling
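As a rough illustration of three of the techniques listed above, the following Python sketch draws a simple random sample, a systematic sample, and a stratified sample from a hypothetical sampling frame. The patient IDs, the two sites, the sample sizes, and all function names are assumptions made for the example; they do not come from the slides.

```python
import random


def simple_random_sample(units, k, rng):
    """Draw k units without replacement; every subset of size k is equally likely."""
    return rng.sample(units, k)


def systematic_sample(units, k, rng):
    """Pick a random start within the first interval, then take every step-th unit."""
    step = len(units) // k
    start = rng.randrange(step)
    return [units[start + i * step] for i in range(k)]


def stratified_sample(units, stratum_of, k_per_stratum, rng):
    """Draw a simple random sample of fixed size within each stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for label in sorted(strata):
        sample.extend(rng.sample(strata[label], k_per_stratum))
    return sample


rng = random.Random(2019)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: 100 patient IDs, 50 at each of two sites.
patients = [(pid, "site_A" if pid < 50 else "site_B") for pid in range(100)]

srs = simple_random_sample(patients, 10, rng)
sys_sample = systematic_sample(patients, 10, rng)
strat = stratified_sample(patients, lambda p: p[1], 5, rng)

print("Simple random:", [p[0] for p in srs])
print("Systematic:   ", [p[0] for p in sys_sample])
print("Stratified:   ", [p[0] for p in strat])
```

The stratified draw guarantees five patients from each site, whereas a simple random sample of ten may by chance over-represent one site.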

STUDY DESIGN

EXPERIMENT DESIGN

• Completely Randomized Design (CRD)
  - Randomly assign the experimental units to the treatments

• Design with Blocking: deals with a nuisance factor that has some effect on the response but is of no interest to the experimenter. Without blocking, large unexplained error leads to less detection power.
  1. Randomized Complete Block Design (RCBD): one single blocking factor
  2. Latin Square Design (two blocking factors)
  3. Crossover Design (each subject = blocking factor)
  4. Balanced Incomplete Block Design
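The difference between a CRD and an RCBD lies in where the randomization happens. The sketch below is a minimal illustration under assumed inputs: three hypothetical treatments (A, B, C), twelve hypothetical units for the CRD, and four hypothetical blocks of three units for the RCBD. The function names are made up for the example.

```python
import random


def randomize_crd(units, treatments, rng):
    """CRD: shuffle all units, then deal the treatments out in equal-sized groups."""
    shuffled = units[:]
    rng.shuffle(shuffled)
    return {unit: treatments[i % len(treatments)] for i, unit in enumerate(shuffled)}


def randomize_rcbd(blocks, treatments, rng):
    """RCBD: within each block, every treatment appears exactly once, in random order."""
    plan = {}
    for block, units in blocks.items():
        order = treatments[:]
        rng.shuffle(order)
        for unit, treatment in zip(units, order):
            plan[unit] = (block, treatment)
    return plan


rng = random.Random(7)  # fixed seed so the sketch is reproducible
treatments = ["A", "B", "C"]  # hypothetical treatment labels

# CRD: 12 hypothetical units, randomized without regard to any nuisance factor.
units = [f"unit_{i:02d}" for i in range(12)]
crd_plan = randomize_crd(units, treatments, rng)

# RCBD: 4 hypothetical blocks (e.g. litters or clinics), 3 units per block.
blocks = {f"block_{b}": [f"block_{b}_unit_{u}" for u in range(3)] for b in range(4)}
rcbd_plan = randomize_rcbd(blocks, treatments, rng)

for unit, (block, treatment) in sorted(rcbd_plan.items()):
    print(block, unit, treatment)
```

Because each block receives every treatment, block-to-block differences can be separated from the treatment comparison at the analysis stage.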

EXPERIMENT DESIGN

• Factorial Design: similar to a randomized block design, but allows testing the interaction between two treatment effects. A significant interaction between A and B tells us that:
  - the effect of A differs across the levels of B, or equivalently, the effect of B differs across the levels of A;
  - it is not very sensible to even be talking about the main effects of A and B.

• Experiment with random factors: randomly select n of the possible levels of the factor of interest. Typically random factors are categorical.

• Split-plot Design: confounding a main effect with blocks
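To make the idea of an interaction concrete, the following sketch uses made-up cell means for a 2 x 2 factorial layout (factor A with levels a1, a2; factor B with levels b1, b2). All numbers and names are hypothetical. It only computes descriptive contrasts; judging whether an interaction is significant would require a formal test such as a two-way ANOVA.

```python
# Hypothetical cell means for a 2 x 2 factorial experiment.
cell_means = {
    ("a1", "b1"): 10.0, ("a2", "b1"): 14.0,
    ("a1", "b2"): 12.0, ("a2", "b2"): 24.0,
}


def simple_effect_of_a(means, b_level):
    """Effect of moving from a1 to a2 while holding B fixed at b_level."""
    return means[("a2", b_level)] - means[("a1", b_level)]


effect_at_b1 = simple_effect_of_a(cell_means, "b1")
effect_at_b2 = simple_effect_of_a(cell_means, "b2")

# Interaction contrast: how much the effect of A changes between levels of B.
interaction_contrast = effect_at_b2 - effect_at_b1

# The "main effect" of A averages over B, hiding that the two simple effects differ.
main_effect_a = (effect_at_b1 + effect_at_b2) / 2

print("Effect of A at b1:", effect_at_b1)
print("Effect of A at b2:", effect_at_b2)
print("Interaction contrast:", interaction_contrast)
print("Averaged main effect of A:", main_effect_a)
```

With these invented numbers, the effect of A is 4 at b1 and 12 at b2, so quoting a single averaged effect of 8 would describe neither level of B well.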

EVIDENCE PYRAMID

IMPACT Observatory: tracking the evolution of clinical trial data sharing and research integrity - Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/Evidence-pyramid_fig1_309019368 [accessed 14 Jan, 2019]

WHAT CAN A STATISTICIAN HELP WITH

DURING STUDY DESIGN PHASE

• Blinding/masking and randomization
• The number and combination of experimental interventions
• The timing of measurements or visits
• Whether to collect information on a larger sample or on the same sample over time
• Ways to maximize the efficient use of the available resources
• Even data management: how measures are coded and what is computerized directly affect the ease, even the feasibility, of subsequent analysis

How a biostatistician analyzes data

TYPE OF DATA
1. Nominal data: unordered categories or classes
   e.g. gender, blood type, transplant type

2. Ordinal data: order among categories is important
   e.g. disease severity, AE level

3. Discrete data: both ordering and magnitude are important; often integers or counts, with no intermediate values possible
   e.g. number of accidents within a month, number of kids in a family

4. Continuous data: the difference between two possible data values can be arbitrarily small
   e.g. height, weight, body temperature, serum level, BP

5. Time-to-event data: censoring is present
   e.g. overall survival

DESCRIPTIVE STATISTICS

• General goal is to describe the distribution of a single variable (center, spread, shape, functional form)
• Helpful for checking data and assumptions
• Stratified (by group) analysis can be done for groups of interest
• Values and comparisons can be visualized and “estimated”, but descriptive statistics alone provide no information about our level of confidence in conclusions

1. Measure of central tendency

• Mean: average
• Median: the 50th percentile point (middle value)
• Mode: value that occurs most frequently; a distribution can be unimodal or multimodal

• Reporting a measure of center gives only partial information about a data set.

  Example: Consider the following three datasets:

  Dataset 1: 4 5 5 5 6
  Dataset 2: 1 3 5 7 9
  Dataset 3: 1 5 5 5 9

  - All three datasets have identical means and medians.
  - Datasets 2 and 3 are more variable than the first one.

• It is also important to describe the spread of values about the center.

2. Measure of variability

• Range = Max - Min
• Inter-Quartile Range (IQR) = Q3 - Q1
• Variance, Sample Variance
• Standard Deviation, Sample Standard Deviation
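The three example datasets can be summarized with Python's standard `statistics` module, as in the sketch below. The helper name `summarize` is made up for illustration, and the quartiles use the module's "inclusive" method, which is only one of several quartile conventions; other software may report slightly different Q1 and Q3 values.

```python
import statistics

datasets = {
    "Dataset 1": [4, 5, 5, 5, 6],
    "Dataset 2": [1, 3, 5, 7, 9],
    "Dataset 3": [1, 5, 5, 5, 9],
}


def summarize(values):
    """Return measures of center and spread for one list of numbers."""
    # Quartiles by the 'inclusive' convention (an assumption of this sketch).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "sample_variance": statistics.variance(values),  # divides by n - 1
        "sample_sd": statistics.stdev(values),
    }


summaries = {name: summarize(values) for name, values in datasets.items()}

for name, summary in summaries.items():
    print(name, summary)
```

All three datasets share a mean and median of 5, but their ranges (2, 8, 8) and sample variances (0.5, 10, 8) differ, which is the point of reporting spread alongside center.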

IDENTIFYING POTENTIAL OUTLIERS

• An Outlying Value is a value, X, such that

  X > Q3 + 1.5(IQR)  or  X < Q1 - 1.5(IQR)

• An Extreme Outlying Value is a value, X, such that

  X > Q3 + 3(IQR)  or  X < Q1 - 3(IQR)
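The two rules above can be applied mechanically, as in this minimal Python sketch. The nine data values are invented for the example, the function names are hypothetical, and the quartiles use the `statistics` module's "inclusive" method, which is one of several conventions and may not match the quartile definition used by a given statistical package.

```python
import statistics


def iqr_fences(values):
    """Return the 1.5*IQR (inner) and 3*IQR (outer) fences around Q1 and Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3 * iqr, q3 + 3 * iqr),
    }


def classify(values):
    """Label each value as typical, outlier, or extreme outlier."""
    fences = iqr_fences(values)
    inner_low, inner_high = fences["inner"]
    outer_low, outer_high = fences["outer"]
    labels = []
    for x in values:
        if x < outer_low or x > outer_high:
            labels.append((x, "extreme outlier"))
        elif x < inner_low or x > inner_high:
            labels.append((x, "outlier"))
        else:
            labels.append((x, "typical"))
    return labels


# Hypothetical measurements; Q1 = 5, Q3 = 9, so IQR = 4 under this convention.
data = [2, 4, 5, 6, 7, 8, 9, 16, 30]

fences = iqr_fences(data)
labels = classify(data)

print("Fences:", fences)
for value, label in labels:
    print(value, label)
```

Here the inner fences are -1 and 15 and the outer fences are -7 and 21, so 16 is flagged as an outlier and 30 as an extreme outlier.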

EFFECTS OF OUTLIERS

• Median and IQR are generally unaffected by the removal of outliers, but minor changes are possible.
• Mean and Standard Deviation will be affected by the outlying values.
• Apparent shape of the distribution can also be affected by outlying values.
• One should never simply remove data values from a dataset.
• In practice, if the outliers are not errors, sensitivity analysis will often be conducted or robust statistical methods will be used.
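A small numerical comparison shows why the median is described as robust. The sketch below reuses the values 4, 5, 5, 5, 6 and appends a single invented outlying value of 50; the added value and the variable names are assumptions for illustration only.

```python
import statistics

clean = [4, 5, 5, 5, 6]
with_outlier = clean + [50]  # one hypothetical outlying value added

mean_clean = statistics.mean(clean)
mean_outlier = statistics.mean(with_outlier)
median_clean = statistics.median(clean)
median_outlier = statistics.median(with_outlier)
sd_clean = statistics.stdev(clean)
sd_outlier = statistics.stdev(with_outlier)

print(f"Mean:   {mean_clean} -> {mean_outlier}")
print(f"Median: {median_clean} -> {median_outlier}")
print(f"SD:     {sd_clean:.2f} -> {sd_outlier:.2f}")
```

The mean jumps from 5 to 12.5 and the standard deviation grows sharply, while the median stays at 5. Running an analysis both with and without such a value is one simple form of sensitivity analysis.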

WAYS OF PRESENTING DATA

• Summary table
• Bar/Pie chart
• Histogram
• Scatter plot
• Boxplot
  1. Outlier
  2. Extreme Outlier
  3. Modified Boxplot

SUMMARY TABLE

1. By one variable

side      left     right
N         14       14
Mean      18.83    18.61
SD        6.04     5.48
Median    18.25    17.75
Min       8.00     8.80
Max       30.10    28.21

Variable   N_missing   Level   Total (N=25)   Case (N=12)   Control (N=13)
Cancer     0           No      24 (96.00%)    11 (91.67%)   13 (100.00%)
                       Yes     1 (4.00%)      1 (8.33%)     0 (0.00%)

2. By multiple categorical variables

Radiation Sequence with Surgery, by Location of Tumor

                      Before 2002                       After 2002
Location of Tumor     Preoperative    Postoperative     Preoperative    Postoperative
Lower (n = 972)       107 (19.21%)    450 (80.79%)      65 (15.66%)     350 (84.34%)
Upper (n = 283)       20 (13.16%)     132 (86.84%)      21 (16.03%)     110 (83.97%)
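The percentages in the cross-tabulation above are row percentages within each tumor location and time period. The Python sketch below recomputes them from the counts shown on the slide; the dictionary layout and the function name `row_percentages` are choices made for this example, not part of the original material.

```python
# Counts taken from the radiation-sequence table above.
counts = {
    ("Lower", "Before 2002"): {"Preoperative": 107, "Postoperative": 450},
    ("Lower", "After 2002"): {"Preoperative": 65, "Postoperative": 350},
    ("Upper", "Before 2002"): {"Preoperative": 20, "Postoperative": 132},
    ("Upper", "After 2002"): {"Preoperative": 21, "Postoperative": 110},
}


def row_percentages(cell_counts):
    """Format each count as 'n (xx.xx%)' relative to the total of its row."""
    total = sum(cell_counts.values())
    return {
        level: f"{n} ({100 * n / total:.2f}%)"
        for level, n in cell_counts.items()
    }


formatted = {key: row_percentages(cells) for key, cells in counts.items()}

for (location, period), cells in formatted.items():
    print(location, period, cells)

# Totals per tumor location, which should match the n shown in the table.
location_totals = {}
for (location, _), cells in counts.items():
    location_totals[location] = location_totals.get(location, 0) + sum(cells.values())
print(location_totals)
```

The recomputed totals are 972 for Lower and 283 for Upper, which agrees with the row labels in the table.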

BAR CHART AND PIE CHART

[Bar chart: Bariatric surgeries, 2010-2013. Vertical axis from 0.00% to 14.00%. Labels: AGB, LSG, RYGB; Admitted from ED, ED revisit, Discharged from ED.]

[Pie chart: Diagnosis for Cholecystectomy patients, 2006-2013.]

HISTOGRAM AND SCATTER PLOT

[Scatter plot: DISC measurement by group. Both axes run from 0 to 70; axis label: Right. Groups: Control, No Surgery, Gamma-knife, Resection.]

BOX-PLOT

One continuous variable and one categorical variable

OTHER

THEORETICAL DISTRIBUTIONS

Type of Outcome Variable                Theoretical Distribution
Continuous numeric                      Normal, Log-normal, Exponential, …
Discrete numeric                        Poisson, Negative Binomial, …
Binary                                  Bernoulli, Binomial, …
Categorical with multiple categories    Multinomial, Hypergeometric, …

CONFIDENCE INTERVALS

[Diagram: Population and Sample.]

Population parameters: μ, σ, π, …

Sample statistics: X̄, s, p̂, …

A point estimate alone is not enough: it gives us no way to judge how accurate it is as an estimator. A confidence interval provides a better estimate by combining the point estimate with its standard error to define a range of values that is likely to cover the true value of the parameter.

A confidence interval starts with the point estimate and adds a “margin of error.” A confidence interval is defined as: point estimate +/- margin of error.
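The "point estimate +/- margin of error" recipe can be sketched in a few lines of Python. The example assumes a normal approximation with the multiplier 1.96 for 95% confidence and uses the sample standard deviation in place of σ, which is a large-sample approximation; for small samples a t-based multiplier is usually preferred. The sample mean, standard deviation, sample size, and function name are all hypothetical.

```python
import math


def normal_ci(point_estimate, standard_error, z=1.96):
    """Return (lower, upper) = point estimate -/+ z * standard error."""
    margin = z * standard_error
    return point_estimate - margin, point_estimate + margin


# Hypothetical sample summary: mean 100, standard deviation 15, n = 36.
xbar, s, n = 100.0, 15.0, 36

standard_error = s / math.sqrt(n)  # 15 / 6 = 2.5
lower, upper = normal_ci(xbar, standard_error)

print(f"Standard error: {standard_error:.2f}")
print(f"95% CI for the mean: ({lower:.1f}, {upper:.1f})")
```

With these invented numbers the margin of error is 1.96 x 2.5 = 4.9, giving an interval of roughly 95.1 to 104.9.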

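95% CI for μ: P(-?? < μ < ??) = 0.95

Since by the central limit theorem,

$$\bar{x} \sim N\left(\mu_{\bar{x}},\ \sigma_{\bar{x}}^{2}\right), \qquad \mu_{\bar{x}} = \mu, \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}},$$

$$P\left(-1.96 \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95$$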