Outline
• Why statistics? Statistics for non-statisticians
• Descriptive statistics: populations and samples; types of errors
• Inferential statistics: hypothesis testing; statistical errors; p-value; confidence intervals
• Multiplicity issues; types of tests; sample size
• Multivariate analysis; more on p-values
• Conclusion: "little shop of horrors"

Marco Pavesi, Lead Statistician, Liver Unit – Hospital Clínic i Provincial, Barcelona
Ferran Torres, Statistics and Methodology Support Unit, Hospital Clínic Barcelona; Biostatistics Unit, School of Medicine, Universitat Autònoma de Barcelona (UAB)
Induction and Truth: Bertrand Russell presents…
Intro: why should we learn statistics?
The inductivist turkey
Troubles for the plain researcher:
. Induction and statistics are NOT a method to obtain a sort of mathematical demonstration of Truth
. The results observed in a population sample are not necessarily true for the whole population

Smart turkeys / researchers…
1) …are aware that the relevance (weight) of statistical inferences always depends on the sample size
2) …know that we can only model / estimate the real world with a specific approximation error
3) …understand that true hypotheses do not exist; we can only reject or keep a hypothesis based on the available evidence
What is statistics? So, why statistics? To account for chance and variability!

• "I know (I am assuming) that these dice are fair: what is the probability of getting a 1 on all 15 throws?" ==> Probability mathematics
• "I have got a 1 on all 15 throws. Are these dice fair?" ==> Inferential STATISTICS
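The two directions can be made concrete in a couple of lines of Python (a sketch; the 15-throw count comes from the slide's example):

```python
# Probability question: ASSUMING the dice are fair, how likely is a 1 on all 15 throws?
p_fair = (1 / 6) ** 15  # about 2e-12: essentially never by chance

# The inferential question runs the other way: having SEEN fifteen 1s,
# such a tiny probability under "the dice are fair" is strong evidence against fairness.
```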
Why is statistics needed?
• Statistics tells us whether events are likely to have happened simply by chance
• Statistics is needed because we always work with sample observations (variability) and never with whole populations
• Statistics is the only means to predict what is likely to happen in new situations, and it helps us make decisions

Introduction to descriptive statistics
Populations and samples: random vs. systematic error

Example: Systolic Blood Pressure (mm Hg)
[Figure: samples 01–05 drawn from the study population (itself drawn from the target population). With random error, sample estimates scatter around the true value (130–170 mm Hg scale); with systematic error (bias), they are consistently shifted away from it.]
What statistics? Descriptive statistics
. Position statistics (central tendency measures): mean, median
. Dispersion statistics: variance, standard deviation, standard error
. Shape statistics: symmetry, skewness and kurtosis measures

The mean and the median
• Arithmetic mean (average)
• Median: the value such that 50% of sample individuals have a value higher than or equal to it
. Odd n: 1, 3, 3, 4, 6, 13, 14, 14, 18 → median = 6
. Even n: 1, 3, 3, 4, 6, 13, 14, 14, 17, 18 → median = (6 + 13) / 2 = 9.5
• Unlike the median, the mean is affected by outliers (a new outlier shifts the mean but barely moves the median)
• Especially relevant for specific distributions (e.g. survival times)
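The slide's numbers can be reproduced with Python's standard library; the outlier value 500 is an illustrative addition, not from the slide:

```python
import statistics

odd_n = [1, 3, 3, 4, 6, 13, 14, 14, 18]
even_n = [1, 3, 3, 4, 6, 13, 14, 14, 17, 18]

median_odd = statistics.median(odd_n)    # middle value: 6
median_even = statistics.median(even_n)  # average of the two middle values: (6 + 13) / 2 = 9.5

# Unlike the median, the mean chases outliers:
with_outlier = even_n + [500]
mean_before = statistics.mean(even_n)           # 9.3
mean_after = statistics.mean(with_outlier)      # jumps to ~53.9
median_after = statistics.median(with_outlier)  # barely moves: 13
```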
Dispersion measures
• The variance is the mean of the squared differences from the distribution mean
• The standard deviation is the square root of the variance
• The standard error is generally expressed via the ratio between the variance and the sample size: SE² = σ² / N, i.e. SE = σ / √N
• It can be regarded as the true SD of the sampling distribution of the population mean (or parameter) estimate

Inference & tests
• Inferential statistics:
. Draw conclusions (inferences) from incomplete (sample) data
. Allow us to make predictions about the target population based on the results observed in the sample
. Are computed in hypothesis testing
• Examples: 95% CIs, t-test, chi-square test, ANOVA, regression
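A minimal sketch of the three dispersion measures (the sample values are invented for illustration; note that Python's `statistics.variance` uses the n − 1 sample denominator rather than the plain mean of squared differences):

```python
import math
import statistics

sample = [130, 142, 151, 138, 149]  # e.g. systolic blood pressures, mm Hg

variance = statistics.variance(sample)  # sum of squared deviations / (n - 1)
sd = math.sqrt(variance)                # standard deviation
se = sd / math.sqrt(len(sample))        # standard error: SE^2 = variance / n
```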
Basic pattern of statistical tests: how many noise units?
• The test statistic and the sample size (degrees of freedom) convert to a probability, or p-value.
• Based on the total number of observations and the size of the test statistic, one can determine the p-value.
Overall hypothesis testing flow chart
1) Compute the test statistic value
2) Obtain the corresponding p-value (from a known distribution)
3) Compare it with the significance level α (defined in advance)
4) p < α → reject the null hypothesis; p >= α → keep the null hypothesis

Introduction to inferential statistics
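The flow chart can be sketched end to end in a few lines. This sketch uses a two-sample z-test (a normal approximation, simpler than the t-test and reasonable for large samples); all numbers are invented:

```python
import math

def two_sample_z_test(mean1, mean2, sd, n1, n2, alpha=0.05):
    """Test statistic -> p-value (known distribution) -> comparison with alpha."""
    se = sd * math.sqrt(1 / n1 + 1 / n2)   # standard error of the difference
    z = (mean1 - mean2) / se               # 1) test statistic value
    p = math.erfc(abs(z) / math.sqrt(2))   # 2) two-sided p from the normal distribution
    decision = "reject H0" if p < alpha else "keep H0"  # 3)-4) compare with alpha
    return z, p, decision

z, p, decision = two_sample_z_test(140.0, 150.0, sd=15.0, n1=50, n2=50)
```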
Extrapolation
Sample → study results → inferential analysis (statistical tests, confidence intervals) → population "conclusions"
Statistical inference
• Statistical tests => p-value
• Confidence intervals

Valid samples?
[Figure: samples drawn at random from the population are likely to resemble it; a very unrepresentative sample is unlikely to occur by chance, and an invalid (biased) sample leads to invalid conclusions.]
P-value: an intuitive definition
• The p-value is a "tool" to answer the question: could the observed results have occurred by chance*?
• The p-value is the probability of having observed our data when the null hypothesis is true
• p < .05: "statistically significant"
• Steps:
1) Calculate the treatment difference in the sample (A − B)
2) Assume that both treatments are equal (A = B), and then…
3) …calculate the probability of obtaining a difference at least as large as the one observed, given assumption 2
4) Conclude according to that probability:
a. p < 0.05: the differences are unlikely to be explained by chance, so we assume the treatment explains them
b. p > 0.05: the differences could be explained by chance, so we assume chance explains them
• Remember:
– The decision is made given the results observed in a SAMPLE…
– …and extrapolated to the POPULATION
*: accounts exclusively for random error, not bias
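The four steps above are exactly what a permutation test does; a minimal sketch (the data values are invented):

```python
import random

def permutation_p_value(a, b, n_perm=5000, seed=0):
    """P(observing a group difference at least this large, assuming A = B)."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))  # step 1: sample difference
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):                            # step 2: pretend A = B ...
        rng.shuffle(pooled)                            # ... by reshuffling group labels
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            extreme += 1                               # step 3: at-least-as-large differences
    return extreme / n_perm                            # step 4: the p-value

p = permutation_p_value([5, 7, 6, 9], [6, 8, 7, 10])  # heavy overlap -> large p expected
```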
HYPOTHESIS TESTING: an RCT from a statistical point of view
• Randomisation splits one homogeneous population into treatment A and treatment B (control)
• Testing two hypotheses:
. H0: A = B (null hypothesis – no difference)
. H1: A ≠ B (alternative hypothesis)
• Calculate the test statistic based on the assumption that H0 is true (i.e. there is no real difference)
• The test gives us a p-value: how likely are the collected data if H0 is true?
• If this is unlikely (small p-value), we reject H0 and conclude there are 2 distinct populations
RCT: statistical significance / confidence
• A > B with p < 0.05 means: "I can conclude that the higher values observed with treatment A vs. treatment B are linked to the treatment rather than to chance, with a risk of error of less than 5%" (extrapolating from the sample to the population)
Factors influencing statistical significance
• Signal: the difference (effect size)
• Noise (background): the variance (SD)
• Quantity: the quantity of data
• With ↑n or ↓variability ⇒ ↓p

P-value
• A "very low" p-value does NOT imply:
. Clinical relevance (NO!!!)
. A large magnitude of the treatment effect (NO!!)
• Please never compare p-values!! (NO!!!)
P-value: THE BASIC IDEA
• A "statistically significant" result (p < .05) tells us NOTHING about clinical or scientific importance; only that the results were unlikely to be due to chance alone.
• Statistics can never PROVE anything beyond any doubt, just beyond reasonable doubt!!
• A p-value does NOT account for bias: because we work with samples, it accounts for random error only
Type I & II error & power
• Type I error (α)
. False positive
. Rejecting the null hypothesis when in fact it is true
. Standard: α = 0.05
. In words: the chance of finding statistical significance when in fact there truly was no effect
• Type II error (β)
. False negative
. Accepting the null hypothesis when in fact the alternative is true
. Standard: β = 0.20 or 0.10
. In words: the chance of not finding statistical significance when in fact there was an effect
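These definitions can be checked by simulation: draw both groups from the same population (so H0 is true by construction) and count how often p < 0.05. A sketch using a normal-approximation test, so the observed rate is close to, not exactly, 5%:

```python
import math
import random
import statistics

def type_i_error_rate(n_sims=2000, n=30, alpha=0.05, seed=1):
    """H0 is true in every simulation; the fraction of 'significant' results estimates alpha."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        a = [rng.gauss(0, 1) for _ in range(n)]  # both groups drawn from
        b = [rng.gauss(0, 1) for _ in range(n)]  # the SAME population
        se = math.sqrt(statistics.variance(a) / n + statistics.variance(b) / n)
        z = (statistics.mean(a) - statistics.mean(b)) / se
        if math.erfc(abs(z) / math.sqrt(2)) < alpha:  # two-sided normal p-value
            false_positives += 1
    return false_positives / n_sims

rate = type_i_error_rate()  # close to 0.05
```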
Type I & II error & power
• Power
. 1 − Type II error (1 − β)
. Usually given as a percentage: 80% or 90% (for β = 0.2 or 0.1, respectively)
. In words: the chance of finding statistical significance when in fact there is an effect

95% CI
• Better than p-values…
. …use the data collected in the trial to give an estimate of the treatment effect size, together with a measure of how certain we are of our estimate
• A CI is a range of values within which the "true" treatment effect is believed to be found, with a given level of confidence
. A 95% CI is a range of values that will contain the 'true' treatment effect 95% of the time (over repeated studies)
• Generally, a 95% CI is calculated as: sample estimate ± 1.96 × standard error
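The last formula in a few lines of Python (the blood-pressure values are invented; the normal-approximation 1.96 is the slide's own multiplier):

```python
import math
import statistics

def ci95(sample):
    """Sample estimate +/- 1.96 x standard error (normal approximation)."""
    estimate = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return estimate - 1.96 * se, estimate + 1.96 * se

# Invented systolic blood pressures (mm Hg)
low, high = ci95([132, 145, 151, 139, 148, 150, 141, 137, 144, 146])
```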
Interval estimation: superiority study
[Figure: 95% CIs plotted against d = 0 (no difference); control better (d < 0, − effect) to the left, test better (d > 0, + effect) to the right.]
Multiplicity
[Figure: the same superiority-study 95% CI diagram (control better, d < 0; test better, d > 0), now with multiple comparisons.]
Lancet 2005; 365: 1591–95
Multiplicity can arise at every stage: design, conduct, results.
Interim analyses in the CDP (Month 0 = March 1966, Month 100 = July 1974)
Coronary Drug Project Mortality Surveillance. Circulation. 1973;47:I-1
http://clinicaltrials.gov/ct/show/NCT00000483?order=23
Sample Size
Lancet 2005; 365: 1657–61
Sample size
• The planned number of participants is calculated on the basis of:
. The expected effect of the treatment(s): ↗ effect ↘ number
. The variability of the chosen endpoint: ↗ variability ↗ number
. The accepted risks in the conclusions: ↗ risk ↘ number
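All three drivers appear explicitly in the standard two-proportion sample-size formula. A sketch (normal approximation with unpooled variance; dedicated planning software may differ by a few subjects):

```python
import math

def n_per_arm(p1, p2, z_alpha=1.96, z_beta=0.8416):
    """Participants per arm for comparing two proportions.
    Defaults: two-sided alpha = 0.05, power = 80% (beta = 0.20)."""
    effect = (p1 - p2) ** 2                      # bigger effect -> fewer subjects
    variability = p1 * (1 - p1) + p2 * (1 - p2)  # more variability -> more subjects
    return math.ceil((z_alpha + z_beta) ** 2 * variability / effect)

# The deck's later example: 50% vs 52%, alpha 0.05, beta 0.20 -> about 9,800 per arm
n = n_per_arm(0.50, 0.52)
```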
Normal vs. skewed distributions
• Parametric statistical tests can be used to assess variables that have a "normal", symmetrical bell-shaped distribution (histogram).
• Nonparametric statistical tests can be used to assess variables that are skewed or non-normal.
• To choose between the two families of inferential tests, look at a histogram of the data.

Examples of normal and skewed distributions
Parametric vs. nonparametric: the type of inferential test depends on the data

Parametric test | Nonparametric equivalent
Student's t-test | Mann-Whitney U test
One-way ANOVA | Kruskal-Wallis test
Paired t-test | Wilcoxon signed-rank test
Pearson correlation | Spearman's r
Correlated F ratio (repeated-measures ANOVA) | Friedman ANOVA

• Repeated measures?
. Unmatched groups (different subsets of the population in each condition): independent (unpaired) data
. Matched groups (the same individuals in each condition): dependent (paired) data
• Type of data
. Continuous Gaussian (metric): mean, SD, …
. Continuous non-Gaussian or ordinal (ranks: 1, 2, 3, …, 10): median, interquartile range
. Nominal (categories: 49% "yes", 33% "no", 18% "no opinion"): frequencies and percentages

Test-choice axes from the chart: quantitative independent variable with independent (unpaired) data; quantitative independent variable with dependent (paired) data; qualitative dependent variable.
A good rule to follow
• Always check your results with a nonparametric test (sensitivity analysis).
• If you test your null hypothesis with a Student's t-test, also check it with a Mann-Whitney U test.
• It will only take an extra 25 seconds.
• Use common sense and prior knowledge!!

Online tools for choosing a test:
• http://statpages.org/
• http://www.microsiris.com/Statistical%20Decision%20Tree/
• http://www.socialresearchmethods.net/selstat/ssstart.htm
• http://www.wadsworth.com/psychology_d/templates/student_resources/workshops/stat_workshp/chose_stat/chose_stat_01.html
• http://www.graphpad.com/www/Book/Choose.htm
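The Mann-Whitney check can even be hand-rolled from ranks. A sketch with a normal-approximation two-sided p-value (tie handling kept minimal, no continuity correction; the sample values are invented):

```python
import math

def mann_whitney_u(x, y):
    """U statistic for group x plus a two-sided normal-approximation p-value."""
    pooled = sorted(x + y)
    # average rank for each distinct value (handles ties)
    rank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n1, n2 = len(x), len(y)
    u = sum(rank[v] for v in x) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    return u, math.erfc(abs(z) / math.sqrt(2))

# Two clearly separated samples: U at its minimum, small p
u, p = mann_whitney_u([1, 3, 3, 4, 6], [13, 14, 14, 17, 18])
```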
2 or 3 more things on p-values
• P-values depend only on the magnitude of the test statistic computed from the observed (sample) data.
• They are related to the evidence against the null hypothesis and tell us how comfortable we should feel when we reject it.
• They are not related in any way to the clinical relevance of the "signal" (or effect, or difference, or whatever result) observed!!

Multivariate statistics: why and when?
Marco Pavesi, Lead Statistician, Liver Unit – Hospital Clínic i Provincial, Barcelona
Clinical study design chart
• Any intervention applied and recorded?
. YES → EXPERIMENTAL STUDY (e.g. randomized clinical trial)
. NO → repeated measurements taken?
– YES → PROSPECTIVE STUDY (e.g. cohort study designs)
– NO → CROSS-SECTIONAL STUDY (e.g. case-control study designs)

Randomization
1. Eliminates assignment bias
2. Tends to produce comparable groups for known and unknown, recorded and unrecorded factors
3. Adds validity (extrapolability) to the results of statistical tests

Sources of imbalance by design:
. Randomized: chance only
. Concurrent, prospective (non-randomized): chance & selection bias
. Historical, retrospective (non-randomized): chance, selection bias & time bias
Reference: Byar et al (1976) NEJM
Confounding
• No randomization: lack of homogeneity between groups in the distribution of risk (or protective) factors
• A potential confounder is:
. Associated with the outcome
. Associated with the main factor studied
. Not involved in the causal pathway between factor and outcome as an intermediate step
• Example: EXPOSURE (coffee intake) → OUTCOME (stroke), with smoking as the CONFOUNDING FACTOR associated with both

Interactions
• Effect modification: different risk (effect) estimates are associated with different strata of a specific factor
• Example: the outcome (e.g. death) associated with factor A differs by stratum — 20% in stratum 2 (e.g. female) vs. 7–10% in stratum 1 (e.g. male) — across strata of factor B (e.g. age < 65 vs. age ≥ 65)
Multivariate analysis and statistical models
• A model is "a simplified representation (usually mathematical) used to explain the workings of a real-world system or event" (Wikipedia)
• Two types of statistical models are used in clinical research / epidemiology:
. Predictive models
. Explanatory models
• Both are fitted by means of multivariate analysis techniques

Predictive models
• Used when we are interested in predicting the probability of a specific outcome, or the value of a specific dependent variable
• Focused on selecting the best subset of predictors and on the highest precision of the estimates
• The selection of predictors is based on their contribution to the predictive ability of the model (i.e., on p-values)
• Example: the Framingham equations to predict the probability of developing coronary events at 10 years (http://www.framinghamheartstudy.org/risk/index.html)
Explanatory models
• Study objective: to assess (estimate) the effect of a specific factor on the study outcome
• Multivariate analysis is aimed at getting the best (most valid) estimate of the studied effect; confounders must be accounted for in the model
• Evaluation of confounding variables is based on the change in model estimates, NOT ON STATISTICAL SIGNIFICANCE
• Rule of thumb: add each potential confounder into the model one by one, and keep only those modifying the estimate of the main factor by more than 10%

Framingham predictive equation for CHD
Estimated coefficients underlying the CHD prediction sheets using total cholesterol categories:

Variable | Men | Women
Age, y | 0.04826 | 0.33766
Age-squared, y | — | -0.00268
TC < 160 mg/dL | -0.65945 | -0.26138
TC 160–199 | Referent | Referent
TC 200–239 | 0.17692 | 0.20771
TC 240–279 | 0.50539 | 0.24385
TC >= 280 | 0.65713 | 0.53513
HDL-C < 35 mg/dL | 0.49744 | 0.84312
HDL-C 35–44 | 0.2431 | 0.37796
HDL-C 45–49 | Referent | 0.19785
HDL-C 50–59 | -0.05107 | Referent
HDL-C >= 60 | -0.4866 | -0.42951
Blood pressure: optimal | -0.00226 | -0.53363
Blood pressure: normal | Referent | Referent
Blood pressure: high-normal | 0.2832 | -0.06773
Stage I hypertension | 0.52168 | 0.26288
Stage II–IV hypertension | 0.61859 | 0.46573
Diabetes | 0.42839 | 0.59626
Smoker | 0.52337 | 0.29246
Baseline survival function at 10 years, S0(10) | 0.90015 | 0.96246
Linear predictor at risk-factor means | 3.09750 | 9.92545
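The coefficients combine through the usual Cox-model form, risk = 1 − S0(10)^exp(L − L̄), where L is the subject's linear predictor and L̄ its value at the risk-factor means. A sketch for the men's column, assuming that standard form (the example subject is invented; clinical use should rely on the published prediction sheets):

```python
import math

# Men's column from the table above
S0_10 = 0.90015   # baseline 10-year survival
L_MEAN = 3.09750  # linear predictor at risk-factor means

def chd_risk_10y_men(age, b_tc, b_hdl, b_bp, diabetes, smoker):
    """10-year CHD risk: 1 - S0(10) ** exp(L - L_MEAN)."""
    L = (0.04826 * age + b_tc + b_hdl + b_bp
         + 0.42839 * diabetes + 0.52337 * smoker)
    return 1 - S0_10 ** math.exp(L - L_MEAN)

# Invented example: 55-year-old male smoker, TC 200-239 (0.17692),
# HDL-C 35-44 (0.2431), high-normal BP (0.2832), no diabetes
risk = chd_risk_10y_men(55, 0.17692, 0.2431, 0.2832, diabetes=0, smoker=1)
```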
Outcome variables and statistical models: a summary table
• Continuous (normally distributed) outcome: ANOVA, ANCOVA, or linear regression model
• Binary (YES/NO): logistic regression
• Categorical (with a reference group): multinomial logistic regression
• Time-to-event (different follow-up times & censored cases): survival models (e.g. Cox proportional hazards)
• Number of counts: Poisson or negative binomial regression

Adjusting for confounders: an example
Some "take home" hints

The p-value…
…is the probability of a result like the one observed in our sample when the null hypothesis is true in the population (i.e., a result due simply to chance)
…is related to the evidence against the null hypothesis and to the reliability of the observed result
…DOES NOT TELL US ANYTHING ABOUT THE CLINICAL RELEVANCE OF THE RESULT WE HAVE OBSERVED!!

Marco Pavesi, Lead Statistician, Liver Unit – Hospital Clínic i Provincial, Barcelona
Interpretation of a p-value
• The higher the p-value, the higher the probability that the observed result is due simply to chance:
. p = 0.75: a 75% probability (3 out of 4 studies) of rejecting a true H0
. p = 0.015: a 1.5% probability (15 out of 1,000 studies) of rejecting a true H0
• A "small" p-value threshold (the significance level) is established conventionally as the highest rate of false-positive results that we consider acceptable (for instance, the common 5% rate)

Evidence and p-value: an example (1)
Drug A. Efficacy rate: 22%
Drug B. Efficacy rate: 11%
…observed results:
Drug A. Efficacy rate: 2 / 9
Drug B. Efficacy rate: 1 / 9
P-value = 0.98
Evidence and p-value: an example (2)
Drug A. Efficacy rate: 22%
Drug B. Efficacy rate: 11%
…observed results:
Drug A. Efficacy rate: 35 / 154
Drug B. Efficacy rate: 18 / 158
P-value = 0.008

Evidence and p-value: an example (3)
…on the other hand…
Drug A. Known efficacy rate: 50%
Drug B. Expected efficacy rate: 52%
Δ = 2%; Type I error: 0.05; Type II error: 0.20
N (per arm): 9,806
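The contrast between examples (1) and (2) can be reproduced. This sketch uses Fisher's exact test (the deck does not say which test produced its p-values, so the results here are close to, not identical to, 0.98 and 0.008):

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p for the 2x2 table [[a, b], [c, d]]:
    sum the probabilities of all tables no more probable than the observed one."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    def prob(x):  # hypergeometric probability of x responders in group 1
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
    p_obs = prob(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

p_small = fisher_exact_two_sided(2, 7, 1, 8)        # 2/9 vs 1/9: no evidence
p_large = fisher_exact_two_sided(35, 119, 18, 140)  # 35/154 vs 18/158: significant
```

Same 2:1 ratio of efficacy rates in both studies; only the sample size changes the evidence.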
Conclusion: little shop of horrors (1)
• "No significant difference is observed between the treatment arms. Conclusion: the treatments are equally effective…"
… AAAAAARGH!!!
• "Absence of evidence is not evidence of absence" (Altman DG, Bland JM. BMJ 1995;311:485)

Conclusion: little shop of horrors (2)
• "The p-value of the comparison A vs. placebo is lower than the p-value for the comparison B vs. placebo. Conclusion: treatment A is better than B…"
… AAAAAARGH!!!
• The p-value gives us a measure of the evidence against that specific null hypothesis in that specific hypothesis test.
Conclusion: little shop of horrors (3)
• A clinician speaking to the poor, helpless statistician: "Can we just test variable A vs. the rest of the variables and check if some difference is significant…?"
… AAAAAARGH!!!
• The Type I error rate increases rapidly with the number of hypothesis tests performed:
1 test: Type I error = 5% …… 5 tests: Type I error > 20%
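The numbers on the slide follow from assuming independent tests, each run at α = 0.05; the familywise error rate is then 1 − (1 − α)^k:

```python
def familywise_alpha(k, alpha=0.05):
    """Chance of at least one false positive across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

# 1 test -> 0.05; 5 tests -> ~0.226 (already above 20%); 14 tests -> ~0.51
```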