
Brief outline of basic statistics

Lecturer: Maia Lesosky, University of Cape Town ([email protected]). Thanks to Landon Myer for slides.

“These are the topics that need to be covered”

1. Definitions of mean, median and mode
2. Interpretation of confidence intervals and p-values
3. Sensitivity and specificity
4. Positive and negative predictive values
5. Interpretation of parametric and non-parametric data tests commonly used: Student's t-test, chi-square analysis, Fisher's exact test, ANOVA
6. Correlation
7. Risk reduction and numbers needed to treat
8. Relative risk and hazard ratio
9. …
10. Interpretation of kappa values
11. …

Principles of this talk

• More detail here than you need → will go quickly; you must stop to ask questions
• These are the basics → you will be asked to apply them
• Many things here are gross simplifications (you want this)
• Terminology varies → have kept as general as possible, noted synonyms

Outline

I. Measurements & distributions (20%)
 – Describing distributions
II. Making comparisons (70%)
 – Statistical tests, p-values & CI
 – Regression, survival analysis
III. Evaluating measurements (10%)
 – Validity, reliability

I. Measurements & distributions

Measurements

• Broadly 3 kinds of measurements in health sciences (“variables”)

– Numeric measures
 • Continuous
 • Discrete

– Categorical measures
 • Polytomous (ordinal vs nominal)
 • Binary

– Time-based measures
 • Time-to-event, survival

Examples

• Systolic blood pressure
• Mortality
• ALT
• TNM staging
• Gender/sex
• Time to remission

Note: can make categories of continuous measures.

Example: haemoglobin
• Continuous measure: g/dL (numeric)
• Polytomous measure: low, medium, high (categorical)
• Binary measure: low, high (categorical)

Distributions

• When we take measurements on many patients, we can describe measures as distributions

• How we describe a distribution will depend on the kind of measure

Categorical distributions

• Described using frequency distributions – the distribution of a categorical measure can be given in terms of counts or percentages

(Counts vs percentages: different units, same conclusions)

Continuous distributions

• We also describe these in terms of frequency distributions
• But we have many "categories" ("bins")
• We draw summaries to describe the shape of the distribution

There are other ways to show shapes of distributions

• 'Box-and-whisker' plots

Many different possible shapes for continuous distributions

Skewed distributions

• Negative (left) skew
 – Long "tail" of the distribution is to the left
 – Bulk of observations shifted right

• Positive (right) skew
 – Long "tail" of the distribution is to the right
 – Bulk of observations shifted left

There are some "classic" distributions

Ways to describe a distribution

• Measure of central tendency – where the distribution clusters

• Measure of dispersion – how spread out the distribution is

Measures of central tendency

• Mean – arithmetic average value

• Median – 50th percentile, middle value

• Mode – most commonly occurring value

Quantiles (regular intervals of a distribution)

• Percentiles: 1, 2, 3, 4, 5, 6, 7, 8 … 94, 95, 96, 97, 98, 99, 100
• Deciles: 1-10, 11-20, 21-30 … 81-90, 91-100
• Quintiles: 1-20, 21-40, 41-60, 61-80, 81-100
• Quartiles: 1-25, 26-50, 51-75, 76-100
• Tertiles: 1-33, 34-67, 68-100

Describing distributions

Note: if the data are normally distributed, the mean is a good measure of central tendency. If the data are non-Normal, the median is a better measure of central tendency.

Measures of dispersion ('spread')

• Range – minimum value to maximum value

• Variance – average squared distance between each point and the mean

• Standard deviation – Square root of variance

• Interquartile range (IQR) – 25th percentile to 75th percentile

(Figure: 3 distributions with the same mean value, but different variances)

We have a favourite distribution

Normal distribution ~ “Gaussian distribution”, “standard normal distribution”

Remember: standard deviation is just the square root of variance We like the Normal distribution because it has some well-defined features

In a Normal distribution, 95% of the data falls within 1.96 standard deviations of the mean value

(here, 95.46% within 2 standard deviations)

• This is where a 95% confidence interval comes from
 – 95% confidence interval around a mean value is ± 1.96 standard deviations around the mean value
 – Sometimes we are lazy and call it ± 2 SD around the mean

• 95% CI is a generic concept
 – It will come up elsewhere (same concept, different application)
• We often manipulate ("transform") variables so that we can make them "Normal"
 – Common manipulations include logarithms, square roots, or squares
 – Eg, log HIV viral loads
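To illustrate why the log transform helps, here is a minimal sketch using simulated data (not the lecture's): a lognormal measure standing in for viral load is heavily right-skewed on the raw scale, but close to Normal after taking logs.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated "viral loads": right-skewed by construction (lognormal)
viral_load = rng.lognormal(mean=10, sigma=1.5, size=500)

log_vl = np.log10(viral_load)  # the usual "log viral load" scale

# Raw scale: mean dragged far above the median by the long right tail;
# log scale: mean and median nearly coincide (roughly Normal)
print(f"raw: mean={viral_load.mean():,.0f}  median={np.median(viral_load):,.0f}")
print(f"log: mean={log_vl.mean():.2f}  median={np.median(log_vl):.2f}")
```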

There are many different standard distributions with well-defined features

• Gaussian (Normal) is most common

• Others
 – Z-distribution, T-distribution, F-distribution
 – Chi-squared (χ2) distribution
 – Binomial distribution (for categorical data)
 – Poisson distribution (for counts of things)

• If the distribution of our measures follows a known distribution, we can make assumptions about our data based on the rules of the known distribution
 – Eg, if our data are normally distributed, we know that 95% of data fall within 2 standard deviations of the mean value
• These kinds of statistics are parametric statistics

Non-parametric statistics

• If our measures really don't look like any known distribution, we can't make assumptions about them based on any standard distribution
 – We have to work with the actual values of our measurements
• These are non-parametric statistics

There are parametric and non-parametric approaches to describing distributions

• If data are normally distributed
 – Mean and standard deviation (or variance) used to describe distributions

• If data are not normally distributed
 – Median and interquartile range (or just range) used to describe distribution

II. Making comparisons

• Sometimes we want to compare 2 distributions to each other
 – Are the distributions different from each other?
 – Is there an association between the two measures?
• We can ask this question about different combinations of measures
 – Continuous measures (Normal or non-normal distributions)
 – Categorical measures (polytomous or binary)

Example: comparison of 2 distributions

Question: Is cholesterol associated with gender?

(Figures: serum cholesterol among women; serum cholesterol among men)

Example: comparison of 2 binary measures

(Figures: patients without TB; patients with TB)

Question: Is TB disease associated with death?

Statistical hypothesis testing

• There are different statistical tests that are applied in different situations to answer the question
 – Are the distributions of one variable different according to another variable?
 …which is the same thing as…
 – Is there an association between one measure and another measure?

• Different tests all give rise to p-values

A statistical test for every situation

• Comparing 2 continuous variables to each other
 – Correlation coefficient
• Comparing 2 categorical variables to each other
 – Chi-square test, Fisher's exact test
• Comparing a binary categorical variable to a continuous variable
 – Student's t-test (parametric ~ if continuous variable is normally distributed)
 – Wilcoxon rank-sum test (= Mann-Whitney U-test) (non-parametric ~ if continuous variable not normally distributed)
• Comparing a polytomous categorical variable to a continuous variable
 – ANOVA (parametric ~ if continuous variable is normally distributed)
 – Kruskal-Wallis test (non-parametric ~ if continuous variable not normally distributed)

Correlation coefficient

• Correlation coefficients (usually “r”) used to examine association between 2 continuous variables

This graph is sometimes called a “scatterplot”
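A minimal sketch of computing r in Python, using made-up paired measurements (the variable names and values are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
weight = rng.normal(70, 10, size=100)                 # kg (hypothetical)
sbp = 90 + 0.5 * weight + rng.normal(0, 8, size=100)  # mmHg, linearly related

r, p = stats.pearsonr(weight, sbp)  # Pearson's correlation coefficient
print(f"r = {r:.2f}, p = {p:.3g}")
```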

Chi-squared tests

• Used to examine the association between 2 categorical variables

          Dead   Alive
No TB      26     128
TB+        67      91

Note: Chi-square tests are parametric and used for larger sample sizes

Fisher's exact tests

          Dead   Alive
No TB       2       6
TB+         7       5

For smaller sample sizes we replace chi-squared tests with Fisher's exact tests (non-parametric). They do the same thing, but with different formulae and many more calculations.

• Small sample size ~ table contains <60 total, or any cell <5
• Chi-squared tests and Fisher's exact tests can be used to compare
 – 2 binary variables to each other (2x2)
 – Binary versus polytomous (eg, 2x3)
 – Polytomous versus polytomous (eg, 4x5)
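As a sketch, both tests can be run on the two tables from the slides above with scipy (note scipy.stats.chi2_contingency applies the Yates continuity correction to 2x2 tables by default):

```python
from scipy import stats

# Larger table: chi-squared test appropriate
large = [[26, 128],   # no TB: dead, alive
         [67,  91]]   # TB+:   dead, alive
chi2, p, dof, expected = stats.chi2_contingency(large)
print(f"chi-squared p = {p:.3g}")

# Smaller table: Fisher's exact test preferred (cells < 5)
small = [[2, 6],   # no TB: dead, alive
         [7, 5]]   # TB+:   dead, alive
odds_ratio, p_exact = stats.fisher_exact(small)
print(f"Fisher's exact p = {p_exact:.3f}")
```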

(Student's) t-test

• Used to compare 2 normal distributions (parametric test)

• Whether 2 distributions are different depends on the size of the difference in means AND how much variability is present

Wilcoxon rank-sum test

(= Mann-Whitney U-test)
• Non-parametric test
• Compares 2 non-normal distributions
 – The non-parametric version of the t-test
• "Comparing means": t-test
• "Comparing medians": rank-sum test
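A sketch of both tests in Python, on two simulated groups (the values are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.normal(5.0, 1.0, size=60)  # hypothetical continuous measure
group_b = rng.normal(5.6, 1.0, size=60)  # second group, shifted mean

t_stat, p_t = stats.ttest_ind(group_a, group_b)     # parametric
u_stat, p_u = stats.mannwhitneyu(group_a, group_b)  # non-parametric
print(f"t-test p = {p_t:.4f}; rank-sum (Mann-Whitney) p = {p_u:.4f}")
```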

ANOVA

• ANOVA = analysis of variance
• Used to test for any difference in mean values for >2 distributions

Parametric – requires Normally distributed data

Is there any difference between these 3 distributions?

Kruskal-Wallis test

• Extension of Wilcoxon rank-sum test for comparing >2 groups at once

• (Recall: Wilcoxon rank-sum test = Mann-Whitney U-test)
• Non-parametric version of the ANOVA
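A sketch of both >2-group tests on three simulated groups (values invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
g1 = rng.normal(5.0, 1.0, size=40)
g2 = rng.normal(5.3, 1.0, size=40)
g3 = rng.normal(5.9, 1.0, size=40)

f_stat, p_f = stats.f_oneway(g1, g2, g3)  # ANOVA (parametric)
h_stat, p_h = stats.kruskal(g1, g2, g3)   # Kruskal-Wallis (non-parametric)
print(f"ANOVA p = {p_f:.4g}; Kruskal-Wallis p = {p_h:.4g}")
```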

A statistical test for every situation (recap)

• Comparing 2 continuous variables to each other
 – Correlation coefficient

• Comparing 2 categorical variables to each other
 – Chi-square test, Fisher's exact test

• Comparing a binary categorical variable to a continuous variable
 – Student's t-test (parametric)
 – Wilcoxon rank-sum test (non-parametric)

• Comparing a polytomous categorical variable to a continuous variable
 – ANOVA (parametric)
 – Kruskal-Wallis test (non-parametric)

Relative risks

• Often clinical research seeks to understand whether patients with some pre-existing status ('exposure') may be more/less likely to develop some subsequent health outcome
 – Cohort studies
 – Randomised controlled trials

• Imagine a trial randomising 100 patients to receive drug A and 100 patients to receive drug B, then following them over time to observe survival

          Dead   Alive
Drug A     12      88
Drug B     37      63

• We could calculate a chi-square test here, but it is not very useful clinically (it only tells us whether the groups differ, not by how much)
• Often we prefer to calculate the relative risk (risk ratio or rate ratio)

Relative risk

Proportion of all the exposed (here, drug A) patients developing the outcome
divided by
Proportion of all the unexposed (here, drug B) patients developing the outcome

          Dead   Alive
Drug A     12      88      12 / (12 + 88) = 0.12
Drug B     37      63      37 / (37 + 63) = 0.37

RR = 0.12 / 0.37 = 0.32
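A sketch of this calculation in Python, including a 95% CI computed by the usual large-sample method on the log scale (the exact method behind the interval quoted on the next slide is not stated, so the numbers may differ slightly after rounding):

```python
import math

a, b = 12, 88  # drug A: dead, alive
c, d = 37, 63  # drug B: dead, alive

risk_a = a / (a + b)  # 0.12
risk_b = c / (c + d)  # 0.37
rr = risk_a / risk_b  # ~0.32

# Large-sample 95% CI for RR, computed on the log scale
se = math.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))
lo = math.exp(math.log(rr) - 1.96 * se)
hi = math.exp(math.log(rr) + 1.96 * se)
print(f"RR = {rr:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```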

Interpreting the relative risk

• Relative risk is how much more (or less) likely the health outcome is in one group relative to the other
 – Here, death is 0.32 times as likely (ie, less likely) in patients receiving drug A relative to patients receiving drug B
• Note: if the risk of the outcome is the same in both arms, the relative risk is 1
• If 'exposure' is protective, RR < 1
• If 'exposure' is detrimental, RR > 1

Confidence intervals (again)

• We can calculate confidence intervals (CI) around this relative risk – Here, the interval is (0.19 – 0.61)

• The CI gives a range of estimates for the RR that the observed data (from the table) are consistent with

– Narrow CI ~ precise estimate of RR (good)

– Wide CI ~ imprecise estimate of RR (bad)

Absolute risk reduction (risk difference)

• Like relative risk, but subtract instead of divide

Proportion of all the exposed (here, drug A) patients developing the outcome
minus
Proportion of all the unexposed (here, drug B) patients developing the outcome

          Dead   Alive
Drug A     12      88      12 / (12 + 88) = 0.12
Drug B     37      63      37 / (37 + 63) = 0.37

Risk difference = 0.12 − 0.37 = −0.25

• Absolute risk reduction (risk difference) tells us how the risk of the health outcome changes when the exposure is taken away
 – Here, risk of death drops by 0.25 (25%) when patients receive drug A compared to drug B

• Note: if the risk of the outcome is the same in both arms, the risk reduction (risk difference) is 0
• If 'exposure' is protective, risk difference < 0
• If 'exposure' is detrimental, risk difference > 0

Numbers needed to treat

• The average number of patients who need to receive an intervention (here, Drug A) to prevent 1 outcome from happening

• Calculated as 1 / (risk reduction)

• Here, 1 / 0.25 = 4
 – On average, 4 patients need to receive drug A instead of drug B to prevent 1 death
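The same arithmetic as a short sketch (the slide's risk difference is −0.25; the NNT uses its magnitude):

```python
risk_a = 12 / (12 + 88)  # 0.12 (drug A)
risk_b = 37 / (37 + 63)  # 0.37 (drug B)

risk_difference = risk_a - risk_b  # -0.25, as on the slide
nnt = 1 / abs(risk_difference)     # 4 patients
print(f"risk difference = {risk_difference:.2f}, NNT = {nnt:.0f}")
```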

P-values

• P-values provide a measure of "statistical significance" from any statistical test that compares 2 things

– “universal currency” of statistical comparison

– Helps us understand the role of chance in explaining an association

Interpreting p-values

• P-values ~ probabilities ~ range from 0 - 1

• P-value's formal definition is based on hypothesis testing
 – Evaluates the data against a null hypothesis
 • Null hypothesis ~ usually that there is no association between variables
 – P-value: the probability of observing the data in your study if the null hypothesis is true

Practical interpretation of p-values

• Large p-value:

The association observed between the 2 variables in your data is consistent with the hypothesis of no association between variables

– “Association is not statistically significant”

– "No statistically significant difference"

• Small p-value: the association observed between the 2 variables in your data is NOT consistent with the hypothesis of no association
 – "Association is statistically significant"
 – "Statistically significant difference"

• Smaller p-value → association less consistent with a chance finding
 – "Statistically significant" = not consistent with chance

Small vs large

• We traditionally use 0.05 as a cut-off for a "statistically significant" p-value
• This is an arbitrary rule-of-thumb
 – 0.048 is not very different from 0.053
• Another guide
 – >0.1 = not statistically significant
 – 0.05-0.1 = approaching statistical significance
 – 0.001-0.05 = statistically significant
 – <0.001 = highly statistically significant

Sample sizes

• "Statistical significance" (the size of a p-value) is determined by a few things, most importantly

– The size of the difference in the measure you are looking at

AND

– The number of patients (sample sizes) involved

For example

Small sample (n = 20):
          Dead   Alive
No TB       2       8
TB+         6       4

Large sample (n = 200):
          Dead   Alive
No TB      20      80
TB+        60      40

The proportions in the 2 tables are the same (calculate the risk ratios to see this)
But the p-value for the small table is 0.17
The p-value for the large table is <0.001
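A quick check of this point with scipy (chi2_contingency applies the Yates correction to 2x2 tables, which here gives p ≈ 0.17 and p < 0.001, matching the slide):

```python
from scipy import stats

small = [[2, 8], [6, 4]]      # no TB / TB+: dead, alive (n = 20)
large = [[20, 80], [60, 40]]  # same proportions, 10x the patients

_, p_small, _, _ = stats.chi2_contingency(small)
_, p_large, _, _ = stats.chi2_contingency(large)
print(f"small table: p = {p_small:.2f}; large table: p = {p_large:.2g}")
```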

P-values and CI

• P-values and CI are closely related
 – Calculated from the same place
 – A small p-value suggests a narrow CI (more precise, good)
 – A large p-value suggests a wide CI (less precise, bad)
• CI for an RR that does not overlap 1 means the corresponding p-value is <0.05 ('statistically significant')

Example: interpret the following

• RR = 1.9; 95% CI = 1.4-2.8; p = 0.008
 – Outcome about 2x more common in exposed vs unexposed; narrow CI, statistically significant
• RR = 0.8; 95% CI = 0.2-4.8; p = 0.37
 – Outcome slightly less common in exposed vs unexposed; wide CI, not statistically significant
• RR = 1.02; 95% CI = 0.5-2.0; p = 0.98
 – Not much difference in frequency of outcome between exposed and unexposed, not statistically significant

Regression models

• All the statistical tests we have looked at so far only look at the association between two variables at a time

• But sometimes we want to look at the associations between >2 variables at once
 – Regression models commonly used for this

Concept of regression

• Equation used to predict an outcome variable (y) according to one or more predictor variables (x's)
• Basic equation for a line: Y = intercept + slope * X

Here we're interested in the slope → the relationship between X and Y

Note: there are many different kinds of regression

Application of regression models in medical research

• Regression models are used to look at how multiple factors combine to predict a health outcome
 – Especially adjustment for other variables
• An equation like Y = intercept + (slope*X) + (slope*R) + (slope*Z) would be used to understand how a certain health outcome (Y) is predicted by 3 different factors (X, R, Z), as in the sketch below
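A minimal sketch of fitting such an equation by least squares on simulated data (the names X, R, Z follow the slide; the values and coefficients are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=n)          # e.g. a standardised continuous predictor
R = rng.normal(size=n)          # a second continuous predictor
Z = rng.integers(0, 2, size=n)  # a binary predictor (yes/no)

# Outcome simulated from a known "true" model plus noise
Y = 1.0 + 0.5 * X + 0.2 * R + 0.8 * Z + rng.normal(0, 1, size=n)

# Design matrix: column of ones for the intercept, then the predictors
design = np.column_stack([np.ones(n), X, R, Z])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
print(f"intercept = {coef[0]:.2f}; slopes = {coef[1]:.2f}, {coef[2]:.2f}, {coef[3]:.2f}")
```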

Survival analysis

• Survival analysis uses time-to-event measures
• "Survival" can mean time until death
 – Or any other specific outcome: remission, cure, relapse, need for admission
 – Any binary outcome studied over time
• Note: survival analysis comes from cohort studies or RCTs
 – Need to follow patients over time

Kaplan-Meier plots

Kaplan-Meier survival analyses are used to compare survival in 2 groups over time
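A from-scratch sketch of the Kaplan-Meier product-limit estimate, on a small made-up dataset (time in months; event = 1 means death observed, 0 means censored):

```python
import numpy as np

time = np.array([2, 3, 3, 5, 8, 8, 9, 12, 14, 15])   # follow-up (months)
event = np.array([1, 1, 0, 1, 1, 1, 0,  1,  0,  1])  # 1 = death, 0 = censored

survival = 1.0
for t in np.unique(time[event == 1]):         # each distinct death time
    at_risk = np.sum(time >= t)               # still being followed at t
    deaths = np.sum((time == t) & (event == 1))
    survival *= 1 - deaths / at_risk          # product-limit step
    print(f"t = {t:>2} months: S(t) = {survival:.3f}")
```

Plotting S(t) as a step function for each of 2 groups gives the familiar Kaplan-Meier plot.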

Hazard ratios

• There is a particular kind of regression model for survival analysis: Cox's proportional hazards model
• The model gives us hazard ratios
 – Like the distance between 2 survival curves
 – Interpreted exactly like relative risks
• So how would you interpret:
 – HR > 1
 – HR < 1
 – HR = 1

III. Evaluating measurements

Evaluating a new test

• We often want to know how well a certain test performs in detecting a condition of interest
 – We may be interested in screening for a condition or diagnosing it
• Test of interest may be
 – Laboratory assay, radiological investigation
• We want to know how the test performs
 – Identifying those with disease, those without

• We study this by comparing the new test to an established gold-standard (representing the truth)

              True pos   True neg
Test pos         A          B          A+B
Test neg         C          D          C+D
                A+C        B+D

(B = false positives; C = false negatives)

Ways of evaluating a new test

• Sensitivity
 – The proportion of people who truly have disease who are detected correctly by the test
• Specificity
 – The proportion of people who truly do not have disease who are detected correctly by the test

These are features of tests, but not actually what is needed by a clinician at the bedside

• Positive predictive value
 – What proportion of people who have a positive test truly have disease
• Negative predictive value
 – What proportion of people who have a negative test truly do not have disease

These are of greater clinical interest, but are problematic in reality

• Sensitivity = A / (A+C)
 – High sensitivity means few false negatives
• Specificity = D / (B+D)
 – High specificity means few false positives

              True pos   True neg
Test pos         A          B          A+B
Test neg         C          D          C+D
                A+C        B+D

• Positive predictive value = A / (A+B)
 – High PPV means few false positives
• Negative predictive value = D / (C+D)
 – High NPV means few false negatives

              True pos   True neg
Test pos         A          B          A+B
Test neg         C          D          C+D
                A+C        B+D

A+C B+D Example: raised IFN-γ in detecting culture-confirmed TB in smear-negative patients

                 TB culture pos   TB culture neg
Raised IFN-γ           75               125           200
Normal IFN-γ           25               175           200
                      100               300

Calculate:
 – Sens
 – Spec
 – PPV
 – NPV
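A short sketch working through the calculations the slide asks for, mapping this table onto the generic A/B/C/D layout above:

```python
# IFN-γ table in the generic layout: A = 75, B = 125, C = 25, D = 175
A, B = 75, 125   # raised IFN-γ: TB culture pos, TB culture neg
C, D = 25, 175   # normal IFN-γ: TB culture pos, TB culture neg

sens = A / (A + C)   # 75/100  = 0.75
spec = D / (B + D)   # 175/300 ≈ 0.58
ppv = A / (A + B)    # 75/200  ≈ 0.38
npv = D / (C + D)    # 175/200 ≈ 0.88
print(f"Sens {sens:.0%}  Spec {spec:.0%}  PPV {ppv:.0%}  NPV {npv:.0%}")
```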

Interpretations: Sens, Spec

• Sensitivity and specificity of raised IFN-γ are 75% and 58%, respectively, in detecting culture-confirmed TB

– This means that 75% of patients with culture-confirmed TB will have raised IFN-γ (test pos)

– And that 58% of patients without disease will not have raised IFN-γ (test neg)

Interpretations: PPV, NPV

• PPV and NPV of raised IFN-γ are 38% and 88%, respectively, in detecting culture-confirmed TB

– This means that 38% of patients with a raised IFN-γ will truly have culture-confirmed TB

– And that 88% of patients with a low IFN-γ will truly not have TB

General rules

• Higher sensitivities usually mean lower specificities (good tests are high on both)

• High sens → few false negatives → high NPV
 – If a test has a high sens, a negative result helps rule out disease (SnOUT)
• High spec → few false positives → high PPV
 – If a test has a high spec, a positive result helps rule in disease (SpIN)

Reliability

• There are some situations in which there is no "gold standard" to compare with
• In these situations we instead compare the reliability (repeatability) of measures
• Examples:
 – radiologists identifying lesions on a scan
 – pathologists identifying malignancy on a biopsy
 – psychiatrists making any diagnosis

                          A: positive   A: negative
Radiologist B: positive        31            19           50
Radiologist B: negative        21            79          100
                               52            98          150

• Here it's not clear which rater is the gold standard → can't really calculate sens, spec, etc.

Kappa

• Instead we want to see the amount of agreement between the raters

• Kappa = the degree of agreement of 2 raters above chance
 – Measure of test-retest reliability
 – Range: −1 (perfect disagreement) to 1 (perfect agreement)
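A short sketch computing Cohen's kappa from scratch for the radiologist table above (observed agreement is 110/150 ≈ 0.73; kappa works out to about 0.41 once chance agreement is removed):

```python
table = [[31, 19],   # B positive: A positive, A negative
         [21, 79]]   # B negative: A positive, A negative

n = sum(sum(row) for row in table)
p_observed = (table[0][0] + table[1][1]) / n  # raw agreement (diagonal)

# Chance-expected agreement from the row and column margins
row_tot = [sum(row) for row in table]
col_tot = [table[0][j] + table[1][j] for j in range(2)]
p_expected = sum(row_tot[i] * col_tot[i] for i in range(2)) / n**2

kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"observed agreement = {p_observed:.2f}, kappa = {kappa:.2f}")
```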