MGH/MPH301 H19 Epidemiology and Biostatistics with special reference to Social Epidemiology

ADNAN NOOR BALOCH SCHOOL OF PUBLIC HEALTH AND COMMUNITY MEDICINE THE SAHLGRENSKA ACADEMY

2019-09-10 MGH/MPH301 H19 Epidemiology and Biostatistics with special reference to Social Epidemiology

• Schedule – canvas
• Teaching material – canvas
• Computer lab – canvas
• Computer session
  – Be familiar with SPSS before
  – Read through instructions before
  – Due dates

The computer sessions report
• CS_1 due 11 September
• CS_2 due 20 September
• CS_3 due 25 September
• CS_4 due 2 October
• CS_5 due 9 October
• CS_6 due 18 October

• Examination: answering research questions by analyzing data

2 Tips on books & online resources

1. Medical Statistics: A textbook for the health sciences (Campbell; Machin; Walters) E-book @ GU-Library
2. SPSS for Applied Sciences (E-book @ GU-Library)
3. Statistics at Square One (BMJ Publishing Group)
4. Online resource to learn SPSS
5. Online resource to learn
6. Regression Methods in Biostatistics (Vittinghoff; Glidden; Shiboski; McCulloch) E-book @ GU-Library

3 LECTURE 1
A) CLASSIFICATION OF VARIABLES AND THE CHOICE OF ANALYSIS
B) SAMPLE SIZE CALCULATIONS

ADNAN NOOR BALOCH SCHOOL OF PUBLIC HEALTH AND COMMUNITY MEDICINE THE SAHLGRENSKA ACADEMY

2019-09-10 What is statistics?

• The science that studies the methods necessary to compile, analyze and interpret data.

• In research, statistical theory can be used as a help when constructing studies, collecting data and drawing conclusions.

• Medical statistics deals with applications of statistics in medicine, health sciences and epidemiology.

5 Why use statistics? Population and sample

• There is a variation in the results
• Practically (and/or ethically) impossible to collect data from the whole population
• How to deal with uncertainty in a sample?
• How to use what we observe in a sample to make inferences about the population?

6 Population and sample (based on figure 1.2 in the JB)

Population parameters
• Mean value: μ (e.g. average survival time)
• Variance: σ²

Inference (“an educated guess”): sample data are used to draw conclusions about the population, i.e. the sample statistics are estimates of the population parameters.

Sample statistics
• Mean value: x̄
• Variance: s²

7 Steps when planning a study

8 Planning (sample size determination)

How many individuals should be included is a very important issue in the design (planning) of a study.

Factors to consider are:

• The minimum difference of interest to be detected, or the precision of the estimate.

• The degree of variability among the individuals.

• The statistical power, i.e. the ability to detect the smallest difference of interest.

9 Variables used for different purposes

Direct object of interest

• Outcome variable (response/dependent): blood pressure, survival, incidence, quality of life

Indirect object of interest (can affect outcome variable)

• Treatment variable (predictor/independent/covariate): treatment group, dose, frequency, date of treatment

• Background variables (independent variable or covariate): age, sex, clinical variables, education, etc.

10 Data collection: types of variables & Levels of measurement

11 Types of scales

• Nominal scale: name, gender, color, brand. The “lowest” level; no particular order.

• Ordinal scale: Good-Better-Best, age category. Has a natural order.

• Interval scale: temperature. Has a natural order and a defined “distance” (values can be negative, 0 or positive).

• Ratio scale: weight. Has a natural order, a defined “distance” and a true zero value.

12 Statistical inference

Statistical inference is drawing conclusions about the population based on characteristics of the sample. There are two types of statistical inference:
1. Hypothesis testing (using a test/CI)
2. Estimation

13 Hypothesis testing

The idea is that we have a question about the population that we want to try to answer with a statistical test. A requirement of any statistical test is to formulate a null hypothesis (H0) and an alternative hypothesis (H1).

14 Hypothesis testing

1. Formulate
   i. Null hypothesis (H0): often cautiously formulated; can never be proven; considered true until disproved
   ii. Alternative hypothesis (H1): includes everything except the null hypothesis
2. Test the hypothesis using a statistical test, which results in a p-value/confidence interval
3. If the p-value is below the level of significance (decided a priori), reject H0; otherwise we do not have enough evidence to reject H0
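The three steps can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: it uses a two-sided z-test (normal approximation) and an observed difference of 1.2 with standard error 0.79, numbers borrowed from the weight-loss example later in this lecture. The function name is hypothetical.

```python
from statistics import NormalDist

def two_sided_p_value(diff, se):
    """p-value of a two-sided z-test of H0: true difference = 0."""
    z = abs(diff) / se
    return 2 * (1 - NormalDist().cdf(z))

alpha = 0.05                      # step 1: significance level, decided a priori
p = two_sided_p_value(1.2, 0.79)  # step 2: test the hypothesis
reject_h0 = p < alpha             # step 3: compare the p-value with alpha
print(f"p = {p:.3f}, reject H0: {reject_h0}")
```

With these numbers the p-value is above 0.05, so there is not enough evidence to reject H0.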

15 Parametric and non-parametric tests

The choice depends on the type (quantitative, qualitative) and the distribution of the outcome variable.
• Parametric tests: the outcome variable is, e.g., normally distributed; mean and standard deviation are parameters. You test assumptions about the parameters at population level (e.g. the mean).
• Non-parametric tests: you don’t know (can’t assume) any distribution for the outcome variable. You test assumptions about the population independent of any distribution (e.g. the whole distribution).

16 17 Discovering Statistics Using IBM SPSS Statistics by Andy Field 5th Edition ISBN-13:978-1-5264-1952-1

18 Type of errors

                      True state of nature
Our decision          H0 is true            H1 is true
Do not reject H0      Correct decision      Type-2 error (β)
Reject H0             Type-1 error (α)      Correct decision (power) (1 − β)

• Type 1 error: rejecting H0 when H0 is true (“false positive”). Its probability is α.

• Type 2 error: not rejecting H0 when H0 is false (“false negative”). Its probability is β.

• The power of a test (1 − β) is the probability of rejecting H0 when H0 is false.

19 Significance Level

The significance level (α) should be chosen before any analyses are done. It specifies the highest acceptable risk of erroneously rejecting H0 (i.e. making a type 1 error). E.g. the significance level corresponding to a 95% CI is 0.05 (1 − 0.95), or 5%. We accept that 5% of our CIs won't cover the true value and hence lead us to a wrong conclusion.
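The statement that about 5% of 95% CIs miss the true value can be checked with a small simulation. The sketch below is illustrative only: the population (mean 10, SD 2), the sample size and the number of repetitions are hypothetical, and the interval uses the known SD for simplicity.

```python
import random
from statistics import NormalDist, mean

random.seed(1)                                  # fixed seed so the run is repeatable
TRUE_MEAN, SD, N, RUNS = 10.0, 2.0, 30, 2000    # hypothetical population and design
z = NormalDist().inv_cdf(0.975)                 # about 1.96 for a 95% CI

misses = 0
for _ in range(RUNS):
    sample = [random.gauss(TRUE_MEAN, SD) for _ in range(N)]
    half_width = z * SD / N ** 0.5              # known-SD interval, for simplicity
    if abs(mean(sample) - TRUE_MEAN) > half_width:
        misses += 1                             # this CI does not cover the true mean

miss_rate = misses / RUNS
print(f"share of 95% CIs missing the true mean: {miss_rate:.3f}")
```

The printed share should land close to 0.05, which is the accepted risk α.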

20 Statistical Power

In a study you want as high a power as possible. There are two ways to increase the power:
1. Select a higher significance level (α)
2. Collect a larger sample

The problem is that:
1. You don’t usually want a higher significance level (why not?)
2. It can be expensive to collect a larger sample
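Both ways of increasing the power can be seen numerically. The sketch below uses a normal approximation for a two-sample comparison of means with equal group sizes; the numbers (difference 2, SD 4, 30 or 60 per group) and the function name are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

def approx_power(delta, sd, n_per_group, alpha):
    """Approximate power of a two-sided two-sample z-test with equal group sizes."""
    norm = NormalDist()
    se = sd * (2 / n_per_group) ** 0.5      # SE of the difference in means
    z_crit = norm.inv_cdf(1 - alpha / 2)    # critical value for the chosen alpha
    return 1 - norm.cdf(z_crit - abs(delta) / se)

base = approx_power(delta=2, sd=4, n_per_group=30, alpha=0.05)
higher_alpha = approx_power(delta=2, sd=4, n_per_group=30, alpha=0.10)
larger_sample = approx_power(delta=2, sd=4, n_per_group=60, alpha=0.05)
print(round(base, 2), round(higher_alpha, 2), round(larger_sample, 2))
```

Both alternatives give a higher power than the base case, but the higher α also means a higher risk of a type 1 error.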

21 Experimental Design

Design of Experiments is the area of statistics that helps us plan experiments and decide how to gather data to achieve good (or optimal) inference

1. How large does a sample need to be so that a confidence interval will be no wider than a given size?

2. How large does a sample need to be so that a hypothesis test will have a low p-value if a certain alternative hypothesis is true?

22 Experimental Design

Sample size is dependent on

1. How large effects (differences) are we looking for? (d or Δ)

2. How big is the variation (std. deviation) in data?

3. Significance level, (Type-1 error)

4. Power (usually 80 %; 90 %)

5. Study design: balanced/unbalanced; cohort with two or more groups to compare; longitudinal data; number of covariates (explanatory / independent variables), etc.

23 Sample size calculation

The needed sample size is calculated under the assumption that a specific alternative hypothesis is true, i.e. that there is a true effect of size d.
• You have to make assumptions about:
  – The size of the standard deviation.
    • Often based on earlier studies with the same variable.
• Make the following choices in advance:
  – The smallest effect you want to detect, d or δ.
  – Which method to use
    • e.g. t-test
  – Significance level; the lower it is, the larger the sample
    • Often 5%, i.e. α = 0.05
  – The intended power
    • Often 80%, i.e. 1 − β = 0.80.

24 Sample size calculation – 2 sample mean

Let's say that I’m going to do a study where I want to be at least 80% sure to find any effect of size d (or larger), if I use a test with a 5% significance level. This means that for an effect (group difference) of size d, I want 1 − β = 0.8 and α = 0.05. With these criteria I can, before the study starts, calculate how many participants I need (i.e. the size of n).

25 Standardized effect size

One important aspect is the variability in the outcome variable
• E.g. if the outcome is similar within both groups (low std. deviation), a smaller sample is needed to detect a given effect size

• It is common practice to use the standardized effect size (Δ), which is

$$\Delta = \frac{\delta}{\sigma} = \frac{\text{Anticipated difference}}{\text{Anticipated std deviation}}$$

Campbell, M. J., Machin, D., & Walters, S. J. Medical statistics: a textbook for the health sciences.

26 Standardized effect size

Standardized effect size   Label          Interpretation
0.20                       Small-medium   An average member of the intervention group had a better outcome than 60% of the control group members
0.50                       Medium-large   An average member of the intervention group had a better outcome than 70% of the control group members
0.80                       Large          An average member of the intervention group had a better outcome than 80% of the control group members
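The percentages in the table can be approximated from the standard normal distribution. The sketch below assumes (this is not stated on the slide) that both groups are normally distributed with the same standard deviation and that a higher value is the better outcome; the function name is hypothetical.

```python
from statistics import NormalDist

def share_of_controls_below_average_treated(effect_size):
    """P(control outcome < mean of intervention group) when both groups are
    normal with equal SD and the means differ by `effect_size` SDs."""
    return NormalDist().cdf(effect_size)

for es in (0.2, 0.5, 0.8):
    print(es, f"{share_of_controls_below_average_treated(es):.0%}")
```

Under these assumptions the shares come out slightly below the rounded 60%, 70% and 80% in the table.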

Cohen J, Statistical Power Analysis for the Behavioral Sciences, 2nd Edition (1988)

27 Sample size calculation for mean difference

We want to calculate the needed sample size for detecting a difference between two independent groups

$$m = 16 \cdot \frac{\sigma^2}{\delta^2} = \frac{16}{\Delta^2}$$

for 2-sided α = 0.05 and power of 80% (formula 14.1 on p 267*), where m is the number of patients per group, σ the anticipated standard deviation and δ the anticipated difference.

A trial of cognitive behavioural therapy. Outcome = Hospital Anxiety and Depression scale (HADS) with values 0 (not anxious or distressed) to 21 (very anxious or distressed). A change of 2 points is clinically important. Previous published studies: standard deviation (σ) = 4 points.

Δ = δ/σ = 2/4 = 0.5 (‘moderate’ effect) gives m = __ patients per group, or 2 × __ patients in all.

28 *Another formula

We want to calculate the needed sample size for detecting a difference between two independent groups with the same standard deviation.

$$m = 2\sigma^2 \cdot \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\delta^2} = 2\left(\frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)\sigma}{\delta}\right)^2$$

z_k: the kth percentile of the standard normal distribution (i.e. the critical value in a normal distribution table for α = k).
σ: the assumed sample standard deviation.
m: the required sample size per group.
1 − β: the desired power.
α: the desired significance level.
δ: the smallest difference in group means that we want to detect with probability (1 − β), using a t-test with significance level α.
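The general formula and the shortcut m = 16/Δ² can be compared on the HADS example (δ = 2 points, σ = 4 points). This is a minimal sketch using the normal approximation; the function names are hypothetical and results are rounded up to whole patients.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """m = 2 * sigma^2 * (z_{1-alpha/2} + z_{1-beta})^2 / delta^2, rounded up."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_beta = norm.inv_cdf(power)
    return ceil(2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2)

def n_per_group_rule_of_16(delta, sigma):
    """Shortcut m = 16 / Delta^2 with Delta = delta / sigma (alpha 0.05, power 80%)."""
    return ceil(16 * sigma ** 2 / delta ** 2)

# HADS example from the slides: delta = 2 points, sigma = 4 points
print(n_per_group(2, 4), n_per_group_rule_of_16(2, 4))
```

The constant 16 is 2 × (1.96 + 0.84)² ≈ 15.7 rounded up, so the shortcut gives a slightly larger sample than the general formula.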

29 Sample size calculation for proportions/prevalence

The Student Union plans to perform a telephone survey to estimate the proportion of unsatisfied students at Sahlgrenska Academy. How large a sample is needed for a 95% confidence interval to have a width of 12% (i.e. within ±6%)?

$$m = \frac{\text{Expected proportion} \times (1 - \text{Expected proportion})}{SE(\text{Expected proportion})^2}, \qquad SE(\text{Expected proportion}) = \frac{\text{Width of CI}}{2 \times 1.96}$$
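The two formulas can be combined into one small function. The slide does not state the expected proportion, so the value 0.5 used below is an assumption; it is the most cautious choice because p(1 − p) is largest at p = 0.5. The function name is hypothetical.

```python
from math import ceil

def n_for_proportion_ci(expected_proportion, ci_width, z=1.96):
    """Sample size so that a CI for a proportion has the requested total width."""
    se = ci_width / (2 * z)               # SE implied by the target width
    return ceil(expected_proportion * (1 - expected_proportion) / se ** 2)

# Width 12% (±6%) as in the slide; 0.5 is an ASSUMED expected proportion
print(n_for_proportion_ci(0.5, 0.12))
```

Any other expected proportion gives a smaller required sample for the same width.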

30 *Campbell, M. J., Machin, D., & Walters, S. J. (2010). Medical statistics: a textbook for the health sciences.

31 Binary data – Labour in water

Ninety-nine pregnant women with dystocia (difficult childbirth or labour) were randomised in an RCT to receive ‘immersion in water in a birth pool’ vs ‘Standard’ care, to evaluate the impact of labouring in water during the first stage of labour.
• Intervention: 49 women
• Standard: 50 women

Epidural analgesia
               Yes          No
Intervention   23 (47 %)    26 (53 %)
Control        33 (66 %)    17 (34 %)

Sample size calculation for 50% vs 65%? For 2-sided α = 0.05 and power of 80% with a 15 percentage point reduction in outcome.

Solve from table 14.1 on next slide!
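As a cross-check of the table lookup, the sample size can also be computed with a commonly used normal-approximation formula for comparing two proportions. This is a sketch with a hypothetical function name and not necessarily the exact method behind table 14.1, so the result may differ slightly from the table.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided comparison of two proportions."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_beta = norm.inv_cdf(power)
    p_bar = (p1 + p2) / 2                              # average of the two proportions
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# 50% vs 65%, two-sided alpha = 0.05, power = 80%
print(n_per_group_two_proportions(0.50, 0.65))
```

The function returns the number of women per group; the total is twice that.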

32 Binary data

33 Example: RTW among cancer survivors

We are interested in a study about RTW among cancer survivors with two-sided α = 0.05 and power = 80%. Two groups, Group-1 vs Group-2:
i. Equal groups
ii. Unequal groups

• How large a sample is needed to detect at least a 20% difference in risk between the groups, with Grp-1 = 62% RTW?

34 Example: RTW among cancer survivors

Computed N per Group (EQ)      Computed N Total (3:1)
Index   Power   N              Index   Power   N
1       0.20    36             1       0.20    108
2       0.30    59             2       0.30    172
3       0.40    83             3       0.40    236
4       0.50    109            4       0.50    304
5       0.60    138            5       0.60    384
6       0.70    174            6       0.70    476
7       0.80    221            7       0.80    600
8       0.90    295            8       0.90    792
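The equal-groups column can be approximated with a normal-approximation formula for two proportions. The sketch rests on an assumption that is not stated on the slides: the "20% difference in risk" is read as a relative difference, i.e. 62% vs 74.4% RTW. Under that reading the computed values agree with the N per group column; the 3:1 column is not reproduced here. The function name is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group(p1, p2, power, alpha=0.05):
    """Per-group sample size, equal groups, two-sided test of two proportions."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_beta = norm.inv_cdf(power)           # negative when power < 0.5
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

p1 = 0.62
p2 = p1 * 1.20    # ASSUMPTION: "20% difference" read as relative, i.e. 74.4%
powers = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
sizes = [n_per_group(p1, p2, pw) for pw in powers]
for pw, n in zip(powers, sizes):
    print(pw, n)
```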

35 Example: RTW among cancer survivors

36 Power Calculation: 2-Sample mean

$$\text{Power} \approx 1 - \Phi\left(z_{1-\alpha/2} - \frac{|\bar{d}|}{SE(\bar{d})}\right)$$

|d̄|: the absolute value of the sample mean difference between the groups.
Φ(k): cumulative distribution function for the standard normal distribution. Gives us the probability of getting a value of X smaller than or equal to k, given that X follows a standard normal distribution.
z_k: the kth percentile of the standard normal distribution (i.e. the critical value in a normal distribution table for α = k).
SE(d̄): the standard error of the sample mean difference.
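The power formula translates directly into code. The sketch below uses the numbers of the weight-loss example that follows (difference 1.2, standard error 0.79); the function name is hypothetical.

```python
from statistics import NormalDist

def approx_power(mean_diff, se_diff, alpha=0.05):
    """Power ~ 1 - Phi(z_{1-alpha/2} - |d| / SE(d)) for a two-sample comparison."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    return 1 - norm.cdf(z_crit - abs(mean_diff) / se_diff)

power = approx_power(1.2, 0.79)
print(f"{power:.2f}")
```

Halving the standard error (e.g. through a larger sample) in the call above shows how quickly the power rises.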

37 Power Calculation example: weight loss

Difference, mean (d̄): |−3.3 − (−2.1)| = 1.2
SD pooled: 3.49
SE(d̄): 0.79

$$\text{Power} \approx 1 - \Phi\left(1.96 - \frac{1.2}{0.79}\right) = 1 - 0.67 = 0.33$$

The probability of obtaining a sample that rejects H0: d̄ = 0 if in fact d̄ = 1.2 (and SE(d̄) = 0.79) is about 33%. In 33% of the experiments of this sample size conducted on this population, if H0 is in fact false, we will be able to reject it.

How can we increase this probability? E.g. by increasing the sample sizes → decrease in SE(d̄). The larger the sample, the larger the power.

38 Some toolboxes for power and sample size calculations

OpenEpi (http://www.openepi.com/Menu/OE_Menu.htm) Epidemiological / statistical analysis and calculations using OpenEpi. 1. download & install 2. work online

WINPEPI (Program for Epidemiologists for Windows) Very good tool for epidemiologists; download free at http://www.brixtonhealth.com/pepi4windows.html. Performs a variety of statistical tests, both for planning and analysis. Use the Compare2 / PAIRSetc modules to calculate power and sample size.

SAS Power & Sample Size (PSS) 3.1 SAS PSS is a simple and user-friendly tool, but it requires some background in statistics.

Epi Info™: free tool from CDC (USA), user-friendly and enables you to quickly create databases and analyse and report them.

39 Some toolboxes for power and sample size calculations

SISA (Simple interactive Statistical Analysis) (http://www.quantitativeskills.com/sisa/index.htm)

A free resource for different statistical and epidemiological calculations. You can easily navigate between different pages for statistical analyses; power and sample size estimations.

Although all of these resources have been tested thoroughly, use is at your own risk.

Adnan

40 Suggested readings

Suggested reading
i. Campbell, M., Machin, D., & Walters, S. (2010). Chapter 14. Sample size issues. Medical Statistics: A Textbook for the Health Sciences. New York: John Wiley & Sons, Incorporated. Available online https://ebookcentral-proquest-com.ezproxy.ub.gu.se/lib/gu/detail.action?docID=624690#

Optional reading
i. Gelman, A., & Hill, J. (2006). Chapter 20. Sample size and power calculations. Data Analysis Using Regression and Multilevel/Hierarchical Models (Analytical Methods for Social Research, pp. 437-456). Cambridge: Cambridge University Press. Available online https://ebookcentral-proquest-com.ezproxy.ub.gu.se/lib/gu/detail.action?docID=288457
ii. Liu, G. (2005). Sample Size in Epidemiologic Studies. In Encyclopedia of Biostatistics (eds P. Armitage and T. Colton). Uploaded in same folder

41 Exercise session / computer lab

Read the questions in Exercise 1 & Exercise 3 and try to solve them with paper/pen and/or with some statistical toolbox. All the formulas/solutions are given in these lecture slides (30, 32) & tables 14.1 & 14.2

We will discuss the solutions and you will solve Exercise 2 during the computer session on 11th of September
