THE UNIVERSITY OF ADELAIDE

Biostatistics III

Lecture Notes

Associate Professor Patty Solomon

School of Mathematical Sciences

Semester 2, 2007

Contents

1 Introduction
  1.1 What is epidemiology?
  1.2 What are clinical trials?
  1.3 Randomization

2 The design and analysis of clinical trials
  2.1 Phases of trials
  2.2 Key aspects of trial design
  2.3 Methods of randomization
    2.3.1 Simple (or complete) randomization
    2.3.2 Restricted randomization
    2.3.3 Biased coin designs (BCD)
    2.3.4 Minimization
    2.3.5 Stratification
    2.3.6 Randomization tests
    2.3.7 Randomized consent designs
  2.4 Trial size
    2.4.1 Introduction
    2.4.2 Fixed trial size (non-sequential analysis)
    2.4.3 Sequential Trials
  2.5 Crossover trials
    2.5.1 Introduction
    2.5.2 Model formulation
    2.5.3 Analysis
  2.6 Equivalence trials

3 Epidemiology and observational studies
  3.1 Introduction
  3.2 Cohort Studies
  3.3 Case-control Studies
  3.4 Other designs
  3.5 Binary responses and case-control studies
  3.6 Estimation and inference for measures of association
    3.6.1 Finding the approximate variance in a cohort study
  3.7 Attributable risk
    3.7.1 Estimation of AR

4 Inference for the 2x2 table
  4.1 Introduction
  4.2 Wald tests
  4.3 Likelihood Ratio test
    4.3.1 Profile Likelihood
    4.3.2 Conditional Inference

5 Tests based on the likelihood
  5.1 Wald test statistic
  5.2 Likelihood ratio test statistic
  5.3 Score test statistic

Biostatistics III Course Coverage

• Design and analysis of clinical trials
• Statistical epidemiology

1 Introduction

1.1 What is epidemiology?

• There is no standard definition: but broadly, it is the study of death and diseases in human populations.

• Problem: epidemiology is not often experimental, and this leads to problems in statistical analysis and interpretation.

• → we can establish association, but not causation.

A common epidemiological question: is a particular disease or illness associated with age, sex, ..., or lifestyle factors, life experiences, or environmental factors, ...? For example:

• do mobile phones cause brain tumours?

• will human consumption of genetically modified crops lead to cancer later in life?

• which breast cancers are inherited (i.e., a case of nature versus nurture)?

An early example: Snow’s map of the London cholera epidemic, 1854. The greatest achievement of statistical epidemiology was establishing the link between smoking and lung cancer (before the biological link was observed).

Epidemiology encompasses:

• chronic disease epidemiology

• infectious disease epidemiology

• genetic epidemiology


• environmental epidemiology

• occupational epidemiology

• disease surveillance

... and so on.

Examples of chronic diseases: asthma, heart disease, cancer.

• Do radioactive particles cause childhood leukemia? New Scientist, 2004, 19/7

• Are there long term effects of eating GM crops? New Scientist, 2004, 26/7

• Does traffic pollution cause asthma?

• Will wearing ties make you go blind?

Examples of infectious diseases: measles, malaria, meningitis, SARS, influenza, HIV/AIDS

• Which MMR (measles, mumps, rubella) vaccination strategies are optimal?

• Are mosquito nets or insecticides more effective at preventing malaria?

• How great is the threat of bioterrorism? anthrax, smallpox

And diseases are global: if you catch cold in Africa, the first sneeze may be back in Adelaide!

HIV/AIDS remains one of the biggest threats:

• globally, it is one of the top five causes of death

• the main burden falls on developing nations

• in Swaziland, Botswana:

  - the infection rate is 40%
  - life expectancy is 38 years
  - 10% of households are headed by children


HIV/AIDS disease progression

HIV infection → seroconversion (antibodies detectable) → AIDS diagnosis → death; the interval from infection to AIDS diagnosis is the incubation period.

Incubation period for AIDS:

• median ∼ 10 years, and increasing

• long and variable

• treatment effects, AZT, HAART

[See Assignment 1 and AAO video # 26.]

1.2 What are clinical trials?

Clinical trials are designed medical experiments. They have a long history (see handout article from Encyclopedia of Biostatistics, 1998), although modern clinical trials date from the 20th century. Although not without problems and controversies of their own, clinical trials avoid the difficulties associated with statistical epidemiology. Examples:

• Would prescription heroin prevent long-term drug use? (People randomized to the methadone-only arm are likely to drop out.)

• Does tamoxifen prevent primary breast cancer in women?

The key step is randomization: the use of chance to allocate patients to treatments. The idea is: patients differ only by accidents of randomization, or the treatment they receive. Clinical trials enable us to establish causality.

The gold standard is a:

• randomized


• controlled

• double-blind (or single- or triple-blind)

clinical trial.

Example: Early AZT trial (AAO video #26). Randomized: patients were randomized to zidovudine or placebo. Controlled: the placebo group provided a baseline for comparison. Blind: neither patient nor doctor knew the treatment group; the analyst was also blinded (triple-blind).

What is the purpose of these features? Note, though, that the ‘gold standard’ is not always attainable.

1.3 Randomization

The first randomized experiments were in agriculture, in which the experimental units were plots of land, and the treatments were crops or fertilizers. The pioneering statistical work was by R.A. Fisher in the 1920s in agricultural experiments. An important difference: patient entry into clinical trials is ‘staggered’, often over many years, and the data usually accumulate gradually. This affects both the conduct and analysis of the trial. (If Fisher had worked in clinical trials, we may speculate that modern trial designs would have evolved 80 or more years ago!)

Illustration of randomization: Suppose the effects of two treatments, A and B, on lowering blood pressure are to be compared; the response Y is continuous. Suppose eight patients are available for the study. How should we allocate four patients to treatment A, and four to treatment B?

(1) Suppose the first four are given A, the next four, B

AAAABBBB

This is called the randomization list. How could this allocation lead to confounding of treatment effects?


(2) Try alternating A and B:

ABABABAB

But this also runs the risk of confounding (and potential selection bias). How?

We need an objective method of allocating treatments to patients.

(3) Best to use randomization, which means choosing an allocation at random, such that each of the $\binom{8}{4}$ possible arrangements is equally likely.

Randomization often enables us to obtain unbiased estimates of treatment differ- ences even in the presence of unsuspected systematic variation. To see how randomization works, consider the following. Our assumed model is

$$Y_{ij} = \alpha_i + \epsilon_{ij},$$
where

$i = A, B$ indicates treatment;

$j = 1, \ldots, 4$ indexes the patient within treatment;

$\alpha_i$ are the treatment effects;

$\epsilon_{ij}$ are measurement errors, i.i.d. with zero mean and $\mathrm{Var}(\epsilon_{ij}) = \sigma^2$;

$Y_{ij}$ is the response of patient $j$ receiving treatment $i$.

We want the treatment difference, so the ‘target quantity’ is $\alpha_A - \alpha_B$, and the natural estimator is $\bar{Y}_{A.} - \bar{Y}_{B.}$, where
$$\bar{Y}_{A.} = \frac{1}{4}\sum_{j=1}^{4} Y_{Aj}, \qquad \bar{Y}_{B.} = \frac{1}{4}\sum_{j=1}^{4} Y_{Bj}.$$

However, the true model is
$$Y_{ij} = \alpha_i + \gamma_{ij} + \epsilon_{ij},$$


where $\gamma_{ij}$ is a ‘patient effect’ representing (unknown) systematic variation, e.g., disease state at randomization.

We can demonstrate that under randomization, these effects average out. To do this, we need to study the statistical properties of $\bar{Y}_{A.} - \bar{Y}_{B.}$ under the true model. Now,
$$\bar{Y}_{A.} - \bar{Y}_{B.} = \alpha_A - \alpha_B + (\bar{\gamma}_{A.} - \bar{\gamma}_{B.}) + (\bar{\epsilon}_{A.} - \bar{\epsilon}_{B.}).$$

Thus, for any given (i.e., fixed) treatment allocation, the only variation is measurement error, so that
$$E(\bar{Y}_{A.} - \bar{Y}_{B.}) = \alpha_A - \alpha_B + \underbrace{(\bar{\gamma}_{A.} - \bar{\gamma}_{B.})}_{\text{nuisance component}}$$
since $E(\bar{\epsilon}_{A.}) = E(\bar{\epsilon}_{B.}) = 0$. That is, for a given treatment allocation, $\bar{Y}_{A.} - \bar{Y}_{B.}$ is a biased estimator of the true treatment difference.

We now take expectations of $E(\bar{Y}_{A.} - \bar{Y}_{B.})$ over the randomization distribution, which attaches probability $1/\binom{8}{4}$ to every possible treatment allocation (i.e., every possible sequence of A’s and B’s).

Intuitively, this implies that any four of the $\gamma_{ij}$ are equally likely to be in the same treatment group, i.e.,
$$E_R(\bar{\gamma}_{A.}) = E_R(\bar{\gamma}_{B.})$$
by symmetry, where $E_R$ denotes expectation with respect to the randomization distribution. This implies that $E_R(\bar{\gamma}_{A.} - \bar{\gamma}_{B.}) = 0$, so we obtain
$$E_R\{E(\bar{Y}_{A.} - \bar{Y}_{B.} \mid R)\} = \alpha_A - \alpha_B,$$
and known or unknown patient effects average out under randomization.

Remark 1: We can show that the usual estimate of standard error is approximately unbiased too.

In the usual situation in which $\gamma_{ij} = 0$, we know that
$$\mathrm{Var}(\bar{Y}_{A.} - \bar{Y}_{B.}) = \sigma^2\left(\frac{1}{n_A} + \frac{1}{n_B}\right)$$
and we estimate $\sigma^2$ by
$$s_p^2 = \frac{(n_A - 1)s_A^2 + (n_B - 1)s_B^2}{n_A + n_B - 2}.$$
Here,
$$s_p^2 = \frac{1}{6}\left\{\sum_{j=1}^{4}(Y_{Aj} - \bar{Y}_{A.})^2 + \sum_{j=1}^{4}(Y_{Bj} - \bar{Y}_{B.})^2\right\},$$
and we can show (but won’t) that $s_p^2(1/4 + 1/4)$ is an approximately unbiased estimator of $\mathrm{Var}(\bar{Y}_{A.} - \bar{Y}_{B.})$.

Remark 2: Randomization forms the basis of important classes of testing procedures known as randomization and permutation tests.
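To see the averaging argument concretely, here is a minimal simulation sketch (an illustration assumed for these notes, not taken from them; the treatment and patient effects below are arbitrary values). Any single allocation gives a biased expected difference, but the average over all $\binom{8}{4}$ allocations recovers $\alpha_A - \alpha_B$ exactly.

```python
# Sketch: averaging the estimator over the randomization distribution.
# The alpha and gamma values are arbitrary illustrative choices.
import itertools
import numpy as np

alpha_A, alpha_B = 2.0, 1.0                       # true treatment effects
gamma = np.array([3.0, -1.0, 0.5, 2.0, -2.0, 1.0, -0.5, -3.0])  # fixed patient effects

expected_diffs = []
for idx_A in itertools.combinations(range(8), 4):  # all C(8,4) = 70 allocations
    idx_B = [i for i in range(8) if i not in idx_A]
    # E(Ybar_A - Ybar_B) for this fixed allocation (errors have mean zero)
    expected_diffs.append(alpha_A + gamma[list(idx_A)].mean()
                          - alpha_B - gamma[idx_B].mean())

# individual allocations are biased, but the randomization average is not
print(min(expected_diffs), max(expected_diffs), np.mean(expected_diffs))  # mean = 1.0
```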

In summary: randomization

• protects against confounding variables and avoids bias (including selection bias)

• provides the basis for formal inference

• facilitates the use of blinding (or masking)

• facilitates the use of a control group.

See handout from the British Medical Journal, ‘Why randomize’, for reasons against non-random assignment, such as systematic allocation or historical controls.

Example: Systematic allocation, e.g., odd birthday → A, even birthday → B, invites selection bias. The physician will be able to determine in advance whether a potential patient will receive the treatment or control, and may use this information to exclude them from the study.

Example: Historical controls, e.g., current patients get the treatment, and pre-trial patients are used as controls. Problems? We don’t know that observed differences are due to the treatments. There may be temporal trends, changes in the definition of disease, improved diagnostic procedures, and so on, which argue against this sort of assignment.


2 The design and analysis of clinical trials

Key references: See ‘Notes for students’ for a number of important textbooks on clinical trials; see especially Pocock’s book, Piantadosi’s book, and Armitage, Berry and Matthews; see also the Encyclopedia of Biostatistics overview article on clinical trials (handout).

2.1 Phases of trials

Within the pharmaceutical industry and more broadly, clinical trials are classified into four types, each of which has well-defined objectives, as follows:

Phase I: trials are exploratory and concerned with aspects of clinical pharmacology and toxicity. An objective is to find a suitable dose level that avoids unacceptable adverse side effects. Usually the study consists of between 20 and 80 healthy volunteers, often pharmaceutical company employees or medical students.

Phase II: trials are pilot studies representing the initial clinical investigation. They are of moderate size, involving 100 to 300 diseased patients, and are concerned with evaluating efficacy and safety aspects of the drug or treatment.

∗Phase III: trials are the definitive, full-scale evaluation of the new treatment, in which effectiveness is verified, and the presence of any long-term adverse effects monitored. Patients are randomly assigned to the treatment or current standard (or placebo). These trials are typically large, often involve more than 1000 patients, and can last 3 to 5 years or longer, depending on recruitment rates and necessary follow-up time. In this phase, the statistical design and analysis come under most attention and scrutiny. These trials usually represent the final stage of testing which leads to the request to market the drug or treatment.

∗Phase IV: trials refer to further testing and monitoring of experience with the new treatment after it has been accepted and approved for general use. This is sometimes referred to as ‘post-marketing surveillance’, and these trials tend to be large, population-based studies.

Notes:

• The above categorization is not a strict one; the purpose of a trial can overlap the boundaries of these phases, especially II and III.

• The nomenclature was originally introduced for therapeutic trials, but is now used more widely, especially in disease prevention trials.

• There is an on-going ethical debate about if and when to randomize, and


for how long. The best basic principle: start randomizing early, although not necessarily at the beginning of drug testing, and continue to randomize for as long as legitimate uncertainty exists surrounding the safety and efficacy of the therapy, and about the best treatment for the patient.

2.2 Key aspects of trial design

‘Design’ encompasses all the structural aspects of the trial. This is an extremely important aspect of clinical trials - design flaws, such as the trial being too small, or the failure to record a key variable, cannot be corrected at the analysis stage. The main features are:

• The study population, which must be well-defined, with eligibility set out in the study protocol.

• The treatments to be evaluated, especially the choice of control group.

• The sample size, i.e., how many patients? Calculations usually use power arguments, and depend on the type of trial (fixed, sequential). There may be a stopping rule, or some other more flexible design; there may be interim analyses. (N.B.: prior sample size calculations are only a guide to an order of magnitude; there may well be logistical, financial or other constraints (a rare disease, for example) which need to be considered.)

• Method of randomization (i.e., method of treatment allocation), includ- ing methods of protecting the randomization code from being broken.

• Procedures for blinding and monitoring compliance.

• Type of trial. A trial can be one or more of the following. The simplest is the two-group

  - parallel group trial, e.g., A or B
  - crossover trial (2 treatments, 2 periods): 2 × 2, with G1: A → B and G2: B → A
  - factorial trial
  - equivalence trial: these trials are designed to show that a new treatment’s efficacy or safety is ‘the same as’ or ‘at least no worse than’ that of a standard treatment. This is different to the usual superiority trials which seek to find improved or superior therapies.
  - sequential trial, as opposed to a fixed size trial. Sequential or group-sequential trials enable stopping the trial when enough evidence has accrued.
  - other trial types, e.g., cluster-randomized trials.

• Outcome measures, which we use to assess treatment efficacy:

  - disease incidence
  - death rate
  - survival time (‘time to event’ data)
  - alleviation of symptoms
  - ...

A trial will usually have

• a detailed study protocol covering all aspects of the study, especially procedures for individual patient eligibility, and

• an operations manual, which specifies how the study is to be conducted.

The Data Safety Monitoring Board (DSMB) plays a key role in monitoring all aspects of the design, conduct, analysis and interpretation of the trial. See “Data monitoring committees in clinical trials” by Ellenberg, Fleming and DeMets, Wiley, 2002.

2.3 Methods of randomization

The aim is to generate a randomization list, then to allocate patients to treatment according to the list. The list can be generated in advance and kept in the trial coordination centre for example, or the allocation can be generated dynamically as the patients enrol. Historically, trialists used tables of random numbers. However, tables can be cumbersome for all but the simplest trials, and if staff or doctors have access to the tables, they may be able to find the sequence in use and predict the next assignment.


Nowadays, we typically use computer assignment by phone, fax, or encrypted code on-line. Ideally, the list should be verifiable. In Australia, the largest and best-known trial centre is the NHMRC Clinical Trials Centre, Sydney University, www.ctc.usyd.edu.au

2.3.1 Simple (or complete) randomization

Illustration: we could assign patients to treatments A and B by tossing a coin: Heads → A, Tails → B. However,

• this tends to be time consuming, impractical for large trials, and
• cannot be checked.

In practice, we use random-number generators; tables serve our purposes of practice and illustration here. To perform simple randomization, we apply a suitable rule to the stream of random numbers. Example: Table 5.2 from Pocock (handout)

• randomly choose a starting point in the table,
• obtain a ‘stream’ of random numbers by working across rows (or down columns) thereafter,
• apply a suitable rule to the stream of numbers.

For example, for 2 treatments A, B ($p = \frac{1}{2}$ for A or B):

0 − 4 → A
5 − 9 → B

Start at the top left-hand corner of Table 5.2:

0 5 2 7 8 4 3 7 4 ...
A B A B B A A B A ...   (randomization list)

Keep going to 20 patients, to get 8 on A and 12 on B.


This is unbalanced, but it could be worse. For example, the probability of obtaining 4 on one treatment and 16 on the other, or worse, is ≈ 0.0118. [See Tutorial 2.] The idea is that the treatment assignment is unpredictable, and in the long run, the sizes of the groups are roughly comparable (i.e., balanced). Example: assume 3 treatments A, B, C, to be allocated to patients with equal probability. A suitable assignment rule is

1 − 3 → A
4 − 6 → B
7 − 9 → C
0 ignore

Exercise: generate a randomization list for 15 patients.
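The digit rule can be sketched in code as follows (a minimal sketch assumed for illustration, not part of the notes; Python’s random generator stands in for Pocock’s Table 5.2, so the resulting lists will differ from the table-based ones above).

```python
# Sketch: simple randomization by applying a digit rule to a random stream.
import random

def simple_randomization(n_patients, rule, seed=1):
    """rule maps each digit 0-9 to a treatment label, or None (= ignore)."""
    rng = random.Random(seed)
    out = []
    while len(out) < n_patients:
        treatment = rule[rng.randrange(10)]
        if treatment is not None:          # skip 'ignore' digits
            out.append(treatment)
    return out

two_arm = {d: ("A" if d <= 4 else "B") for d in range(10)}
print("".join(simple_randomization(20, two_arm)))     # counts need not balance

three_arm = {0: None, **{d: "ABC"[(d - 1) // 3] for d in range(1, 10)}}
print("".join(simple_randomization(15, three_arm)))   # cf. the exercise above
```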

To summarize, we want:

• treatment assignment to be unpredictable
• overall balance.

Simple randomization is OK for > 200 patients and two treatments, in that the chance of severe imbalance is negligible. However, we also want reasonable balance

• at any time

• and within subgroups.

But simple randomization may not be adequate for these requirements. Although we will achieve balance in the long run with simple randomization, there may be occasional long sequences of one treatment (which induces an unwanted homogeneity amongst all patients recruited at that time).

2.3.2 Restricted randomization

Restricted randomization refers to schemes with enhanced balancing properties.

Random permuted blocks: (RPB)

• this is the easiest method

• guarantees equal numbers in each group after every block of r patients


• it is the most widely used method of restricted randomization.

[The Altman and Gore handout from the BMJ provides excellent background reading.] The scheme:

• if there are t treatments, choose k = number of replicates of each treat- ment per block, and take the block size to be r = kt

• for each block of size r = kt, choose a random permutation of treatments in which each treatment is replicated k times

• concatenate the blocks to form the randomization list.

Example: 2 treatments A, B; choose k = 1. In this case, there are two possible blocks: AB and BA. Using random digits 0 − 9 from Table 5.2, we can construct the randomization list as follows:

0 − 4 → AB 5 − 9 → BA

Thus, the sequence/stream is

0 5 2 7 8 4 ...
AB BA AB BA BA AB ...   (randomization list)

That is, after every second patient, there are equal numbers of patients on each treatment. This gives tight control over balance, but is predictable.

There is less predictability with $k = 2$; then $r = 2 \times 2 = 4$, and there are $\binom{4}{2} = 6$ possible arrangements of block length four:

AABB BBAA ABAB BABA ABBA BAAB

Increasing k will reduce predictability, and reduce balance.


A good way to reduce predictability further, without compromising balance, is to randomly choose a different k for each block, i.e., randomly vary block length.
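A minimal sketch of random permuted blocks (assumed for illustration, not from the notes), including the randomly varying block length just described:

```python
# Sketch: random permuted blocks with randomly chosen replication k per block.
import random

def permuted_blocks(n_patients, treatments=("A", "B"), k_choices=(1, 2), seed=2):
    rng = random.Random(seed)
    out = []
    while len(out) < n_patients:
        k = rng.choice(k_choices)        # randomly vary the block length r = k * t
        block = list(treatments) * k     # k replicates of each treatment
        rng.shuffle(block)               # random permutation within the block
        out.extend(block)
    return out[:n_patients]              # truncate the final block if necessary

rand_list = permuted_blocks(20)
print("".join(rand_list), rand_list.count("A"), rand_list.count("B"))
```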

Example: use Table 5.4 from Pocock to assign 3 treatments in blocks of 15 patients. Assign digits

1 − 5 → A
6 − 10 → B
11 − 15 → C
0, 16 − 19 ignore

Block 1: 11 19 15 5 9 0 6 13 7 2 ...
         C  −  C  A B −  B C  B A ...

Block 2: 14 12 0 1 19 8 7 17 11 ...
         C  C  − A −  B B −  C ...

and so on.

2.3.3 Biased coin designs (BCD)

• This is a method of dynamic allocation.

• Like random permuted blocks, the biased coin method controls balance in the randomization list, but is less predictable, and therefore less liable to selection bias.

• Like RPBs, BCDs avoid severe imbalance in small trials and within subgroups.

The idea is to compromise between a perfectly balanced experiment and the advantages of complete (i.e., simple) randomization. The process is as follows: assume we have two treatments,

T = active treatment, C = control.

Suppose that n patients have currently been allocated to treatment with


$T_n$ on T and $C_n$ on C.

Let $D_n = T_n - C_n$, and allocate the $(n+1)$th patient as follows:

If $D_n < 0$: allocate T with probability $p$, C with probability $q = 1 - p$, where $p > 1/2$.

If $D_n = 0$: allocate T with probability $1/2$, C with probability $1/2$.

If $D_n > 0$: allocate T with probability $q$, C with probability $p$,

where $p + q = 1$. The assignment rule balances the number of T’s and C’s by observing which group has fewer patients so far; that group then has probability greater than 1/2 of being assigned. Clearly, we must have $p > 1/2$; $p$ is called the bias and we write BCD($p$). Typical values for $p$ are:

$p = \frac{3}{5}$, which is adequate for large trials ($n > 100$);

$p = \frac{2}{3}$, which is useful for small trials;

$p = \frac{3}{4}$, which maintains strict control over balance, but is predictable.

Example: allocate 20 patients to T or C using a BCD($\frac{3}{5}$) and Table 5.2. Let $D_n = T_n - C_n$.


Scheme:

$D_n < 0$: 0 − 5 → T ($p = 0.6$); 6 − 9 → C ($p = 0.4$)

$D_n = 0$: 0 − 4 → T ($p = 0.5$); 5 − 9 → C ($p = 0.5$)

$D_n > 0$: 0 − 5 → C ($p = 0.6$); 6 − 9 → T ($p = 0.4$)

Using the 15th row of Table 5.2:

0 2 2 7 2 4 6 ...
T C T T C C C ...

[Finish this as an exercise.]
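A minimal sketch of BCD($p$) allocation (assumed for illustration, not from the notes; Python’s random numbers replace the digit table):

```python
# Sketch: biased coin design BCD(p) favouring the under-represented arm.
import random

def bcd_allocation(n_patients, p=0.6, seed=3):
    rng = random.Random(seed)
    n_T = n_C = 0
    out = []
    for _ in range(n_patients):
        D = n_T - n_C
        if D < 0:
            prob_T = p          # T is behind: favour T
        elif D > 0:
            prob_T = 1 - p      # T is ahead: favour C
        else:
            prob_T = 0.5        # balanced: toss a fair coin
        arm = "T" if rng.random() < prob_T else "C"
        out.append(arm)
        n_T += arm == "T"
        n_C += arm == "C"
    return out

seq = bcd_allocation(20)
print("".join(seq), seq.count("T"), seq.count("C"))
```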

The balancing properties of BCD:

Let $X_n = |D_n|$, that is, the absolute difference between the number of T’s and C’s after $n$ allocations.

Then $X_n$ forms a Markov chain with states $\{0, 1, 2, \ldots\}$.

Beginning at $X_0 = 0$ with probability 1, the transition probabilities are
$$P(X_{n+1} = x - 1 \mid X_n = x) = p, \qquad x = 1, 2, \ldots$$
$$P(X_{n+1} = x + 1 \mid X_n = x) = q, \qquad x = 1, 2, \ldots$$
$$P(X_{n+1} = 1 \mid X_n = 0) = 1.$$

These probabilities describe the evolution of the chain, which is called a random walk with a reflecting barrier at the origin.

• Based on the stationary distribution, one can show that for p = 3/5, there is a 1/20 chance of the imbalance being ≥ 10.

• For p = 3/4, the corresponding imbalance is 4.

• One can set pre-defined limits for imbalance.


• An extension: use simple randomization until the limit is exceeded, then introduce a biased coin allocation to correct the imbalance.

We often want to achieve balance within subgroups defined by important factors, such as age, sex, time in remission from leukaemia, ..., and so on. Two strategies for achieving such balance are:

• minimization, and
• stratification.

2.3.4 Minimization

• Minimization is a dynamic method of restricted randomization, and an extension of BCDs.

• It aims to minimize the imbalance in the numbers of patients allocated to T and C over a factor or factors known to affect prognosis, including centre or hospital in a multi-centre trial.

• Minimization does not directly address imbalance within subgroups defined by combining several factors simultaneously to form strata: see stratification. Minimization addresses imbalance in a ‘marginal’ way.

It works as follows: For a new patient, identify their levels of several important prognostic factors, for example, sex, age at diagnosis, clinical stage of disease at diagnosis, and so on. Call these categories.

For the $i$th category, observe that the new patient’s level already has $T_i$ patients on T and $C_i$ patients on C.

Define a discrepancy score, $S_i$ (this is also called a balancing function). Typically, we use one of
$$S_{1i} = C_i - T_i \quad \text{(range)},$$
$$S_{2i} = \frac{C_i - T_i}{C_i + T_i + 1},$$
$$S_{3i} = \begin{cases} 1 & \text{if } C_i > T_i \\ 0 & \text{if } C_i = T_i \\ -1 & \text{if } C_i < T_i. \end{cases}$$


The total discrepancy score is then
$$S = \sum_i w_i S_{ji} \quad \text{or} \quad S = \sum_i S_{ji}, \qquad j = 1, 2, \text{ or } 3,$$
where $w_i$ is the weight attached to each factor or category; $w_i$ is large if the $i$th factor is important, and small otherwise. Once the discrepancy score is determined, we allocate the new patient to T or C, whichever minimizes the overall imbalance, according to a biased coin with high bias ($p = 3/4$). [See handout by Gore on ‘Restricted randomization’ from the BMJ.]

Summary:

• Identify levels of important factors for the new patient, e.g., male, age 75, Caucasian, no history of lesions. Denote these by $i = 1, \ldots, n$.

• Choose the score: one of S1, S2, or S3. Let’s choose S1.

• Calculate $S = \sum_{i=1}^{n} S_{1i}$; then allocate the patient to T or C to reduce the imbalance.

Example: Simple mastectomy plus radiotherapy (T) versus radical mastectomy (C). (See handout by Gore.)
$$S = \sum_{i=1}^{3}(T_i - C_i) = (8 - 7) + (12 - 13) + (13 - 16) = 1 + (-1) + (-3) = -3$$

=⇒ choose T with high probability (p = 3/4), because allocation to T reduces |S|.


If you allocate the next patient to C, the imbalance gets worse:
$$S^* = (8 - 8) + (12 - 14) + (13 - 17) = -6.$$
On the other hand, if we allocate to T,
$$S^* = (9 - 7) + (13 - 13) + (14 - 16) = 0.$$

In this example, balance is characterized by the range of treatment totals, and the next allocation is selected by minimizing the sum of the ranges across the factors/categories.
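A minimal sketch of this scheme (assumed for illustration, not from the notes): it sums the range-type discrepancy over the new patient’s levels and applies a biased coin with $p = 3/4$. The factor names are hypothetical stand-ins; only the counts are taken from the worked example above.

```python
# Sketch: minimization with per-factor ranges |T_i - C_i| and a biased coin.
import random

def minimization_choice(counts, patient_levels, p=0.75, seed=4):
    """counts[factor][level] = {'T': nT, 'C': nC} for patients so far."""
    rng = random.Random(seed)
    def total_score(extra_arm):
        s = 0
        for f, lev in patient_levels.items():
            c = dict(counts[f][lev])
            c[extra_arm] += 1
            s += abs(c["T"] - c["C"])     # range-type discrepancy per factor
        return s
    better = "T" if total_score("T") <= total_score("C") else "C"
    return better if rng.random() < p else ("C" if better == "T" else "T")

# counts at the new patient's levels, from the example (factor names invented)
counts = {"factor1": {"x": {"T": 8,  "C": 7}},
          "factor2": {"y": {"T": 12, "C": 13}},
          "factor3": {"z": {"T": 13, "C": 16}}}
patient = {"factor1": "x", "factor2": "y", "factor3": "z"}
print(minimization_choice(counts, patient))   # T, with probability 3/4
```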

Minimization can be generalized to > 2 treatments.

Minimization: a more general scheme

Reference: Pocock and Simon (1975), Biometrics.

For $r$ treatments, $k = 1, \ldots, r$; $f$ factors, $i = 1, \ldots, f$; $l_i$ levels in factor $i$, $j = 1, \ldots, l_i$;

$t_{ijk}$ = number on treatment $k$ in the $j$th level of the $i$th factor;

$t^*_{ijk}(k)$ = the corresponding number (etc.) if treatment $k$ is allocated to the next patient;

$F_{ij}(t^*_{ijk}(k))$ is the balancing function, e.g., the range or variance. Then the overall balancing function is a weighted sum of the balancing functions of the individual factors:
$$B_k = \sum_i \sum_j w_i F_{ij}(t^*_{ijk}),$$
where again, the weights are assigned if necessary on the basis of the relative importance of the prognostic factors. Moreover, the biased coin probabilities are determined by the $B_k$. For example, rank the $r$ treatment assignments from least imbalance, $k = 1$, to most imbalance, $k = r$. Then take
$$p_1 > \frac{1}{r}, \qquad p_k = \frac{1 - p_1}{r - 1}, \quad k = 2, \ldots, r.$$

That is, the degree of randomization is inversely related to $p_1$. Note that the design is fully randomized if $p_1 = \frac{1}{r}$.

Remarks: Note that only the unique levels of each factor for the new patient are affected by the choice of $k$. We can show that the values of $B_k$ are especially easy to update and compute if the variance is used as the balancing function.

2.3.5 Stratification

• Stratification enables balance within strata defined by simultaneous combinations of factors.

• In essence, we generate a separate randomization list using RPB or BCD for each stratum.

• Random permuted blocks within strata is probably the most widely used method of randomization in clinical trials.

It is important to avoid too many strata ⇒ only use important factors. For example, a trial with 5 factors, each with 3 levels, gives $3^5 = 243$ distinct strata ⇒ blocking will be rendered ineffective unless the trial is very large. We can use minimization to avoid this problem.

Remarks: In larger trials, we usually ignore blocking and stratification in the primary analysis. The reason is that the extra complexity is not worth the (small) gain in expected power. However, this is a topic of debate: stratification makes treatment groups more alike, which in turn implies that the treatment estimate is more precise; but the variance estimate is positively biased, and this implies that tests are conservative. [See the Encyclopedia of Biostatistics for further discussion.]

2.3.6 Randomization tests

Randomization provides the basis for classes of non-parametric (or distribution-free) tests called randomization tests, which are finding increasing popularity in clinical trials applications.

Suppose there are $2m$ patients to be randomized to two equal-sized groups: $m$ randomized to A, $m$ randomized to B.

How many possible permutations are there? $(2m)!$ How many possible distinct permutations?
$$\binom{2m}{m} = \frac{(2m)!}{(m!)^2}$$

We use randomization to choose one of the (2m)! possible designs, all equally likely. Consider the null hypothesis of no treatment difference

$$H_0: \mu_A = \mu_B,$$
i.e., the expected response for any individual is the same, whichever treatment they receive.

We view $H_0$ in reference only to the $2m$ patients randomized (this is known as a deterministic hypothesis). It follows that for any of the possible designs, we can find exactly the observations that would have been obtained, simply by permuting the data. Thus for any test statistic, we can find the exact null hypothesis distribution. Here, we take all permutations of the $2m$ observations and treat them as equally likely. These are known as re-sampling-based methods.

Example: The two-sample permutation t-test. Suppose we observe responses

$$x = (x_1, \ldots, x_{2m}).$$
Under the null hypothesis and the randomization distribution, all permutations of $x$ are equally likely. The optimal normal-theory test statistic for comparing two groups (assuming continuous responses) is the two-sample t-statistic:
$$T = \frac{\bar{X}_{A.} - \bar{X}_{B.}}{s_p\sqrt{2/m}},$$
where $s_p$ is the (sample) pooled standard deviation. Recall: we obtain this form of the t-statistic by assuming equal group variances
$$\sigma_A^2 = \sigma_B^2 = \sigma^2,$$
so that the standard error of the target difference $\bar{X}_{A.} - \bar{X}_{B.}$ is
$$\sqrt{\sigma^2\left(\frac{1}{m} + \frac{1}{m}\right)} = \sigma\sqrt{\frac{2}{m}}.$$
We estimate $\sigma^2$ by the pooled sample variance, $s_p^2$. Now suppose the observed value of $T$ for the data $x$ is
$$t_{\text{obs}} = \frac{\bar{x}_{A.} - \bar{x}_{B.}}{s_p\sqrt{2/m}}.$$

Permutation test procedure:

• For each permutation of the data x, calculate the test statistic (here, the two-sample t-statistic).

• Take the first $m$ values in the permutation to be ‘group A’; take the second set of $m$ values to be ‘group B’.

• One of the values of t obtained in this way corresponds to the observed data values tobs.

• And we know
$$P(T = t_{\text{obs}}) = \frac{1}{(2m)!},$$
provided all the values of $T$ are distinct.

For simplicity, we will consider testing the one-sided alternative hypothesis

$$H_A: \mu_A > \mu_B,$$
i.e., large values of $t$ are evidence against $H_0$.

We therefore calculate the permutation $P$-value corresponding to $t_{\text{obs}}$:
$$P(T \geq t_{\text{obs}}) = \frac{k(x)}{(2m)!},$$
where $k(x)$ is the number of permutations of $\{1, \ldots, 2m\}$ giving values of the test statistic $\geq$ the observed value $t_{\text{obs}}$.

For a level $\alpha$ test, reject $H_0$ if $t_{\text{obs}}$ is among the largest positive $100\alpha\%$ of permutation values.

Recall: $\alpha = P(\text{reject } H_0 \mid H_0 \text{ true})$ = the Type I error probability.

Notes:

(1) We need only consider all $\binom{2m}{m}$ possible distinct permutations of the data (i.e., distinct permutations are equivalence classes with the same value of the test statistic; $k(x)$ is modified accordingly).

(2) If $m$ is large, the permutation procedure will involve substantial computation, and we can then take a random sample of permutations. This leads to an approximate permutation distribution, as in the sketch below.
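A minimal sketch of that approximate procedure (assumed for illustration, not from the notes; the response values are hypothetical):

```python
# Sketch: approximate two-sample permutation t-test by random re-splits.
import numpy as np

def perm_t_test(x_A, x_B, n_perm=10_000, seed=5):
    rng = np.random.default_rng(seed)
    m = len(x_A)
    pooled = np.concatenate([x_A, x_B])

    def t_stat(a, b):
        sp2 = (np.var(a, ddof=1) + np.var(b, ddof=1)) / 2   # pooled variance, equal m
        return (a.mean() - b.mean()) / np.sqrt(2 * sp2 / m)

    t_obs = t_stat(np.asarray(x_A, float), np.asarray(x_B, float))
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        count += t_stat(perm[:m], perm[m:]) >= t_obs    # one-sided H_A: mu_A > mu_B
    return t_obs, count / n_perm                        # approximate P-value

t, p = perm_t_test([5.1, 6.0, 5.8, 6.3], [4.9, 5.2, 4.7, 5.5])
print(t, p)
```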

Remark: Biased coin designs affect the validity of standard permutation tests, because the allocation sequences are not equiprobable. One can simulate the correct null hypothesis distribution, though.

2.3.7 Randomized consent designs

• See the Altman and Gore Handout from the BMJ.

• Due to Zelen (1979).

• Different rationale to other trial designs.

• The results are analysed by ‘intention to treat’.

→ compare the original randomized groups:

group 1: mean $\mu$;  group 2: mean $\mu + \frac{\Delta}{2}$.

This comparison is weaker than $\mu$ compared with $\mu + \Delta$.


Figure 1: Standard design and Randomized consent design

• Design only works if a high proportion of Group 2 consent to the new treatment → otherwise we obtain an inefficient estimator of ∆.

• Used in criminology.

• Randomized consent designs have been used in medical trials, but heav- ily criticised.

• Unethical?

2.4 Trial size

Key references: Armitage, Berry and Matthews, §6.6.

2.4.1 Introduction

A key issue in the design of any experiment is the number of observations to be made.


The two extremes of design:

Fixed trial size:
– number of patients specified at the outset, e.g., the AZT trial
– most common

Fully sequential trial (adaptive design):
– enrollment and observation continue until a stopping boundary is crossed
– not widely used
– often impractical
– stopping rule can be complex
– trial continues if no difference (known as futility trials)

In between these two extremes are group sequential methods, for which

• interim analyses are performed on the accumulating data, with accurate control of Type I error;

• these designs are popular, flexible, and relatively easy to conduct;

• and they have been developed in recognition of the fact that large multi-centre trials are usually subject to regular analyses by the Data Safety Monitoring Board.

In determining the appropriate trial type and size, we usually have to strike a bal- ance between

• the cost per patient, and
• the increase in precision for each additional patient.

Ultimately, this usually comes down to a matter of judgement. Nevertheless, it is highly desirable to make an advance, approximate calculation of the likely precision to be achieved in the trial. This protects against wasting resources on achieving unnecessary precision, or more commonly, against undertaking a trial of low power/precision, from which useful conclusions are unlikely.


2.4.2 Fixed trial size (non-sequential analysis)

The basic principle is to decide on the precision to be aimed at in detecting a treatment difference (or contrast), then to relate the sample size to this. Consider quantitative (continuous) outcomes, for two independent groups. Let the responses on A and B be

group A: $X_{A1}, X_{A2}, \ldots, X_{An}$ i.i.d. $N(\mu_A, \sigma_A^2)$,

group B: $X_{B1}, X_{B2}, \ldots, X_{Bn}$ i.i.d. $N(\mu_B, \sigma_B^2)$,

with all observations independent. Consider the problem of testing the null hypothesis of no treatment difference:

$$H_0: \mu_A = \mu_B$$
against the two-sided alternative
$$H_A: \mu_A \neq \mu_B,$$
or, the problem of estimating the true treatment difference
$$\delta = \mu_A - \mu_B.$$

The treatment difference $\delta = \mu_A - \mu_B$ is called the target quantity. The natural estimator for $\delta$ is $\bar{X}_{A.} - \bar{X}_{B.}$, which has (true) standard error
$$\sqrt{\frac{\sigma_A^2}{n} + \frac{\sigma_B^2}{n}}.$$

When $\sigma_A^2 = \sigma_B^2 = \sigma^2$, the standard error is
$$\sigma\sqrt{\frac{2}{n}}.$$

We consider two approaches to determining n, the number of patients required in each group:


(i) Fix the standard error, and solve for $n$. We specify in advance that the standard error of the difference must not exceed $\epsilon$, i.e., s.e. $\leq \epsilon$. Then
$$\sigma\sqrt{\frac{2}{n}} \leq \epsilon \implies n \geq \frac{2\sigma^2}{\epsilon^2}.$$

Clearly, as $n$ increases, the s.e. decreases, and the precision increases. Note that specifying $\epsilon$ determines the shape of the distribution of the observed difference, $D = \bar{X}_{A.} - \bar{X}_{B.}$. See Figure 2.

Figure 2: Different shapes for the distribution of D

(ii) Power calculations: these involve finding n for given power against a specified alternative hypothesis. This is the most common approach to estimating trial size.

We specify δ = δ1, which is the smallest difference of clinical importance that we would not want to overlook. Consider now the problem of testing the null hypothesis of no treatment difference:

$$H_0: \delta = 0 \quad \text{versus} \quad H_A: \delta = \delta_1,$$


where $\delta = \mu_A - \mu_B$. If $\sigma^2$ is known, use the z-test:
$$Z = \frac{(\bar{X}_{A.} - \bar{X}_{B.}) - 0}{\sigma\sqrt{2/n}}.$$

Under $H_0$, $Z \sim N(0, 1)$.

To test $H_0$ at the $2\alpha$ level of significance, we use the rule: reject $H_0$ if $|Z| \geq z_\alpha$ (see Figure 3).

Figure 3: Standard normal distribution

Recall,
$$2\alpha = \text{Type I error probability} = P_{H_0}(\text{reject } H_0 \mid H_0 \text{ true}).$$

Consider the alternative hypothesis $H_A: \delta = \delta_1$; then
$$\beta = \text{Type II error probability} = P_{H_A}(\text{retain } H_0 \mid H_A \text{ true}).$$

The power of a test against a specified alternative hypothesis is
$$1 - \beta = P(\text{reject } H_0 \mid H_A \text{ true}).$$

We want procedures to find n to achieve specified Type I and Type II errors.


Figure 4: Sample size power calculation

Now, for $\sigma^2$, $\alpha$, $\delta$ given, $\beta$ is a function of $n$. Thus, for trial sample size calculations, we specify $\beta$, then solve for $n$; Figure 4 gives an outline.

$C_1$ is what the critical value must be to achieve the desired significance level, $2\alpha$; $C_2$ is the critical value needed to achieve the desired power, $1 - \beta$. We cannot both reject and retain $H_0$ at the same time, so the sample size must be such that $C_1 \leq C_2$.

Consider the difference $D = \bar{X}_{A.} - \bar{X}_{B.}$. Under $H_0$, $D \sim N(0, 2\sigma^2/n)$, and
$$C_1 = z_\alpha\, \sigma\sqrt{\frac{2}{n}}.$$

We ignore the probability of $Z < -z_\alpha$ as being very small when $\mu_A - \mu_B = \delta_1$. Now, under $H_A$, $D \sim N(\delta_1, 2\sigma^2/n)$, so
$$C_2 = \delta_1 - z_\beta\, \sigma\sqrt{\frac{2}{n}}.$$

Note that the power is one-sided: in practice, it is not desirable to reject H0 in favour of δ1 < 0 when actually δ1 > 0, as this would imply a recommendation of the inferior of the two treatments.


Thus, for $C_1 \leq C_2$, we must have
$$z_\alpha \sigma\sqrt{\frac{2}{n}} \leq \delta_1 - z_\beta \sigma\sqrt{\frac{2}{n}}$$
$$\Rightarrow \delta_1 \geq \sigma\sqrt{\frac{2}{n}}\,(z_\alpha + z_\beta)$$
$$\Rightarrow \delta_1^2 \geq \sigma^2\,\frac{2}{n}\,(z_\alpha + z_\beta)^2$$
$$\Rightarrow n \geq 2\,\frac{\sigma^2}{\delta_1^2}\,(z_\alpha + z_\beta)^2,$$
where $n$ is the minimum number of patients required in each group. Increasing $n$ will better separate the two distributions, or equivalently, increase the power. What considerations will push the sample size up?

A more formal argument: Consider the standardized test statistic
$$Z = \frac{\bar{X}_{A.} - \bar{X}_{B.}}{\sigma\sqrt{2/n}}.$$

We can write this as
$$Z = \frac{\sum_{i=1}^{n} X_{Ai} - \sum_{i=1}^{n} X_{Bi}}{\sqrt{2n\sigma^2}} \sim N\!\left(\frac{\mu_A - \mu_B}{\sqrt{2\sigma^2/n}},\ 1\right).$$

Under $H_0$ ($\mu_A = \mu_B$): $Z \sim N(0, 1)$.

Under $H_A$: $Z \sim N\!\left(\pm\delta_1\sqrt{\dfrac{n}{2\sigma^2}},\ 1\right)$.

Consider the positive alternative:

µA − µB = δ1

and again ignore the very small probability that Z < −zα.


Thus
$$\beta = P_{H_A}(Z \leq z_\alpha) = P_{H_A}\!\left(Z - \delta_1\sqrt{\frac{n}{2\sigma^2}} \leq z_\alpha - \delta_1\sqrt{\frac{n}{2\sigma^2}}\right),$$
where $Z - \delta_1\sqrt{n/(2\sigma^2)} \sim N(0, 1)$ under $H_A$,
$$\Rightarrow z_\alpha - \delta_1\sqrt{\frac{n}{2\sigma^2}} = -z_\beta.$$
Rearranging gives
$$n = (z_\alpha + z_\beta)^2\,\frac{2\sigma^2}{\delta_1^2}.$$
To achieve power $\geq 1 - \beta$ (i.e., better separation) we must have
$$z_\alpha - \delta_1\sqrt{\frac{n}{2\sigma^2}} \leq -z_\beta \implies n \geq (z_\alpha + z_\beta)^2\,\frac{2\sigma^2}{\delta_1^2},$$
where $n$ is the minimum number of patients required in each group.

• We want a high value of power, which can only be controlled at the design stage. (Note that we can control $\alpha$ during the analysis by choice of significance level.)

• If $\sigma$ is estimated, we can use the t-distribution.

Example: We propose to study lung function in two groups of men. The response is FEV, forced expiratory volume (in ml). From previous studies, we know that

σ = 0.5 (ml)

The minimum ‘clinically significant’ difference is δ1 = 0.25. Use a two-sided test, 2α = 0.05, and power 80% (this is a typical power assumption). The question: how many men are required in each group?

α = 0.025, zα = 1.96

β = 0.2, zβ = 0.842


So
$$n \geq 2(1.96 + 0.842)^2 \left(\frac{0.5}{0.25}\right)^2 = 62.8,$$
i.e., the minimum trial size that achieves 80% power is 63 men in each group.

What if we want 95% power? Check as an exercise: we would need at least 104 men in each group.
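A minimal sketch verifying this calculation from $n \geq 2(\sigma/\delta_1)^2(z_\alpha + z_\beta)^2$ (assumed for illustration, not from the notes):

```python
# Sketch: per-group sample size for a two-sided z-test comparison of means.
import math
from statistics import NormalDist

def n_per_group(sigma, delta1, two_alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf
    z_alpha = z(1 - two_alpha / 2)       # 1.96 for 2*alpha = 0.05
    z_beta = z(power)                    # 0.842 for 80% power
    return math.ceil(2 * (sigma / delta1) ** 2 * (z_alpha + z_beta) ** 2)

print(n_per_group(0.5, 0.25, power=0.80))  # 63
print(n_per_group(0.5, 0.25, power=0.95))  # 104
```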

Note that $C_1$ is what the critical value must be to achieve the desired significance level $2\alpha$; $C_2$ is the critical value needed to achieve the desired power, $1 - \beta$. The sample size must be such that $C_1 \leq C_2$. The above arguments generalize to other types of outcomes, and the principles are quite general. In particular, the normal approximation works well for a wide variety of situations. Consider
$$C_1 = z_\alpha\, \text{s.e.}_{H_0}(\bar{X}_{A.} - \bar{X}_{B.}), \qquad C_2 = \delta - z_\beta\, \text{s.e.}_{H_A}(\bar{X}_{A.} - \bar{X}_{B.}).$$

Note that the standard errors under the null and alternative hypotheses can be different, e.g., if comparing two proportions. Further remarks on comparing 2 means:

(1) If $\sigma_A^2 \neq \sigma_B^2$, we can approximate $\sigma^2$ by their average.

(2) If we know $\sigma_A^2$, $\sigma_B^2$ (with $\sigma_B^2 > \sigma_A^2$), we can let $n_A = n$, $n_B = kn$, for some $k$. The standard error is then
$$\sqrt{\frac{\sigma_A^2}{n} + \frac{\sigma_B^2}{kn}},$$
and we can use this to determine the sample size [via $C_1 \leq C_2$] with possibly unequal group sizes.

Trial size for a binary response:


We now consider qualitative outcomes, and in particular, a binary response $Y$, e.g., yes/no, survived/not survived.

Assume we have two groups A, B, and assume there may be nA subjects on A, and nB subjects on B (in the first instance, we are allowing the group sizes to differ).

Suppose we observe Xj ‘successes’ in group j, with j = A, B. Under the usual independence assumptions,

Xj ∼ B(nj, πj) independently, i.e., XA,XB are independent binomial random variables. We want to compare two binomial proportions, i.e., to make inference on

πA − πB (the target quantity).

Under the above assumptions,
$$P_j = \frac{X_j}{n_j}$$
has
$$E(P_j) = \pi_j, \qquad \mathrm{Var}(P_j) = \frac{\pi_j(1 - \pi_j)}{n_j}.$$

Hence, we use $P_A - P_B$, the observed difference in proportions, to estimate $\pi_A - \pi_B$, where
$$\text{s.e.}(P_A - P_B) = \sqrt{\frac{\pi_A(1 - \pi_A)}{n_A} + \frac{\pi_B(1 - \pi_B)}{n_B}}.$$

The hypotheses are:

$$H_0: \pi_A = \pi_B \quad \text{versus} \quad H_A: \pi_A \neq \pi_B.$$

We use the usual approximate normal-theory test based on

$$Z = \frac{P_A - P_B}{\sqrt{\dfrac{P_A(1 - P_A)}{n_A} + \dfrac{P_B(1 - P_B)}{n_B}}} \quad \text{(Wald test)},$$
which is approximately $N(0, 1)$ under $H_0$.

Sample size calculation:

Assume nA = nB = n (i.e., equal group sizes), then solve for n in C1 ≤ C2:

$$z_\alpha\, \text{s.e.}_{H_0}(P_A - P_B) \leq \delta_1 - z_\beta\, \text{s.e.}_{H_A}(P_A - P_B).$$


However, it is typically not the case that either standard error can be estimated in advance in a simple manner. We will look at 4 approaches:

(1) We know that if $0 < \pi_i < 1$, then
$$0 < \pi_i(1 - \pi_i) \leq \frac{1}{4}.$$
So as a conservative approach, appropriate when there is little or no information available, use
$$\text{s.e.}(P_A - P_B) \leq \sqrt{\frac{1}{2n}}.$$

(2) Sometimes we have available a ‘prior’ estimate for the probability of success in the control group, $\pi_B$. Call this $\pi_B^*$.

Then under $H_0: \pi_A = \pi_B$, the s.e.$(P_A - P_B)$ is estimated by
$$\text{s.e.}^*_{H_0} = \sqrt{\frac{2\pi_B^*(1 - \pi_B^*)}{n}}.$$

Under $H_A$, we still estimate $\pi_B$ by $\pi_B^*$, and use $\pi_A(1 - \pi_A) \leq \frac{1}{4}$. Then
$$\text{s.e.}^*_{H_A} = \sqrt{\frac{\pi_B^*(1 - \pi_B^*) + \frac{1}{4}}{n}}.$$

(3) A further improvement on (2): observe that under $H_A: \pi_A - \pi_B = \delta_1$, we have $\pi_A = \delta_1 + \pi_B$. So if $\pi_B^*$ is available, let
$$\pi_A^* = \delta_1 + \pi_B^*$$
and use
$$\text{s.e.}^*_{H_0} = \sqrt{\frac{2\pi_B^*(1 - \pi_B^*)}{n}}$$
as above, and
$$\text{s.e.}^*_{H_A} = \sqrt{\frac{(\pi_B^* + \delta_1)(1 - \pi_B^* - \delta_1) + \pi_B^*(1 - \pi_B^*)}{n}} = \sqrt{\frac{\pi_A^*(1 - \pi_A^*) + \pi_B^*(1 - \pi_B^*)}{n}}.$$


(4) If prior estimates are available under H0 : πA = πB = π, say, then use π = the pooled, or average proportion.

Continuity correction: For small samples, a continuity correction is used to obtain a more accurate assessment of significance from the asymptotic normal distribution for $Z$. Fleiss recommends adding
$$\frac{2}{|\pi_A - \pi_B|} = \frac{2}{|\delta_1|}$$
to the sample size to allow for this. [See Fleiss (1980), or the 2nd edition of his book.]
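A minimal sketch of approach (3) (assumed for illustration, not from the notes; the values $\pi_B^* = 0.30$ and $\delta_1 = 0.15$ are hypothetical), solving $C_1 \leq C_2$ for $n$:

```python
# Sketch: per-group size for two proportions with different s.e. under H0 and HA.
import math
from statistics import NormalDist

def n_binary(pi_B, delta1, two_alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf
    z_a, z_b = z(1 - two_alpha / 2), z(power)
    pi_A = pi_B + delta1
    se0 = math.sqrt(2 * pi_B * (1 - pi_B))                  # sqrt(n) * s.e. under H0
    seA = math.sqrt(pi_A * (1 - pi_A) + pi_B * (1 - pi_B))  # sqrt(n) * s.e. under HA
    # C1 <= C2  =>  n >= ((z_a*se0 + z_b*seA) / delta1)^2
    return math.ceil(((z_a * se0 + z_b * seA) / delta1) ** 2)

print(n_binary(pi_B=0.30, delta1=0.15))   # patients per group, illustrative values
```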

Some general remarks:

(1) Sample size calculations are possible for more complex outcomes

e.g., survival time, correlations.

These often require specialized software.

(2) Sample size calculations are only a guide to the order of magnitude required;

- they use ‘inputs’ that are subject to unquantifiable errors;
- one can also make adjustments for dropouts, noncompliance, etc.

(3) As a general rule, try to obtain as many observations as possible to estimate the treatment difference!

2.4.3 Sequential Trials

Key references: Chapter 18 of Armitage et al. (2002); Jennison and Turnbull (2000); Whitehead (2002). Sequential trials have a long history in medical trials, and in industrial processes. Fully sequential trials, which require that the data be analysed after each outcome has been observed, are not always practical in the medical context (a quick response is needed, and storing and updating the information can become a major task).


Consider a simple case where we want to achieve a specified precision in estimating $\mu$.

Suppose we observe a random sample of size n from a distribution (µ, σ2).

The estimated mean is x¯n, and the estimated standard deviation is sn.

Therefore, the estimated s.e. of $\bar{x}_n$ is
$$\text{s.e.}(\bar{x}_n) = \frac{s_n}{\sqrt{n}}.$$
This will tend to decrease as $n$ increases. Suppose we require that s.e.$(\bar{x}_n) < \epsilon$.

Then a stopping rule will be: continue sampling until $s_n/\sqrt{n}$ first falls below $\epsilon$. This is an example of an adaptive design, where the stopping rule is based on the outcomes as they emerge. (One can show that if the sampling is repeated many times, the usual confidence intervals are approximately valid.)
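A minimal simulation sketch of this stopping rule (assumed for illustration, not from the notes; the normal data and $\epsilon = 0.1$ are arbitrary choices):

```python
# Sketch: sample until the estimated standard error s_n / sqrt(n) falls below eps.
import numpy as np

def sample_until_precise(eps, mu=0.0, sigma=1.0, n_min=5, seed=6):
    rng = np.random.default_rng(seed)
    x = list(rng.normal(mu, sigma, size=n_min))    # need a few points to estimate s_n
    while np.std(x, ddof=1) / np.sqrt(len(x)) >= eps:
        x.append(rng.normal(mu, sigma))
    return np.mean(x), len(x)

xbar, n = sample_until_precise(eps=0.1)
print(xbar, n)   # roughly n ~ (sigma / eps)^2 = 100 observations
```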

Group sequential methods: Pocock’s Test

Key references: Jennison and Turnbull (2000); Chapter 18, Armitage, Berry & Matthews (2002).

The whole approach is based on the idea of a ‘repeated significance test’ (RST). The situation is very similar to that of adjusting for multiple comparisons: the aim is to control the Type I error, here termed the ‘overall significance level’ (viz. $2\alpha = 0.05$). Clearly, to control the overall significance level at a low value, a much more stringent significance level (i.e., a lower probability) is required at each stage. Pocock’s test is the simplest group sequential plan. It uses a constant nominal significance level to analyse the data a small number of times over the course of the study (i.e., $2\alpha'$ is smaller than $2\alpha$).

Pocock’s Test: an example of a ‘RST’ plan:

As pointed out above, the idea is to control the Type I error at a constant level. So, what value of $2\alpha'$ should be chosen? The answer depends on $K$, the number of times we analyse and test the data. Patient entry is divided into $K$ groups, each with $m$ patients on each treatment. The data are then analysed after each group of $2m$ responses.


Assume treatment allocation within each group is random, and assume two treatments A, B, with responses
$$X_{Ai} \sim N(\mu_A, \sigma^2), \qquad X_{Bi} \sim N(\mu_B, \sigma^2), \qquad i = 1, 2, \ldots$$

Use the standardized statistic after each group of observations: at the $k$th test, define
$$Z_k = \frac{\bar{X}_{A.} - \bar{X}_{B.}}{\sqrt{2\sigma^2/(mk)}} = \frac{1}{\sqrt{2mk\sigma^2}}\left(\sum_{i=1}^{mk} X_{Ai} - \sum_{i=1}^{mk} X_{Bi}\right), \qquad k = 1, \ldots, K.$$

Formally, the stopping rule is as follows. After group $k = 1, \ldots, K - 1$:

if $|Z_k| \geq C_p$, stop and reject $H_0$;
otherwise, continue to group $k + 1$.

After group $K$:

if $|Z_K| \geq C_p$, stop and reject $H_0$;
otherwise, stop and retain $H_0$,

where $C_p = C_p(K, \alpha)$ is the critical value, a constant; it is calculated to give an overall Type I error of $2\alpha$. That is,
$$P_{H_0}(\text{reject } H_0 \text{ at analysis } k = 1, \text{ or } k = 2, \ldots, \text{ or } k = K) = 2\alpha$$

(e.g., 2α = 0.05).

For this probability, we use the joint distribution of the sequence $Z_1, \ldots, Z_K$ [see Ch. 19, Jennison & Turnbull]. These are not independent, in that the distribution of each $Z_k$ depends on $Z_{k-1}$; we need to use numerical integration (usually Gaussian quadrature) to evaluate the distributions.
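As a check on such calculations, a quick Monte Carlo sketch (assumed for illustration, not from the notes) can approximate the overall Type I error directly: under $H_0$, the $Z_k$ are scaled partial sums of independent stage increments.

```python
# Sketch: Monte Carlo estimate of the overall Type I error of Pocock's test.
import numpy as np

def pocock_type1(K=5, Cp=2.413, n_sim=200_000, seed=7):
    rng = np.random.default_rng(seed)
    # under H0, each stage contributes an independent N(0,1) increment
    increments = rng.standard_normal((n_sim, K))
    S = np.cumsum(increments, axis=1)
    Z = S / np.sqrt(np.arange(1, K + 1))       # Z_k = S_k / sqrt(k)
    return np.mean((np.abs(Z) >= Cp).any(axis=1))

print(pocock_type1())   # approximately 0.05 for K = 5, Cp = 2.413
```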


Note that $K = 1$ is the non-sequential (fixed sample) case. For Pocock’s test, we usually take $K \leq 5$.

For example, if $K = 5$ and $2\alpha = 0.05$, then we can show $C_p = 2.413$, so that the nominal significance level applied at each analysis is (see Figure 5)
$$2\alpha' = 2\{1 - \Phi(2.413)\} = 0.0158.$$

If K = 1, 2α = 0.05, Cp =?

Figure 5: Pocock’s Test

The power requirement,

$$P_{H_A}(\text{reject } H_0) = 1 - \beta$$
determines the group size.

The general scenario:

Suppose we are interested in the parameter $\theta$ (e.g., $\mu_A - \mu_B$), and the null hypothesis $H_0: \theta = 0$, and want high power against the alternatives
$$H_A: \theta = \theta_1 \quad \text{or} \quad \theta = -\theta_1.$$

Assume the Type I and II errors are as before. (Here $\delta = \mu_A - \mu_B$ and $\hat{\delta} = D = \bar{X}_{A.} - \bar{X}_{B.}$.)

Suppose that:


(1) At any stage, we can estimate $\theta$ by $\hat{\theta}$ (the maximum likelihood estimator), and its variance by $\mathrm{Var}(\hat{\theta})$.

(2) We inspect the accumulating data at intervals (up to $K$ times) such that at the $k$th test,
$$\frac{1}{\mathrm{Var}(\hat{\theta})} = k\mathcal{I},$$
where $\mathcal{I}$ is called the ‘Fisher information’, and the information increases by $\mathcal{I}$ between successive tests.

Recall, when comparing two normal means, that
$$\mathrm{Var}(\hat{\theta}) = \mathrm{Var}(\hat{\delta}_1) = \frac{2\sigma^2}{km},$$
and the mean under $H_A$ is
$$\delta_1\sqrt{\frac{mk}{2\sigma^2}} = \delta_1\sqrt{k\mathcal{I}}$$
at the $k$th test.

Example: A trial comparing the effects of Vitamin D supplements and a control treatment in pregnant women. (From Armitage and Berry, third edition, using Table 15.6; in what follows, we use $\delta_1$ for $\mu_1$ in A&B, and $\delta_1 = \mu_1\sqrt{m}/\sigma$.) The response is the infant’s calcium concentration (in mg/100ml), 6 days after birth. The standard deviation ($\sigma$) of the response was 1.2; and the investigators required to detect a change in [Ca] = 0.3 ($= \delta_1$). It was assumed that $2\alpha = 0.05$, $1 - \beta = 0.95$, and the investigators intended to inspect the data $K = 3$ times. The question: how many women should be included in each group? Let there be $m$ women on each of the two treatments A, B in each group.


After the $k$th stage ($k = 1, 2, 3$),
$$\mathrm{Var}(\hat{\delta}_1) = \mathrm{Var}(D) = \frac{2\sigma^2}{mk}, \qquad D = \bar{X}_{A.} - \bar{X}_{B.},$$
so the tabulated standardized difference satisfies
$$2.22 = \delta_1\sqrt{\frac{m}{2\sigma^2}}, \quad \text{i.e.,} \quad 2.22 = 0.3\sqrt{\frac{m}{2(1.44)}},$$
$$m = 2(1.44)\,\frac{2.22^2}{0.3^2} = 157.7 \to 158,$$
and
$$158 \times 2 \times 3 = 316 \times 3 = 948.$$

Thus, the total number of patients would be 948 unless the trial stopped after one of the two interim analyses. Each interim test would be conducted at the 2.2% significance level. [See Table 2.1 from Jennison & Turnbull.]
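A minimal sketch checking the arithmetic (assumed for illustration, not from the notes):

```python
# Sketch: group size for the Vitamin D example using the tabulated constant.
import math

tabulated = 2.22            # from A&B Table 15.6 (K = 3, 2*alpha = 0.05, power 0.95)
sigma, delta1, K = 1.2, 0.3, 3

m = (tabulated / delta1) ** 2 * 2 * sigma ** 2
print(m)                            # 157.7 -> 158 women per treatment per group
print(2 * math.ceil(m) * K)         # 948 in total if the trial runs to completion
```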

Remarks:

(1) The maximum total number of subjects that may be required is greater than that for a fixed sample test. But there is benefit in the group sequential plan if $\delta_1$ is large.

(2) There are many forms of stopping rule. The most popular is the O’Brien-Fleming scheme, in which it is more difficult to reject $H_0$ early on. At the end of the trial, $C_B(K, \alpha)$ is closer to the non-sequential value (see handout).

(3) There is a need to be flexible, and the group sequential schemes provide this.

(4) There are other methods for ‘spending’ Type I error [see Whitehead (2002) and EoB (1998 or 2005)].

We can also reduce the required sample size using alternative designs, and we consider some important trial types in the following sections.


2.5 Crossover trials

2.5.1 Introduction

Motivation: a crossover design can produce the same standard error for the target quantity of interest with fewer patients. But this comes at the price of requiring additional assumptions.

Scenario: each patient receives each treatment in sequence, and the order in which the treatments are given is randomized. For example, with two treatments A, B:

Group 1: A then B
Group 2: B then A

Patients are randomized to Group 1 or to Group 2. This is known as the 2 × 2 crossover trial (2 treatments and 2 periods). We can see that this design leads to ‘within-patient’ comparisons as well as to ‘between-patient’ comparisons. The repeated measures feature of the design means that we need fewer patients.

In the standard parallel-group design, treatment comparisons are based on ‘between-patient’ information. In a cross-over design, important differences (i.e., between treatments) are made on a ‘within-patient’ basis; the main aim of the cross-over trial is therefore to remove from the treatment comparisons any component that is related to differences between the individuals.

Key references for crossover trials:

• Armitage et al (2002)

• Altman (1991)

• *Jones & Kenward (2003)

• *Senn (2002)

EoB article [see Course Information handout for details].

Illustrative example : A heroin trial as a 4×4 crossover design.


months:    0    3      6       9       12
Group 1:   H    H+M    M       choice
Group 2:   M    H      choice  H+M
...

The aim of this trial design was to improve individual injecting drug-user compliance (everybody in the trial would get some heroin, and no placebo). But there are many other problems in considering such a design. For instance, changing treatment regimes for drug users every three months would be disruptive, and it can be argued that it would be unethical to disturb the treatment of drug users who are stabilised on treatment, etc. Moreover, a high number of drop-outs are likely in such a trial, and this would lead to selection bias.

Crossover designs are especially useful for evaluating treatments for stable chronic conditions where the short-term effects of the treatments are of interest, e.g., trials of anti-hypertensive drugs, or treatments for asthma in chronic sufferers. Cross-over designs are widely used in early phase (I, II) drug trials, and in bioequivalence trials. But they are not appropriate for assessment of long-term conditions, such as the long-term treatment of HIV/AIDS patients, or for assessing survival following a diagnosis of cancer.

Figure 6: Scheme for 2 × 2 crossover trial

Layout for 2 × 2 crossover design: See Figure 6 for the scheme.

Notation:


Observation $y_{ijk}$: $i$th group ($i = 1, 2$; G1 receives AB, G2 receives BA), $j$th patient within group $i$ ($j = 1, \ldots, n_i$), $k$th period ($k = 1, 2$). Here $n_1$ = number of patients in Group 1 and $n_2$ = number of patients in Group 2; $n_1$ is not necessarily equal to $n_2$.

Induction stage:

• acceptance of eligible patients into study

• random allocation to Group 1 or Group 2.

Run-in period: desirable, but not always possible

• the idea is to allow the effects of previous medication to dissipate, and the patient’s disease state to stabilize

• baseline measurement zij1 (pretest measurement) taken here.

Period 1 (P1):

• patients receive treatment and response yij1 observed

• often yij1 is the average, or maximum, response taken near end of period.

Washout Period:

• again desirable, but not always feasible

• idea: come off P1 treatment and patient’s condition returns to pre-treatment level

• zij2 observed, useful to compare with baseline zij1

• aim is to avoid carry-over effect of P1 treatment into P2.

Period 2 (P2):

• patients receive other treatment, and response yij2 observed

• same comments as for P1.


There are major problems/effects which are important and affect the model formulation and interpretation. These are:

(1) Period effect: this is a systematic effect going from time period 1 to 2; e.g., the patient’s condition may deteriorate over time.

(2) Carry-over effects: especially of the drug in P1 carried over to P2. May also be psychological, or other effects.

(3) (Direct) Treatment × period interaction: e.g., responses on A differ in P1, P2, but responses on B do not (can be due to carry-over); e.g., a period effect may be explained biologically by a common carry-over effect. In fact, we cannot distinguish between differential carry-over and other types of interaction in the 2 × 2 design.

(4) Withdrawals/dropouts: can affect balance and results, especially in higher-order designs → can lead to selection bias.

We incorporate (1)-(3) formally into the model.

2.5.2 Model formulation

We assume

Y_{ijk} = η_{ijk} + S_{ij} + e_{ijk}

where η_{ijk} is the fixed systematic component (see below), and the random components are

S_{ij}, the subject effect, i.i.d. N(0, σ_S^2), and
e_{ijk}, the measurement error, i.i.d. N(0, σ_e^2), independent of the S_{ij}.

To model η_{ijk}:

Grand mean: µ

Treatment effect τ (= B − A):
    treatment A: 0     treatment B: τ

c School of Mathematical Sciences, University of Adelaide 44 Biostatistics III

Period effect π:

P1 : 0 P2 : π

Treatment × period interaction, γ, implied by above:

B in P2 : γ (the τ × π interaction)     rest : 0

(Differential) Carry-over effect ρ:

B then A : 0 A then B : ρ

We thus obtain an over-parameterized model for η_{ijk}. Expected values:

G1, P1:  η_{1j1} = µ + 0 + 0 + 0

G1, P2:  η_{1j2} = µ + τ + π + (γ + ρ)

G2, P1:  η_{2j1} = µ + τ + 0 + 0

G2, P2:  η_{2j2} = µ + 0 + π + 0

Observe:

• only get carry-over in P2

• γ, ρ appear once each, and together; i.e., they are said to be intrinsically aliased.

So, for example, the linear model for the response from the jth patient in Group 1 (AB), Period 1, is

y_{1j1} = µ + S_{1j} + e_{1j1}


For the jth patient in Group 2 (BA), Period 2, the model is

y_{2j2} = µ + π + S_{2j} + e_{2j2}

Note that in each case,

Var(Y_{ijk}) = σ_S^2 + σ_e^2

where σ_S^2 and σ_e^2 are called components of variance. But we also have a covariance term (since responses within a patient are correlated):

Cov(Y_{ij1}, Y_{ij2}) = σ_S^2; hence

ρ_S = σ_S^2 / (σ_S^2 + σ_e^2)

which is the correlation between responses within a patient. This model/design is an example of a split-plot design, in which the patients are the main plots, and the 'time points' where repeated observations are taken in P1, P2 are the sub-plots. Here, the main plots comprise a large component of the error.

Sample size calculations: Let n be the number of patients on each treatment in the usual trial design. Let N be the total number of patients in a 2 × 2 crossover trial. For the same power, we can show that

N = n(1 − ρ_S).

Clearly, when the within-patient correlation is high (i.e., ρ_S is large), the advantage of the crossover design is greatest.

What if ρ_S = 0?
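To make the trade-off concrete, here is a minimal Python sketch of this sample size relationship (the function name and the illustrative values of n and ρ_S are ours, not from the notes):

# Sketch: total size N of a 2x2 crossover trial with the same power as a
# parallel-group trial having n patients per treatment: N = n(1 - rho_S).
def crossover_total_size(n_per_group, rho_S):
    return n_per_group * (1.0 - rho_S)

# Illustrative values only: n = 100 per group.
for rho in (0.0, 0.5, 0.8):
    print(rho, crossover_total_size(100, rho))
# rho_S = 0.8 gives N = 20; even rho_S = 0 gives N = 100, half the 2n = 200
# patients of the parallel-group design (each patient yields two responses).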

2.5.3 Analysis

The analysis of the 2 × 2 design is based on the sums (i.e., totals) and the differences of the observations for a subject in P1 and P2. For continuous y: we use t-tests or Mann-Whitney tests. For categorical y: we use χ2 tests. The randomization validates the comparisons between and within groups.


We will perform the analysis using t-tests and confidence intervals. We work with

D_{ij} = Y_{ij2} − Y_{ij1}   (P2 − P1)

T_{ij} = Y_{ij1} + Y_{ij2}   (P1 + P2)

for i = 1, 2.

Differences (within patients): we find for

Group 1 (AB):  D_{11}, D_{12}, . . . , D_{1n_1}

i.i.d. ∼ N(τ + π + γ + ρ, 2σ_e^2). To get the variances, observe that

D_{ij} = Y_{ij2} − Y_{ij1}
       = (η_{ij2} + S_{ij} + e_{ij2}) − (η_{ij1} + S_{ij} + e_{ij1})
       = (η_{ij2} − η_{ij1}) + (e_{ij2} − e_{ij1}).

So,

Var(D_{ij}) = Var(e_{ij2}) + Var(e_{ij1}) = 2σ_e^2.

Similarly,

Group 2 (BA):  D_{21}, D_{22}, . . . , D_{2n_2}

i.i.d. ∼ N(π − τ, 2σ_e^2). Observe that the two groups are independent.

Totals: (for the analysis between patients)

Group 1 (AB):  T_{11}, T_{12}, . . . , T_{1n_1}

i.i.d. ∼ N(2µ + τ + π + γ + ρ, 2σ_e^2 + 4σ_S^2)


Again to get the variances, observe that

T_{ij} = Y_{ij1} + Y_{ij2}
       = (η_{ij1} + η_{ij2}) + e_{ij1} + e_{ij2} + 2S_{ij},

where the first term is constant. So,

Var(T_{ij}) = 2σ_e^2 + 4σ_S^2.

Similarly,

Group 2 (BA):  T_{21}, T_{22}, . . . , T_{2n_2}

i.i.d. ∼ N(2µ + τ + π, 2σ_e^2 + 4σ_S^2)

Consider the totals:

Group 1: T_{11}, T_{12}, . . . , T_{1n_1}, with mean T̄_{1.}
Group 2: T_{21}, T_{22}, . . . , T_{2n_2}, with mean T̄_{2.}

Subtracting the expected totals (Group 1 − Group 2) gives (γ + ρ). This suggests basing a test for no interaction/carry-over on T̄_{1.} − T̄_{2.}, i.e.,

H_0 : (γ + ρ) = 0

The t-statistic is

t = (T̄_{1.} − T̄_{2.}) / ( s_p √(1/n_1 + 1/n_2) )   on n_1 + n_2 − 2 d.f.,

where s_p is the pooled estimate of the standard deviation, with

s_p^2 = { Σ_{j=1}^{n_1} (T_{1j} − T̄_{1.})^2 + Σ_{j=1}^{n_2} (T_{2j} − T̄_{2.})^2 } / (n_1 + n_2 − 2).

This estimates Var(T_{ij}) = 2σ_e^2 + 4σ_S^2.

If we retain the hypothesis of no interaction/equal carry-over, we can then use the differences to test for treatment and period effects.


(1) For the period effect, apply the two-sample t-test to

D_{11}, D_{12}, . . . , D_{1n_1}

and

−D_{21}, −D_{22}, . . . , −D_{2n_2},

because, if there is no period effect, i.e., π = 0 and γ + ρ = 0, then

E(D̄_{1.}) = −E(D̄_{2.}).

The null hypothesis is H_0 : 2π = 0, assuming γ + ρ = 0. Thus

t = (D̄_{1.} + D̄_{2.}) / ( s_D √(1/n_1 + 1/n_2) )   on n_1 + n_2 − 2 d.f.

gives a test of H_0 : 2π = 0; s_D^2 is the pooled within-group estimate of Var(D_{ij}):

s_D^2 = { Σ_{j=1}^{n_1} (D_{1j} − D̄_{1.})^2 + Σ_{j=1}^{n_2} (D_{2j} − D̄_{2.})^2 } / (n_1 + n_2 − 2).

(2) For the treatment effect: use the two-sample t-test

t = (D̄_{1.} − D̄_{2.}) / ( s_D √(1/n_1 + 1/n_2) )   on n_1 + n_2 − 2 d.f.

If γ + ρ = 0 this gives a test of H_0 : 2τ = 0, because

E(D̄_{1.} − D̄_{2.}) = E(D̄_{1.}) − E(D̄_{2.})
                    = (π + τ + γ + ρ) − (π − τ)
                    = 2τ   if γ + ρ = 0.
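These three t-tests are simple to compute. The following is a minimal Python sketch (standard library only); the function names and the tiny illustrative data set are ours, not from the notes:

import math

def two_sample_t(x, y):
    # Pooled two-sample t statistic on len(x) + len(y) - 2 d.f.
    n1, n2 = len(x), len(y)
    xbar, ybar = sum(x) / n1, sum(y) / n2
    ss = sum((v - xbar) ** 2 for v in x) + sum((v - ybar) ** 2 for v in y)
    s_pooled = math.sqrt(ss / (n1 + n2 - 2))
    return (xbar - ybar) / (s_pooled * math.sqrt(1 / n1 + 1 / n2))

def crossover_tests(group1, group2):
    # group_i: list of (P1 response, P2 response) pairs for Group i.
    T1 = [p1 + p2 for p1, p2 in group1]  # totals, Group 1 (AB)
    T2 = [p1 + p2 for p1, p2 in group2]  # totals, Group 2 (BA)
    D1 = [p2 - p1 for p1, p2 in group1]  # differences P2 - P1
    D2 = [p2 - p1 for p1, p2 in group2]
    t_carryover = two_sample_t(T1, T2)             # H0: gamma + rho = 0
    t_period = two_sample_t(D1, [-d for d in D2])  # H0: 2*pi = 0
    t_treatment = two_sample_t(D1, D2)             # H0: 2*tau = 0
    return t_carryover, t_period, t_treatment

# Hypothetical toy data, for illustration only.
g1 = [(10.1, 12.0), (9.4, 11.2), (11.0, 13.1), (10.3, 12.4)]
g2 = [(11.8, 10.9), (12.5, 11.2), (11.1, 10.4), (12.0, 11.0)]
print(crossover_tests(g1, g2))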

A sequential approach to the analysis is thus:


Step 1: Test for interaction/differential carry-over (H0).

Step 2a: Insufficient evidence to reject H0 → test for period effect → test for treatment effect.

Step 2b: If we reject H_0, the interpretation is more difficult; we usually need external information on the possible causes:
(1) true carry-over effects of treatment (different);
(2) psychological carry-over effects;
(3) a true interaction of period × treatment;
(4) the two groups differ significantly (can check from additional information, such as age, sex).

If a true direct-treatment-by-period interaction exists, we can:

• transform the data, e.g., log y, √y

• or we may decide to discard the P2 data, and analyse the P1 data only.

Why? Because the period effects and treatment effects are marginal to the interaction. Is such an approach justified? Yes, because of the randomization. But it is considerably less powerful, because

• the original sample size calculation was based on the cross-over design, and

• σ_S^2 now enters the treatment comparison.

There is a problem with the sequential approach, in that the t-test for no interaction lacks power too. This is because it is based on the T_{ij} with

Var(T_{ij}) = 2σ_e^2 + 4σ_S^2

=⇒ the t-test is conservative.

Using the baseline measurements z_{ij1} can help. For example, for each subject, calculate

T′_{ij} = T_{ij} − 2z_{ij1}

then test T̄′_{1.} − T̄′_{2.}.


Similarly, for the P1-data-only procedure, calculate

Y′_{ij1} = Y_{ij1} − z_{ij1}

and compare the group means Ȳ′_{1.1} and Ȳ′_{2.1}.

Useful plots (using terminology as in Kenward and Jones): do the plots before the analysis.

(1) Subject-profiles plot: This is the simplest plot. For each group, plot the change in the subject’s responses from P1 to P2;

i.e., y_{ij1} vs. P1 and y_{ij2} vs. P2,

and join the responses by a line. See Figure 7.

Figure 7: Subject-profile plot for one subject.

For example, Group 1 (AB); see Figure 8. The idea is to compare treatments within groups. Look for trends, outliers, or anything unusual.

(2) Groups × periods plot: Plot the four group × period means, and join the treatments by lines. See Figure 9.


Figure 8: Subject-profiles plot for Group 1.

Notation:

                    P1             P2
Group 1 (AB)   ȳ_{1.1} (1A)   ȳ_{1.2} (1B)
Group 2 (BA)   ȳ_{2.1} (2B)   ȳ_{2.2} (2A)

Examples. In all these examples (Figures 9 to 13), there is a period effect, where the means in P2 are higher.

Figure 10:

• Parallel lines indicate that the treatment difference is the same in both periods (i.e., no treatment × period interaction).
• Period effect? Yes.
• See also plot (a) on handout.
• The t-test for interaction is a test of parallelism.

Figure 11:

• Carry-over effect of B, which pushes up A in P2(?)


Figure 9: Groups × Period Plot

• Or there is a true interaction where the treatment difference is larger if the response is high.

Figure 12:

• Suggests A has a large effect carried over to P2.
• Or the response may have reached some natural limit =⇒ the difference is smaller in P2.

Figure 13:

• Indicates treatment × period interaction, as the treatment order is reversed.
• Could be a large carry-over effect of A,
• or the result of a true interaction effect.

(3) Subject differences vs. totals plot:

• The most helpful plot.
• Gives an overall view of the data and effects.
• Can see the variation both within and between groups.


Figure 10: Example 1
Figure 11: Example 2

Figure 12: Example 3
Figure 13: Example 4


For each subject, plot d_{ij} vs. t_{ij}. Use a different plotting symbol for each group, and draw the convex hull for each group. Horizontal separation of the groups suggests interaction/carry-over; see Figure 14. Vertical separation of the groups suggests a treatment difference; see Figure 15.

Confidence intervals: an estimate of the treatment difference (i.e., effect size) τ is

τ̂ = (D̄_{1.} − D̄_{2.}) / 2

with standard error (1/2) s.e.(D̄_{1.} − D̄_{2.}); t has (n_1 + n_2 − 2) d.f. Thus, a (1 − 2α) confidence interval is

τ̂ ± t_{n_1+n_2−2}(α) (1/2) s.e.(D̄_{1.} − D̄_{2.}).

Example from Altman: the nicardipine trial. Using our formulation, a 95% confidence interval for τ is (−12.8, −0.16).

Note that the null hypothesis of no interaction is retained, but an estimated 95% C.I. for the interaction term is (−16.5, 9). So, although the interaction is not significant, it is not negligible on the scale of the treatment effect standard error.

Concluding remarks:

• The 2 × 2 crossover design is uncontroversial if differential carry-over, interaction and group effects can be assumed negligible.
• Some of the problems are removed by using higher-order designs (which we analyse using linear models).
• One can incorporate baseline information and covariates.


Figure 14: Subject Differences v.s. Totals Plot (horizontal separation)

Figure 15: Subject Differences v.s. Totals Plot (vertical separation)


2.6 Equivalence trials

Clinical trials are often designed to evaluate whether an experimental therapy or drug (E) is sufficiently similar to an accepted or standard therapy or drug (S) to justify its use. The experimental treatment is expected to be equal in effect, or at least different to within acceptable limits, but not superior to the standard. Such a study is often called an equivalence trial. The benefits of the experimental treatment may include fewer side effects, convenience of use, or lower cost.

Let the parameter δ denote the difference in outcome measures between the two treatments, for example, δ = µ_S − µ_E, or δ = π_S − π_E. Here, positive values of δ indicate superiority of the standard treatment, and δ takes the value 0 when the treatments are equally effective. If we wish to demonstrate that the effects of the two treatments do not differ much in either direction, we want to establish that δ lies between two tolerance limits:

δ_L < δ < δ_U

for which we will 'declare equivalence'. In equivalence testing, this is the alternative hypothesis, H_A. The corresponding null hypothesis is two-sided because it includes values on both sides of the alternative values:

H_0 : δ ≤ δ_L or δ ≥ δ_U

In an equivalence trial, the question of interest is often one-sided. For example, can we show that the experimental treatment is not worse than the standard treatment by as much as δ_U, say? The hypotheses are now

H_0 : δ ≥ δ_U versus H_A : δ < δ_U

We make a Type I error (α) if we falsely conclude equivalence, i.e., we reject H_0 when H_0 is true, which means that we obtain an upper confidence limit less than δ_U when the true value is greater than δ_U. A Type II error (β) is a failure to conclude equivalence when the treatments are similar, i.e., we obtain an upper confidence interval limit greater than δ_U when the true value is less than δ_U. It is desirable to keep both α and β small, especially α, not least because the 'cost' to the drug companies of falsely declaring equivalence would be high.

The sample size formula for comparing two true means (i.e., average equivalence) using the one-sided hypotheses as described above is

n = 2σ^2 (z_α + z_β)^2 / (δ_U − δ)^2


which is the same as the conventional hypothesis formula with δ_1 replaced by δ_U − δ (recall that δ_1 is the minimum difference to be detected in the standard 'superiority' trial). If δ = 0, the formulae coincide, and we obtain the same sample size for an equivalence trial as for the standard null hypothesis of no difference with δ_U as the minimum difference to be detected. As for standard one-sided hypothesis testing, there are various methods for comparing two proportions. With equal numbers of patients in the two groups, the number in each group is given by

n = {π_S(1 − π_S) + π_E(1 − π_E)} [ (z_α + z_β) / (δ_U − (π_S − π_E)) ]^2.

Often it is appropriate to take π_S = π_E = π, and the formula simplifies to

n = 2π(1 − π) [ (z_α + z_β) / δ_U ]^2.
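These formulae are easy to evaluate. A minimal Python sketch (the helper names and the illustrative design values are ours, not from the notes):

from statistics import NormalDist

def z(p):
    # Upper-tail standard normal quantile: P(Z > z(p)) = p.
    return NormalDist().inv_cdf(1 - p)

def n_equivalence_means(sigma, delta_U, delta, alpha, beta):
    # Per-group size for one-sided equivalence of two means.
    return 2 * sigma ** 2 * (z(alpha) + z(beta)) ** 2 / (delta_U - delta) ** 2

def n_equivalence_props(pi, delta_U, alpha, beta):
    # Per-group size for two proportions, with pi_S = pi_E = pi.
    return 2 * pi * (1 - pi) * ((z(alpha) + z(beta)) / delta_U) ** 2

# Illustrative values only (round up in practice):
print(n_equivalence_means(sigma=10, delta_U=5, delta=0, alpha=0.05, beta=0.2))
# about 49.5 per group
print(n_equivalence_props(pi=0.3, delta_U=0.1, alpha=0.05, beta=0.2))
# about 260 per group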


3 Epidemiology and observational studies

3.1 Introduction

Epidemiology incorporates all aspects of the study of disease in human populations, other than designed experiments. Recall that we introduced some of these ideas in Chapter 1. Key references:

• Armitage et al (2002)

• Jewell (2003)

• Clayton & Hills (1993)

• EoB (1998, 2005) ← advanced.

Examples:

• nutritional/dietary studies, including studies of the impact of genetically modified foods

• occupational health: early studies of the relationship between exposure to asbestos and mesothelioma (James Hardie Industries saga)

• long-term health effects from exposure to environmental electromagnetic radiation

• the relationship between childhood leukaemia and mothers’ exposure to radiation

• exposure to cadmium soil contamination and illness (recent incidence at West Lakes, SA)

• surveillance of bioterrorism and other public health threats

• studies of risk of vCJD from dietary exposure to BSE agents

• ‘small area’ prediction of Ross River fever; meningitis

• monitoring the spread of HIV, TB, malaria; forecasting the future spread of infection.

c School of Mathematical Sciences, University of Adelaide 59 Biostatistics III

We know it is often not practical or ethical to conduct experiments on people. It is therefore necessary to conduct observational studies and analyse observational data. In general, observational studies produce weaker conclusions than well-designed experiments. In particular, an observational study allows us to infer that an association is present, but not that the relationship is causal. In principle, the distinction between experiments and observational studies is clear cut and important. However, in practice, the distinction can become blurred, and strong conclusions can be drawn from well-designed epidemiological studies.

Types of observational studies:

We distinguish

(i) A cohort study - this is a prospective longitudinal study. Observations are made on individuals at entry to the study; they are then followed forward in time, and explanatory and response variables are recorded for each individual; e.g., Multicenter AIDS Cohort Study (MACS) www.statepi.jhsph.edu/macs

(ii) A case-control study - this is a retrospective (longitudinal) study. The response is recorded at entry to the study, and an attempt is made to look backwards in time for possible explanatory features (i.e., exposure to risk).

(iii) A cross-sectional study - each individual is observed at just one point in time.

In (i) to (iii) above, the investigator may have substantial control over which individuals are included, and over the measuring process used.

(iv) A secondary analysis - the investigator only has control over the inclusion or exclusion of individuals for analysis; e.g., National AIDS Registry data; e.g., S.A. Cancer Registry.

Study types (i) to (iv) are in

• decreasing order of effectiveness

• the (prospective) cohort study being closest to an experiment;

• and are also in decreasing order of complexity and cost.


For further background reading, see ‘Theory of the Design of Experiments’ by Cox & Reid (2000).

The two principal types of observational studies are cohort studies and case-control studies.

3.2 Cohort Studies

Consider a disease D, and a potential (or suspected) risk factor for disease, A. We want to determine whether A influences the occurrence of D. In a prospective study, we take a sample of subjects who are exposed to A, and a 'comparable' sample who are not exposed. We then track both groups over time, and record a response (or responses) on each subject; for example, whether or not the subject develops D, or the 'survival time' to the development of D. The important difference from clinical trials is that we do not use random allocation to assign the subjects to the two groups.

Some difficulties:

1) The choice of the ‘comparable’ non-exposed group can be problematic. For example, in studying the effects of living at West Lakes, we would not choose the non-exposed sample from the general population because there are almost certainly potential confounders, such as socio-economic status (SES), or exposure to other toxic pollutants, such as living in an industrial area.

2) Low incidence diseases need very large samples (often unrealistically large).

3) Long incubation/latent periods: if the time from exposure to the onset of disease D is long, it will not be practical to conduct a prospective study. Some issues are that: surveillance definitions change over time; people move states/countries, change jobs, etc.

4) Cohort studies are usually complex, more expensive, and take longer.

Major advantages of cohort studies include:

• many medical conditions can be studied simultaneously; and


• direct, current information is obtained over time. It is therefore easier to calculate incidence rates, absolute risk, etc.

N.B.: Cohort studies can be retrospective - these involve cohort identification after a conceptual follow-up period. For example, an historical prospective study (see handout).

3.3 Case-control Studies

Case-control studies are the most widely used, and are usually retrospective. Here we choose a sample of subjects who are known to have the disease D, and a 'comparable' group known not to have the disease:

disease group members = cases;   non-diseased = controls.

We then look back retrospectively to compare their exposure to the risk factor A. There may be multiple exposures, or complex patterns of exposure. For example, select mesothelioma patients from hospitals, and select non-mesothelioma controls, then compare historical exposure to asbestos for the two groups. In practice, we compare the histories and lifetimes of both groups.

The choice of controls is a critical issue: The broad intention (i.e., the ideal) is that the controls should, on average, be similar to the cases in all respects except for the disease D under study and the associated risk factors. In other words, the controls *could* have the disease, but do not. For example, in the asbestos case-study, it would not be appropriate to sample the controls from the general population.

Problems with choice of controls:

• Selection bias, e.g., the controls are younger than the cases. One important reason for obtaining wrong answers from case-control studies is incorrect sampling of the controls (or cases) from the study base.

• The cases form a higher sampling fraction of all cases than the controls do of all non-cases.

• Non-response bias.

• Recall bias.


• Other biases, for example, bias in responders.

We usually match the controls for age and sex, so that the age and sex distributions are similar; other factors often vary with age and sex. Note that matching is usually advocated on the grounds of efficiency, but the main benefit is to reduce confounding and avoid selection bias (this is known as 'de-confounding').

Cases often arise from hospitals → choose controls from the same hospital population with different illnesses. The idea here is that these patients will share characteristics such as SES, environmental conditions, etc. Some problems with hospital controls are that

• the catchment populations for specialist hospitals may not coincide;

• patients sick with other diseases may not represent the population of people free of the disease D. For example, factors associated with an increased risk of these different diseases may appear to be protective against the disease of interest in the cases, because they are over-represented in the controls.

An important question: How many controls? → usually 1, 2 or 3 for each case.

Case-control studies are very useful when

• the disease D is rare;

• there is a long incubation period;

• for handling large datasets.

Exposure ascertainment (assessment) can be a problem, though. It is also important to draw attention to the fact that the best sampling scheme can be invalidated by poor patient compliance.

3.4 Other designs

[See Clayton and Hills.]

In a case-cohort or case-base study, the controls are selected as a random sample of the cohort at the beginning of the study. If the disease is rare and there is little loss to follow-up, then the analysis (see later) may be carried out as usual after first removing from the control sample any individuals who later become cases.

Incidence density sampling in a nested case-control study (matching on time) is useful when assessing exposure is expensive or difficult. Here, we select as controls individuals who are disease-free at the time of diagnosis of the cases. (A nested case-control study is a case-control study nested in a cohort.)

3.5 Binary responses and case-control studies

A binary response is a Bernoulli response variable, Y . We have met binary responses in clinical trials:

e.g., Y = 1 if the patient recovers, 0 otherwise,

and in epidemiological studies:

e.g., Y = 1 if the patient has disease D, 0 otherwise.

Consider a large population of patients, and let

π be the true proportion with the disease D; π can be interpreted as the probability a randomly chosen patient has D; π is also called the risk of disease or the disease rate.

Suppose now that we wish to compare the disease rates π_1, π_2 for two different sub-populations defined by exposure (π_1) or non-exposure (π_2) to a certain risk factor R.

Note that π_1 is the probability of D amongst those at risk (exposed), and π_2 is the probability of D amongst those not exposed.

The following quantities are often used to model disease association:

1) risk difference: ∆ = π_1 − π_2

2) relative risk (or risk ratio): φ = π_1/π_2

c School of Mathematical Sciences, University of Adelaide 64 Biostatistics III

3) odds ratio:

ψ = [π_1/(1 − π_1)] / [π_2/(1 − π_2)]

Note that

∆ < 0 ⇐⇒ φ < 1 ⇐⇒ ψ < 1
∆ = 0 ⇐⇒ φ = 1 ⇐⇒ ψ = 1
∆ > 0 ⇐⇒ φ > 1 ⇐⇒ ψ > 1

In general, ψ is less simple to interpret than φ or ∆.

However, if π_i ≈ 0 for i = 1, 2, then

π_i/(1 − π_i) ≈ π_i  =⇒  ψ ≈ π_1/π_2 = φ,

i.e., the odds ratio is approximately equal to the relative risk if the disease is rare in both populations. This is called the 'rare disease assumption', and note that it can be relaxed by using particular designs.

ψ has the following useful property:

Consider a population of N_1 individuals exposed to the risk factor R, and N_2 non-exposed individuals, where N = N_1 + N_2. Consider also the table of population frequencies that we can construct:

                             Disease (D)
                    yes                  no                              Total
Exposed (R)  yes   N_1π_1           N_1(1 − π_1)                         N_1
             no    N_2π_2           N_2(1 − π_2)                         N_2
Total        N_1π_1 + N_2π_2    N_1(1 − π_1) + N_2(1 − π_2)    N_1 + N_2 = N

1) A prospective probability model:

Consider a prospective (cohort) study. In this case, the probability that an individual chosen randomly from the exposed sub-population has the disease is π_1 (either by definition, or using the probability N_1π_1/N_1).


Similarly, the probability of disease for a randomly chosen individual from the non-exposed population is π_2 (either by definition or via N_2π_2/N_2). The odds ratio of these two probabilities is

ψ = [π_1/(1 − π_1)] / [π_2/(1 − π_2)].

2) A retrospective probability model:

Consider a case-control study. Here we choose a sample of cases from the (sub)population of diseased individuals. If we randomly choose an individual with the disease, the probability that they were exposed is

ρ_1 = N_1π_1 / (N_1π_1 + N_2π_2) = (number of diseased exposed) / (total number of D)

and the odds of exposure amongst the (diseased) cases is

ρ_1/(1 − ρ_1) = [N_1π_1/(N_1π_1 + N_2π_2)] / [N_2π_2/(N_1π_1 + N_2π_2)] = N_1π_1/(N_2π_2).

Similarly, if ρ_2 is the probability that a person chosen randomly from the non-diseased (or control) population has been exposed to the risk factor, then

ρ_2 = N_1(1 − π_1) / {N_1(1 − π_1) + N_2(1 − π_2)}

and

ρ_2/(1 − ρ_2) = N_1(1 − π_1) / {N_2(1 − π_2)}.

This is the odds of exposure amongst the non-diseased (or controls).


Hence the odds ratio

[ρ_1/(1 − ρ_1)] / [ρ_2/(1 − ρ_2)] = [N_1π_1/(N_2π_2)] / [N_1(1 − π_1)/(N_2(1 − π_2))]
                                  = N_1π_1 N_2(1 − π_2) / {N_2π_2 N_1(1 − π_1)}
                                  = π_1(1 − π_2) / {π_2(1 − π_1)}
                                  = [π_1/(1 − π_1)] / [π_2/(1 − π_2)]
                                  = ψ.

So although we clearly cannot estimate π_1, π_2 from the case-control data (since we choose the numbers with and without D), we can estimate ψ, the ratio of the odds of disease amongst the exposed and non-exposed. This tells us that the estimate of the odds ratio does not depend on the study design employed.
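A quick numerical check of this identity (a sketch; the population values below are arbitrary illustrative choices):

# Verify numerically that the retrospective (exposure) odds ratio equals
# the prospective (disease) odds ratio psi.
N1, N2 = 10_000, 90_000   # exposed / non-exposed population sizes (illustrative)
pi1, pi2 = 0.03, 0.01     # disease risks (illustrative)

psi_prospective = (pi1 / (1 - pi1)) / (pi2 / (1 - pi2))

rho1 = N1 * pi1 / (N1 * pi1 + N2 * pi2)                    # P(exposed | D)
rho2 = N1 * (1 - pi1) / (N1 * (1 - pi1) + N2 * (1 - pi2))  # P(exposed | not D)
psi_retrospective = (rho1 / (1 - rho1)) / (rho2 / (1 - rho2))

print(psi_prospective, psi_retrospective)  # equal up to rounding error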

Note:

1) In a prospective sampling design:

π_1 = P(D | R),   π_2 = P(D | R̄).

2) In a retrospective sampling design:

ρ_1 = P(R | D),   ρ_2 = P(R | D̄).

3.6 Estimation and inference for measures of association

To begin, consider the retrospective (case-control) design.

Consider data comprising m_1 diseased subjects of which X_1 have been exposed to R, and m_2 disease-free subjects of which X_2 have been exposed to R.


The exact distributions of X_1, X_2 are hypergeometric (i.e., sampling without replacement), and independent. In most applications, where the sample is small compared to the underlying populations, it is adequate to treat them as independent binomials (i.e., sampling with replacement). That is,

X_1 ∼ B(m_1, ρ_1)
X_2 ∼ B(m_2, ρ_2)

Under this assumption, we estimate ρ_i by

ρ̂_i = X_i/m_i,   i = 1, 2.

For large m_i, approximately,

ρ̂_i ∼ N( ρ_i, ρ_i(1 − ρ_i)/m_i ).

We estimate

ψ = [ρ_1/(1 − ρ_1)] / [ρ_2/(1 − ρ_2)]

by

ψ̂ = [ρ̂_1/(1 − ρ̂_1)] / [ρ̂_2/(1 − ρ̂_2)].

For large samples, it can be shown (using the theory of maximum likelihood) that, approximately,

log ψ̂ ∼ N( log ψ, 1/(m_1ρ_1) + 1/{m_1(1 − ρ_1)} + 1/(m_2ρ_2) + 1/{m_2(1 − ρ_2)} ).

The sampling distribution of log ψ̂ is more symmetric than that of ψ̂, and is thus better approximated by a normal distribution in large samples. Since the mean of log ψ̂ is close to the true log ψ when n is large, it follows that log ψ̂ − log ψ has an approximately normal sampling distribution with zero expectation, and variance V, say. In practice, we estimate via

log ψ̂ = log{ [X_1/(m_1 − X_1)] / [X_2/(m_2 − X_2)] }

and the standard error of log ψ̂ by

ŝ.e. = √( 1/X_1 + 1/(m_1 − X_1) + 1/X_2 + 1/(m_2 − X_2) ).

We will see where these estimates come from shortly.

Important remarks:

1) Several authors use the notation with 'observed frequencies':

                      D          D̄
                   (cases)  (controls)
(exposed)      R      a          b        n_1
(non-exposed)  R̄      c          d        n_2
                     m_1        m_2

In this notation, we find

log ψ̂ = log(ad/bc)

and

ŝ.e. = √( 1/a + 1/b + 1/c + 1/d ).

2) To obtain a confidence interval for ψ: in practice, the best way is to obtain one for log ψ and back-transform the end-points (i.e., exponentiate) to get a c.i. for ψ.

ψ_L, ψ_U obtained in this way are known as logit limits; they tend to be too narrow, especially if any of the cell frequencies are small.
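A minimal Python sketch of these calculations (the function is ours; the demonstration counts are the vaccination/leprosy frequencies from the Clayton & Hills example worked through below):

import math
from statistics import NormalDist

def odds_ratio_ci(a, b, c, d, conf=0.95):
    # Odds ratio ad/bc with logit-limit confidence interval.
    # Table layout: rows R, R-bar; columns D (cases), D-bar (controls).
    psi_hat = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    lo = math.exp(math.log(psi_hat) - z * se_log)
    hi = math.exp(math.log(psi_hat) + z * se_log)
    return psi_hat, (lo, hi)

# Vaccination and leprosy (see the example below): 101 vaccinated cases,
# 46,028 vaccinated controls, 159 unvaccinated cases, 34,594 unvaccinated controls.
print(odds_ratio_ci(101, 46028, 159, 34594))  # psi_hat ~ 0.477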

3.6.1 Finding the approximate variance in a cohort study

To understand the variance in a cohort study, write

log ψ̂ = log( π̂_1/(1 − π̂_1) ) − log( π̂_2/(1 − π̂_2) )

where π̂_1 is the observed proportion of D's amongst the exposed, and π̂_2 is the observed proportion of D's amongst the non-exposed.

We know that the variance of π̂_1 is binomial, i.e., π_1(1 − π_1)/n_1, where n_1 is the number of exposed subjects in the sample, and similarly for π̂_2.


The key to estimating the variance V is then to get a simple approximation of log( π̂_1/(1 − π̂_1) ) in terms of π̂_1. Expanding in a Taylor series about π_1 gives

log( π̂_1/(1 − π̂_1) ) ≈ log( π_1/(1 − π_1) ) + (π̂_1 − π_1) / {π_1(1 − π_1)}.

Since the first term on the r.h.s. of this expression is constant, this immediately shows that

Var{ log( π̂_1/(1 − π̂_1) ) } ≈ Var(π̂_1) / {π_1(1 − π_1)}^2 = 1/{n_1π_1(1 − π_1)}.

We estimate this variance by plugging in π̂_1 for π_1, as usual, to give

Var{ log( π̂_1/(1 − π̂_1) ) } ≈ 1/{n_1π̂_1(1 − π̂_1)} = 1/a + 1/b.

Exactly the same calculation leads to

Var{ log( π̂_2/(1 − π̂_2) ) } ≈ 1/c + 1/d.

Since the exposed and non-exposed are independent samples, we then get an estimate of V by adding these two formulae, so that an effective estimate of the sampling variance of log ψ̂ is

V̂ar(log ψ̂) = 1/a + 1/b + 1/c + 1/d,

applicable to both cohort and case-control designs.

To summarize: the approximate sampling distribution of

(log ψ̂ − log ψ) / √( V̂ar(log ψ̂) )

is N(0, 1). Thus two-sided 100(1 − α)% confidence limits for log ψ are given by

log ψ̂ ± z_α √( V̂ar(log ψ̂) )


where z_α is the (1 − α/2)th percentile of the standard normal distribution. To obtain the relevant confidence limits for ψ, we simply anti-log, i.e., exponentiate the limits for log ψ.

Using the δ-method to find an approximate variance for ψ̂:

The delta-method: for a random variable X, the δ-method gives expressions for the approximate mean and variance of g(X) (see Tutorial). Consider a random variable X ∼ (µ, σ^2). Then, to first order,

E{g(X)} ≈ g(µ)
Var{g(X)} ≈ σ^2 {g′(µ)}^2.

Here, take X to be ψ̂ and g(X) to be log ψ̂. Then, using the δ-method,

Var(log ψ̂) ≈ Var(ψ̂) (1/ψ)^2 = σ_ψ^2 (1/ψ)^2

∴ σ_ψ^2 ≈ ψ^2 Var(log ψ̂)

∴ σ̂_ψ^2 = (ad/bc)^2 ( 1/a + 1/b + 1/c + 1/d ).

Note that this is V̂ar(ψ̂) ≈ ψ̂^2 V̂ar(log ψ̂).

Example: See handout from Clayton & Hills p.157.

This is a case-control study. The controls are a cross-sectional survey of the whole population (this allows us to use pre-existing data for the control group). It will be the case that some members of the control group also have the disease, but for a rare disease, the effect is negligible. What is the appropriate probability model?

The odds of vaccination amongst the leprosy cases is

(101/260) / (159/260) = 101/159 = 0.6352.

The odds of vaccination amongst the healthy controls is

46,028/34,594 = 1.3305,

so the odds ratio is

(101/159) / (46,028/34,594) = (101 × 34,594) / (159 × 46,028) = 0.6352/1.3305 = 0.4774.

Interpretation: vaccination halves the risk. This is the extent of protection against leprosy afforded by vaccination.

Now, log ψ̂ = −0.7394 and

ŝ.e.(log ψ̂) ≈ √0.016241 = 0.1274,   where   V̂ar(log ψ̂) = 1/a + 1/b + 1/c + 1/d.

∴ an approximate 95% c.i. for log ψ is

log ψ̂ ± 1.96 ŝ.e.(log ψ̂),   i.e.,   −0.7394 ± 1.96 × 0.1274.


Thus

(log ψ_L, log ψ_U) = (−0.9838, −0.4842)

=⇒ (ψ_L, ψ_U) = (e^{log ψ_L}, e^{log ψ_U}) = (0.3739, 0.6162)

We conclude, with 95% confidence, that the vaccinated individuals are between 37% and 62% as likely as the non-vaccinated to have leprosy. Clearly the confidence interval does not contain 1, so we can also conclude that ψ is significantly different from 1. Since leprosy is a rare disease, we draw the same conclusions for φ.


3.7 Attributable risk

Definition: Attributable risk (AR) is the proportion of cases in the total population attributable to the risk factor R. AR is usually calculated when it is justified to infer causation from an observed association.

We are assuming exposure is harmful (π_1 > π_2). Consider, as before, a disease D and risk factor R, and the following population (= study base) frequencies:

         D          D̄
R     N_1π_1   N_1(1 − π_1)
R̄     N_2π_2   N_2(1 − π_2)

The attributable risk is defined to be

AR = N_1(π_1 − π_2) / (N_1π_1 + N_2π_2).

To see how the definition arises, observe that the total number of cases is

N_1π_1 + N_2π_2   (∗)


Suppose now that the risk of D in the exposed (sub)population were set to π_2; then the total number of cases would be

N_1π_2 + N_2π_2 = (N_1 + N_2)π_2.

Then the surplus number of cases attributable to R is

N_1π_1 + N_2π_2 − (N_1π_2 + N_2π_2) = N_1(π_1 − π_2).

Expressing this surplus as a proportion of the total number of cases gives

AR = N_1(π_1 − π_2) / (N_1π_1 + N_2π_2).

[NB. If π_1 < π_2, then the risk factor R is protective, and we use alternative methods.]

The AR can also be expressed as

AR = [N_1(π_1 − π_2)/π_2] / [(N_1π_1 + N_2π_2)/π_2]
   = N_1(π_1/π_2 − 1) / [N_1 π_1/π_2 + N_2]
   = N_1(φ − 1) / [N_1(φ − 1) + N_1 + N_2],

where φ = π_1/π_2 is the relative risk.

Divide the numerator and denominator by N_1 + N_2; then

AR = [N_1/(N_1 + N_2)](φ − 1) / { 1 + [N_1/(N_1 + N_2)](φ − 1) }.

This shows that AR ≈ 0 when the numerator is small, and AR is large only if the numerator is large. This means that to obtain a big AR, we must have both

i) N_1/(N_1 + N_2) 'not too small' (approximately 1:1000), and

ii) φ >> 1.


One final form of the AR, which we use to define an estimate from case-control data, is

AR = N_1(π_1 − π_2) / (N_1π_1 + N_2π_2)
   = [N_1π_1/(N_1π_1 + N_2π_2)] (1 − π_2/π_1)
   = θ_1 (1 − 1/φ),

where θ_1 is the proportion of exposed cases in the diseased population. This tells us that we can estimate the AR from the relative risk, and the proportion of the population exposed to R.

3.7.1 Estimation of AR

Consider a table of frequencies obtained from a case-control study:

       D       D̄
R      a       b
R̄      c       d
     a + c   b + d

Observe that

1) θ_1 is estimable from case-control data, and the obvious estimate is θ̂_1 = a/(a + c);

2) φ is not directly estimable, but for rare diseases φ ≈ ψ.

So we take

ψ̂ = ad/bc

and substitute for the relative risk φ to obtain

ÂR = θ̂_1 (1 − 1/ψ̂)
   = [a/(a + c)] (1 − bc/ad)
   = (ad − bc) / {d(a + c)}.


Finally, we derive an approximate standard error for ÂR. Observe first that

1 − ÂR = {d(a + c) − (ad − bc)} / {d(a + c)}
       = c(b + d) / {d(a + c)}
       = (1 − θ̂_1) / (1 − θ̂_2),

where θ_2 is the proportion of exposed controls. Hence,

log(1 − ÂR) = log(1 − θ̂_1) − log(1 − θ̂_2).

Next, observe that θ̂_1 and θ̂_2 are simply observed proportions from two independent binomial samples. Hence,

Var(θ̂_1) = θ_1(1 − θ_1)/m_1
Var(θ̂_2) = θ_2(1 − θ_2)/m_2

where m_1 = number of cases and m_2 = number of controls.

So by the δ-method, we find

Var{log(1 − ÂR)} ≈ Var(θ̂_1)/(1 − θ_1)^2 + Var(θ̂_2)/(1 − θ_2)^2
                 = θ_1/{m_1(1 − θ_1)} + θ_2/{m_2(1 − θ_2)}.

We estimate this by

V̂ar{log(1 − ÂR)} = a/{c(a + c)} + b/{d(b + d)}.

Furthermore, the large-sample distribution of log(1 − ÂR) is asymptotically normal. We obtain an approximate confidence interval for AR by back-transforming the confidence limits for log(1 − AR).
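A minimal Python sketch of the AR estimate and its confidence interval (the function is ours; the demonstration reuses the Hodgkin's disease/tonsillectomy counts of Chapter 4 below, purely for illustration):

import math
from statistics import NormalDist

def attributable_risk_ci(a, b, c, d, conf=0.95):
    # AR-hat = (ad - bc) / (d(a + c)) from case-control frequencies; the CI
    # is computed on the log(1 - AR) scale and back-transformed.
    ar_hat = (a * d - b * c) / (d * (a + c))
    var_log = a / (c * (a + c)) + b / (d * (b + d))
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    log1m = math.log(1 - ar_hat)
    lo = 1 - math.exp(log1m + z * math.sqrt(var_log))  # lower AR limit
    hi = 1 - math.exp(log1m - z * math.sqrt(var_log))  # upper AR limit
    return ar_hat, (lo, hi)

# Illustration with the Hodgkin's disease / tonsillectomy counts (Chapter 4):
print(attributable_risk_ci(90, 165, 84, 307))  # AR ~ 0.26, CI ~ (0.12, 0.37)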


4 Inference for the 2x2 table

4.1 Introduction

Consider the 2x2 table:

        D             D̄
R       a             b          n_1 = a + b
R̄       c             d          n_2 = c + d
    m_1 = a + c   m_2 = b + d    n_1 + n_2

Depending on the sampling scheme we will model the prospective case:

a ∼ B(n_1, π_1)
c ∼ B(n_2, π_2)   independently,

or the retrospective case:

a ∼ B(m_1, ρ_1)
b ∼ B(m_2, ρ_2)   independently.

In both cases we want inference about

ψ = [π_1/(1 − π_1)] / [π_2/(1 − π_2)] = [ρ_1/(1 − ρ_1)] / [ρ_2/(1 − ρ_2)].

4.2 Wald tests

An obvious way to proceed is to take

ψ̂ = ad/bc.

We can then prove (using the δ-method) that


(prospective case)

Var(log ψ̂) = 1/(n_1π_1) + 1/{n_1(1 − π_1)} + 1/(n_2π_2) + 1/{n_2(1 − π_2)}

=⇒ V̂ar(log ψ̂) = 1/a + 1/b + 1/c + 1/d.

Suppose we wish to test

H0 : ψ = 1

⇐⇒ H0 : log(ψ) = 0

In this case we use a Wald test, i.e., take

Z_0 = log ψ̂ / √( 1/a + 1/b + 1/c + 1/d ).

Under H_0, the asymptotic distribution of Z_0 is N(0, 1). So to obtain a test with significance level 2α, we reject when

|Z_0| ≥ z^{(α)},

where z^{(α)} is the upper α point of N(0, 1).

An unsatisfactory aspect of Wald tests is that they are not invariant under transformations of the parameters. For example, observe that

ψ = [π_1/(1 − π_1)] / [π_2/(1 − π_2)] = 1 ⇐⇒ π_1 − π_2 = 0.

Hence we could instead use the standard Wald test for two binomial proportions. In the prospective case, we take π̂_1 = a/n_1 and π̂_2 = c/n_2, and since π̂_1, π̂_2 are independent, we find

Var(π̂_1 − π̂_2) = π_1(1 − π_1)/n_1 + π_2(1 − π_2)/n_2.

We can therefore define an alternative Wald test statistic by

Z_1 = (π̂_1 − π̂_2) / √( π̂_1(1 − π̂_1)/n_1 + π̂_2(1 − π̂_2)/n_2 ).


For the retrospective model, the Wald test statistic is of the form

Z_1 = (ρ̂_1 − ρ̂_2) / √( ρ̂_1(1 − ρ̂_1)/m_1 + ρ̂_2(1 − ρ̂_2)/m_2 ).

Observe that Z_0, Z_1 are different test statistics for the same H_0. There is no theoretical reason to prefer one over the other.

For large samples at least, the different Wald statistics are often numerically similar.

Example: In a retrospective study of the association between Hodgkin’s disease and tonsillectomy, the following data were recorded:

                          Hodgkin's (D)
                         yes      no     Total
Tonsillectomy (R)  yes    90     165      255
                   no     84     307      391
Total                    174     472      646

In this case we find

Z_0 = log{ (90 × 307)/(84 × 165) } / √( 1/90 + 1/84 + 1/165 + 1/307 )

=⇒ Z_0 = 3.8367.

Clearly we reject H_0 and conclude that there is a significant association (i.e., ψ ≠ 1) between Hodgkin's disease and tonsillectomy.

Here, ψ̂ = 1.9935, and tonsillectomy is associated with almost twice the risk of Hodgkin's disease.

To get Z_1, observe ρ̂_1 = 90/174, ρ̂_2 = 165/472; then

Z_1 = (ρ̂_1 − ρ̂_2) / √( ρ̂_1(1 − ρ̂_1)/m_1 + ρ̂_2(1 − ρ̂_2)/m_2 ) = 3.8296,

and we reach the same conclusion.
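The two Wald statistics are easily reproduced; a minimal Python sketch using the table above:

import math

a, b, c, d = 90, 165, 84, 307     # Hodgkin's disease vs tonsillectomy
m1, m2 = a + c, b + d             # cases, controls

# Z0: Wald test on the log odds-ratio scale.
z0 = math.log(a * d / (b * c)) / math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

# Z1: Wald test on the difference of exposure proportions (retrospective form).
r1, r2 = a / m1, b / m2
z1 = (r1 - r2) / math.sqrt(r1 * (1 - r1) / m1 + r2 * (1 - r2) / m2)

print(z0, z1)  # 3.8367 and 3.8296, as above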


4.3 Likelihood Ratio test

An alternative method of inference is based on the likelihood: Consider a statistical problem in which the distribution of data X depends on an unknown parameter θ. That is, the probability distribution is p(x; θ). The log likelihood is then

ℓ(θ; x) = log{p(x; θ)}.

Given a likelihood function ℓ(θ; x), we can

i) Estimate θ by maximising the likelihood,

i.e., θ̂ = arg max_θ ℓ(θ; x).

Usually, θ̂ is the solution to ∂ℓ/∂θ = 0.

ii) Under regularity conditions, we can find a large-sample variance for θ̂. That is, we use Var(θ̂) = 1/I(θ), where

I(θ) = E( −∂²ℓ(θ; x)/∂θ² ).

iii) We can obtain tests of H_0 : θ = θ_0. The Likelihood Ratio (LR) test statistic is

G^2 = 2[ℓ(θ̂; x) − ℓ(θ_0; x)].

For large samples (with θ scalar), the null hypothesis distribution of G^2 is approximately χ_1^2.

Recall our situation:

        D           D̄
R     a ≡ X_1       b     n_1
R̄     c ≡ X_2       d     n_2
       m_1         m_2


We assume X_i ∼ B(n_i, π_i) independently, and we want to make inferences for

ψ = [π_1/(1 − π_1)] / [π_2/(1 − π_2)].

We stated above that the likelihood approach to hypothesis testing for a scalar parameter θ led to the LRT statistic

G^2 = 2[ℓ(θ̂) − ℓ(θ_0)].

For large samples, G^2 has approximately the χ_1^2 distribution if

H_0 : θ = θ_0 holds.

However, in the case of the 2x2 table, we cannot apply this directly because there is also a nuisance parameter. In particular, it is natural to parameterize (π_1, π_2) by

ψ = [π_1/(1 − π_1)] / [π_2/(1 − π_2)],   ω_2 = π_2/(1 − π_2).

In this case we treat ω_2 as a nuisance parameter. There are various approaches to eliminating the nuisance parameter ω_2.

Before we consider these various approaches, we will consider the likelihood and the afore-mentioned parameterisation: Consider the prospective model with:

a + b = number exposed,  c + d = number unexposed;  these are considered fixed.

The model is a 'product binomial' model, where

L ∝ π_1^a (1 − π_1)^b π_2^c (1 − π_2)^d
  = [π_1/(1 − π_1)]^a (1 − π_1)^{a+b} [π_2/(1 − π_2)]^c (1 − π_2)^{c+d}.

Let ω_1 = π_1/(1 − π_1), ω_2 = π_2/(1 − π_2).

Now, interest is in ψ = ω_1/ω_2. So we reparametrize in terms of ω_2 and ψ, i.e., write ω_1 = ψω_2.


We treat ω_2 as a 'nuisance parameter'. Then

L = (ψω_2)^a [1/(1 + ψω_2)]^{a+b} ω_2^c [1/(1 + ω_2)]^{c+d}

and

ℓ = log L = a log(ψω_2) − (a + b) log(1 + ψω_2) + c log ω_2 − (c + d) log(1 + ω_2).

Now consider the approaches to eliminating the nuisance parameter ω2.

4.3.1 Profile Likelihood

The idea is to eliminate ω_2 by maximizing over it, i.e., to replace ω_2 with its most likely value. In particular, consider the likelihood function ℓ(ψ, ω_2; x). We can define

ω̂_2(ψ) = arg max_{ω_2} ℓ(ψ, ω_2; x)   for each ψ.

The profile likelihood is then

ℓ_p(ψ; x) = ℓ(ψ, ω̂_2(ψ); x).

Provided the number of nuisance parameters is fixed, the profile likelihood behaves asymptotically like an ordinary likelihood function. This leads to the LRT statistic

G^2 = 2[ℓ_p(ψ̂) − ℓ_p(ψ_0)],

where ψ̂ = arg max_ψ ℓ_p(ψ; x).

Under H_0 : ψ = ψ_0, G^2 has approximately the χ_1^2 distribution for large samples.

For the comparison of two binomial probabilities π_1, π_2, we test H_0 : ψ = 1. It can be shown that

G^2 = 2{ n_1P_1 log(P_1/P) + n_1(1 − P_1) log[(1 − P_1)/(1 − P)]
       + n_2P_2 log(P_2/P) + n_2(1 − P_2) log[(1 − P_2)/(1 − P)] }


where P_i = X_i/n_i and P = (X_1 + X_2)/(n_1 + n_2).

To see how this formula arises, we use the fact that the likelihood function is invariant under transformation of the parameters. According to our definition,

G^2 = 2[ max_{(ψ,ω_2)} ℓ(ψ, ω_2) − max_{ω_2} ℓ(1, ω_2) ].

However, it is much simpler to observe that the log likelihood can equivalently be parameterized by π_1, π_2. Writing this likelihood as ℓ*(π_1, π_2; x), where

ℓ(ψ, ω_2; x) = ℓ*(π_1, π_2; x),
ψ = [π_1/(1 − π_1)] / [π_2/(1 − π_2)],
ω_2 = π_2/(1 − π_2),

we have

G^2 = 2[ max_{(π_1,π_2)} ℓ*(π_1, π_2; x) − max_π ℓ*(π, π; x) ].

Now

ℓ*(π_1, π_2; x) = x_1 log π_1 + (n_1 − x_1) log(1 − π_1) + x_2 log π_2 + (n_2 − x_2) log(1 − π_2).

To maximize, observe that we can treat each term separately.


Taking

g(π_1) = x_1 log π_1 + (n_1 − x_1) log(1 − π_1),

g′(π_1) = x_1/π_1 − (n_1 − x_1)/(1 − π_1)
        = {x_1(1 − π_1) − π_1(n_1 − x_1)} / {π_1(1 − π_1)}
        = (x_1 − n_1π_1) / {π_1(1 − π_1)}
        = 0
⇐⇒ π_1 = x_1/n_1.

Hence π̂_1 = P_1 = x_1/n_1.

Similarly, π̂_2 = P_2 = x_2/n_2.

Finally, we find by the same method,

arg max_π ℓ*(π, π; x) = (x_1 + x_2)/(n_1 + n_2).

Example: Hodgkin's disease versus tonsillectomy.

G^2 = 14.75 on 1 d.f. (for comparison with the previous Wald test),

√G^2 = 3.84 (the LRT is invariant to the parameterization, so this value is the same whichever orientation of the table is used).
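A minimal Python sketch of this profile-likelihood computation (the function is ours; it implements the G^2 formula displayed above):

import math

def g2_two_binomials(x1, n1, x2, n2):
    # LRT statistic for H0: pi1 = pi2 via the formula above; a term with a
    # zero count contributes zero to the sum.
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)

    def term(count, p_hat, p_null):
        return count * math.log(p_hat / p_null) if count > 0 else 0.0

    return 2 * (term(x1, p1, p) + term(n1 - x1, 1 - p1, 1 - p)
                + term(x2, p2, p) + term(n2 - x2, 1 - p2, 1 - p))

# Hodgkin's data, prospective orientation: 90 of 255 tonsillectomy subjects
# and 84 of 391 non-tonsillectomy subjects have the disease.
print(g2_two_binomials(90, 255, 84, 391))  # ~14.75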

In what follows, consider the following 2 × 2 table:

        D           D̄
R     a ≡ x_1       b     n_1
R̄     c ≡ x_2       d     n_2
       m_1         m_2     N

We now consider a second approach to eliminating the nuisance parameter.


4.3.2 Conditional Inference

General description:

Consider a problem in which p(x; ψ, ω_2) is given, ψ is the parameter of interest and ω_2 is a nuisance parameter.

Another way to eliminate ω_2 is to find a sufficient statistic for ω_2, and to condition on it. That is, we would like to find s(x) such that

p(x; ψ, ω_2) = q(x; ψ) h(s(x); ω_2).

Because then we could calculate the conditional likelihood:

p(x | s(x); ψ, ω_2) = p(x; ψ, ω_2) / Σ_{x′ : s(x′)=s(x)} p(x′; ψ, ω_2)

(here we condition the event A = {X = x} on the event B = {s(X) = s(x)})

                    = q(x; ψ) h(s(x); ω_2) / Σ_{x′ : s(x′)=s(x)} q(x′; ψ) h(s(x′); ω_2)

                    = q(x; ψ) / Σ_{x′ : s(x′)=s(x)} q(x′; ψ).

Then we can use the conditional likelihood p(x|s(x); ψ) to make inference about ψ.

For the problem at hand, observe that

p(x_1, x_2) = C(n_1, x_1) π_1^{x_1} (1 − π_1)^{n_1 − x_1} C(n_2, x_2) π_2^{x_2} (1 − π_2)^{n_2 − x_2}

            = C(n_1, x_1) C(n_2, x_2) (1 − π_1)^{n_1} (1 − π_2)^{n_2} [π_1/(1 − π_1)]^{x_1} [π_2/(1 − π_2)]^{x_2},

writing C(n, k) for the binomial coefficient.

Now let ψ = [π_1/(1 − π_1)] / [π_2/(1 − π_2)], ω_2 = π_2/(1 − π_2);


and observe π_2 = ω_2/(1 + ω_2) =⇒ 1 − π_2 = 1/(1 + ω_2), and π_1 = ψω_2/(1 + ψω_2) =⇒ 1 − π_1 = 1/(1 + ψω_2).

Hence we have

p(x_1, x_2; ψ, ω_2) = C(n_1, x_1) C(n_2, x_2) (ψω_2)^{x_1} ω_2^{x_2} [1/(1 + ω_2)]^{n_2} [1/(1 + ψω_2)]^{n_1}

                    = C(n_1, x_1) C(n_2, x_2) ψ^{x_1} ω_2^{x_1 + x_2} [1/(1 + ω_2)]^{n_2} [1/(1 + ψω_2)]^{n_1}

                    = C(n_1, x_1) C(n_2, x_2) ψ^{x_1} ω_2^{s(x_1, x_2)} [1/(1 + ω_2)]^{n_2} [1/(1 + ψω_2)]^{n_1}.

Hence

p(x_1 | x_1 + x_2 = m_1; ψ, ω_2)
  = [ C(n_1, x_1) C(n_2, m_1 − x_1) ψ^{x_1} ω_2^{m_1} / {(1 + ω_2)^{n_2} (1 + ψω_2)^{n_1}} ]
    / Σ_{x_1′} [ C(n_1, x_1′) C(n_2, m_1 − x_1′) ψ^{x_1′} ω_2^{m_1} / {(1 + ω_2)^{n_2} (1 + ψω_2)^{n_1}} ]

  = C(n_1, x_1) C(n_2, m_1 − x_1) ψ^{x_1} / Σ_{x_1′} C(n_1, x_1′) C(n_2, m_1 − x_1′) ψ^{x_1′},

since the factors involving ω_2 cancel.

We can now make inference for ψ, based on the conditional distribution of x_1 | x_1 + x_2 = m_1. For convenience, call

q*(x_1; ψ) = C(n_1, x_1) C(n_2, m_1 − x_1) ψ^{x_1} / Σ_{x_1′} C(n_1, x_1′) C(n_2, m_1 − x_1′) ψ^{x_1′}

for 0 ≤ x_1 ≤ n_1 and 0 ≤ m_1 − x_1 ≤ n_2 (the non-central hypergeometric distribution).

The conditional maximum likelihood estimate for ψ can be found using

ψ̂_CML = arg max_ψ q*(x_1; ψ).


If we want to test H_0 : ψ = 1, we need only consider q_0*(x_1) := q*(x_1; ψ)|_{ψ=1}. Substituting ψ = 1, we find

q_0*(x_1) = C(n_1, x_1) C(n_2, m_1 − x_1) / C(N, m_1)

          = C(m_1, x_1) C(m_2, n_1 − x_1) / C(N, n_1).

To summarize, we have shown that if X_1, X_2 are independent with X_i ∼ B(n_i, π_i), then

q*(x_1) := P[X_1 = x_1 | X_1 + X_2 = m_1]
         = C(n_1, x_1) C(n_2, m_1 − x_1) ψ^{x_1} / Σ_u C(n_1, u) C(n_2, m_1 − u) ψ^u.

Tests of H_0 : ψ = 1 are based on the null hypothesis distribution

q_0*(x_1) = q*(x_1)|_{ψ=1}

          = C(n_1, x_1) C(n_2, m_1 − x_1) / C(N, m_1)

          = C(m_1, x_1) C(m_2, n_1 − x_1) / C(N, n_1).

To test H0, we can use


1) An approximate Z-test:

From q_0*(x) we have E[X_1] = n_1P, where

P = m_1/N = (x_1 + x_2)/(n_1 + n_2),

and

Var(X_1) = [(N − n_1)/(N − 1)] n_1P(1 − P).

So we can use the approximate Z-statistic

Z = (x_1 − n_1P) / √( [(N − n_1)/(N − 1)] n_1P(1 − P) ).

Observe that, since N − n_1 = n_2,

Z = [(n_1 + n_2)x_1 − n_1(x_1 + x_2)] / { (n_1 + n_2) √( n_1n_2P(1 − P)/(N − 1) ) }

  = (n_2x_1 − n_1x_2) / { N √( n_1n_2P(1 − P)/(N − 1) ) }

  = (P_1 − P_2) / √( [N/(N − 1)] (1/n_1 + 1/n_2) P(1 − P) ),

where P_i = x_i/n_i.

Note: this Z-statistic differs from the score test statistic for testing H_0 : π_1 = π_2 only in the inclusion of the factor N/(N − 1).

The approximate significance level is obtained from the N(0, 1) distribution. Provided that the cell frequencies all exceed 5, the normal approximation is adequate.

2) If the cell frequencies are small, an exact P-value can be obtained from q_0*(x). This is called Fisher's Exact Test. In this case we define:

P-value = Σ_{u : q_0*(u) ≤ q_0*(x_1)} q_0*(u).


This corresponds to the following:

Figure 16: Hypergeometric distribution

In practice, we can evaluate the exact P -value as follows:

1) Find all legitimate values for x_1. The easiest way to do this for small samples is to list all possible 2x2 tables of non-negative integers having the prescribed margins.

2) Calculate q_0*(x_1) for each legitimate value.

3) Calculate the P-value shown above. (Get the two most extreme values and pick the one with minimum probability, etc.)
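The recipe is short enough to implement directly; a minimal Python sketch (the function name and the small demonstration table are ours):

from math import comb

def fisher_exact_p(a, b, c, d):
    # Two-sided Fisher exact P-value for the table [[a, b], [c, d]]:
    # sum the null hypergeometric probabilities q0(u) over all tables with
    # the same margins that are no more probable than the observed table.
    n1, n2, m1 = a + b, c + d, a + c
    N = n1 + n2

    def q0(u):
        return comb(n1, u) * comb(n2, m1 - u) / comb(N, m1)

    legitimate = range(max(0, m1 - n2), min(n1, m1) + 1)
    p_obs = q0(a)
    # the small tolerance guards against floating-point ties
    return sum(q0(u) for u in legitimate if q0(u) <= p_obs * (1 + 1e-9))

# Hypothetical small table, for illustration only.
print(fisher_exact_p(3, 1, 1, 3))  # 0.4857...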


5 Tests based on the likelihood

There are three tests connected to the log likelihood, which are all asymptotically equivalent:

(i) Likelihood Ratio test;

(ii) Wald test;

(iii) Score test.

5.1 Wald test statistic

Denoted W_e; we consider H_0 : θ = θ_0 vs. H_A : θ ≠ θ_0. W_e makes direct use of the m.l.e. θ̂, and is based on the distance between θ̂ and θ_0, the null hypothesis value.

It is defined to be

W_e(x) = (θ̂ − θ_0) √(I(θ̂)),

where I(θ̂) is the expected information evaluated at θ̂.

Recall, E( −∂²ℓ(θ; Y)/∂θ² ) = I(θ), and it is the variance of the score U = ∂ℓ/∂θ, where E(∂ℓ/∂θ) = 0 (assuming sufficient regularity, i.e., that the range of x does not depend on θ). We also know that Var(θ̂) = 1/I(θ), and we estimate Var(θ̂) by 1/I(θ̂).

(For vector θ, I(θ) is a matrix, and Var(θ̂) = I(θ)^{−1}.)

So

W_e(x) = (θ̂ − θ_0) / √( 1/I(θ̂) ) = (θ̂ − θ_0) √(I(θ̂)).


Under H_0 : θ = θ_0, W_e(x) ∼ N(0, 1), or equivalently, W_e(x)^2 is χ_1^2.

Graphically: refer to Figure 17. Note that W_e is the most accurate quadratic approximation in the region of the m.l.e.

Figure 17: Wald Test.

5.2 Likelihood ratio test statistic

The general form for the LR test statistic for testing

H_0 : θ ∈ Θ_0 versus H_A : θ ∈ Θ (composite hypotheses)

where Θ0 ⊆ Θ, is


λ(x) = sup_{θ∈Θ_0} L(θ | x) / sup_{θ∈Θ} L(θ | x),

a 'ratio of likelihoods'.

λ takes values in [0, 1] and H0 is rejected if λ is ’too small’.

The level-α rejection region is {x : λ(x) ≤ C_α}, where C_α is chosen so that

sup_{θ∈Θ_0} P_θ( λ(X) ≤ C_α ) ≤ α.

In general, we need the distribution of λ(X) under H_0 to find the critical region, but this is often hard or impossible to obtain. Happily, Wilks' theorem comes to the rescue. The theorem is a consequence of the asymptotic normality of the m.l.e.

Wilks' Theorem

If H_0 is true, then −2 log λ is asymptotically χ^2_{r−s}, where r is the number of independent parameters in Θ (i.e., dim Θ, the unrestricted parameter space), and s = dim Θ_0 (the null hypothesis parameter space), ∀ θ ∈ Θ_0.

Note that W(x) = −2 log λ(x) is an equivalent test statistic, because a monotonic transformation of a test statistic, together with a corresponding transformation of the critical value, does not change the partition of the sample space into acceptance and rejection regions; hence it does not change the test procedure.

Consider scalar θ: W(x) measures the vertical distance between ℓ at θ̂ and ℓ at θ_0; refer to Figure 18.

Write the general LR test statistic as

W(x) = 2[ℓ(θ̂; x) − ℓ(θ_0; x)]

For scalar θ, W(x) has an approximate χ_1^2 distribution when H_0 : θ = θ_0 is true.


Figure 18: The Likelihood Ratio Test.

In the case of a 2x2 table (i.e., case-control data), we cannot apply this directly because there is also a nuisance parameter ω_2, as we have discussed previously. So we base inference about ψ on the profile likelihood ℓ_p. Provided the number of nuisance parameters is fixed, ℓ_p behaves asymptotically like an ordinary likelihood function.

5.3 Score test statistic

The score test is based on the gradient and curvature of ℓ at θ_0 (the null hypothesis value of the parameter). Refer to Figure 19. The score test statistic is

W_u(x) = ℓ′(θ_0)^2 / I(θ_0),

where I(θ_0) is the Fisher information evaluated at the null hypothesis value, and is equal to the variance of the score, U. The score test is the most accurate quadratic approximation in the region of the null hypothesis value (and is the locally most powerful test).


Figure 19: The Score Test.
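To make the three statistics concrete, here is a minimal Python sketch comparing them for a single binomial proportion under H_0 : p = p_0 (the scenario values are ours, for illustration only):

import math

def binomial_trio(x, n, p0):
    # Wald, LR and score statistics, each approximately chi-square_1 under
    # H0: p = p0, for x successes in n Bernoulli trials (requires 0 < x < n).
    p_hat = x / n

    def loglik(p):
        return x * math.log(p) + (n - x) * math.log(1 - p)

    def info(p):
        # expected information I(p) = n / (p(1 - p))
        return n / (p * (1 - p))

    wald = (p_hat - p0) ** 2 * info(p_hat)
    lr = 2 * (loglik(p_hat) - loglik(p0))
    score = (x / p0 - (n - x) / (1 - p0)) ** 2 / info(p0)
    return wald, lr, score

# Illustrative: 35 successes in 100 trials, H0: p = 0.5.
print(binomial_trio(35, 100, 0.5))  # ~9.89, ~9.14, ~9.00: close but not equal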
