
HAR202: Introduction to Quantitative Research Skills

Dr Jenny Freeman [email protected]

Page 1 Contents

Course Outline: Introduction to Quantitative Research Skills 3

Timetable 5

LECTURE HANDOUTS 7
  Introduction to study design 7
  Data display and summary 18
  Sampling with confidence 28
  Estimation and hypothesis testing 36
  Living with Risk 43
  Categorical data 49
  Simple tests for continuous data handout 58
  Correlation and Regression 69

Appendix 79
  Introduction to SPSS for Windows 79
  Displaying and tabulating data lecture handout 116
  Useful websites 133
  Glossary of Terms 135
  Figure 1: Statistical methods for comparing two independent groups or samples 143
  Figure 2: Statistical methods for differences or paired samples 144
  Table 1: Statistical methods for two variables measured on the same sample of subjects 145
  BMJ Papers 146
    Sifting the evidence – what's wrong with significance tests? 146
    Users' guide to detecting misleading claims in papers 155
  Scope tutorials 160
    The visual display of quantitative information 160
    Describing and summarising data 165
    The Normal Distribution 169
    Hypothesis testing and estimation 173
    Randomisation in clinical investigations 177
    Basic tests for continuous Normally distributed data 181
    Mann-Whitney U and Wilcoxon Signed Rank Sum tests 184
    The analysis of categorical data 188
    Fisher's Exact test 192
    Use of Statistical Tables 194
  Exercises and solutions 197
    Displaying and summarising data 197
    Sampling with confidence 205
    Estimation and hypothesis testing 210
    Risk 216
    Correlation and Regression 223

Page 2 Course Outline: Introduction to Quantitative Research Skills

This module will introduce students to the basic concepts and techniques in quantitative research methods. Students will learn how to conduct a research project and use some simple statistical methods to analyse the resultant data.

Aims

1. To introduce students to fundamental concepts and methods in quantitative research.
2. To give students an awareness of the processes involved in undertaking quantitative research.

Learning Outcomes

By the end of the unit, a student will be able to:
1. Classify and appropriately display and summarise different types of data.
2. Describe the properties of the Normal distribution.
3. Distinguish between a population and a sample, and understand what is meant by the term 'standard error'.
4. Explain what a confidence interval is and interpret calculated confidence intervals as applied to means, proportions, differences in means, and differences in proportions.
5. Describe the process of setting and testing statistical hypotheses.
6. Distinguish between 'statistical significance' and 'clinical significance'.
7. Undertake a simple piece of research and report the findings, both as a poster and as a written report.

Group Project Outline. Students and alcohol

In groups you are to conduct a piece of quantitative research about students and alcohol. The groups will be allocated randomly by the lecturer. It is up to you to think of a topic; decide upon a research question to be investigated; formulate a hypothesis; design and conduct a study to test the hypothesis; analyse the results of the study; as a group present the results as a poster; write individual reports on the study findings. The individual reports should be in the style of the BMJ.

Page 3 Assessment

There will be two forms of assessment.
1. In their project groups, the students will be expected to produce and present a poster of the results of their research project (20%). The presentation will be of 10 minutes' duration.
2. Individually, each student will be expected to produce a project report of between 1,000 and 1,500 words, in the style of a quantitative paper in the BMJ (i.e. abstract, introduction, methods, results, discussion and conclusions, and references) (80%).

Page 4 Timetable

Date Lecture title

14th Feb Introduction to Module

At the end of this session you should:
· Know about the different study designs used in quantitative research
· Be able to distinguish between the different types and know when they are appropriate
· Be able to distinguish between the strength of evidence provided by the different study designs

21st Feb Displaying data

At the end of this session you should:
· Know about the different types of quantitative data and be able to distinguish between them
· Be able to display data appropriately using a variety of charts
· Calculate basic summary measures
· Be aware of the elementary properties of the Normal distribution

28th Feb Sampling

At the end of this session you should:
· Be able to distinguish between a population and a sample
· Know about different methods of sampling
· Be able to calculate and understand what is meant by the term standard error (SE) and be able to distinguish this from the standard deviation (SD)
· Understand what is meant by the term confidence interval

7th March Hypothesis testing

At the end of this session students should:
· Know about the process of setting and testing statistical hypotheses
· Be able to explain:
  o Null hypothesis
  o P-value
  o Type I error
  o Type II error
  o Power
· Demonstrate awareness that the p-value does not give the probability of the null hypothesis being true and that p>0.05 does not mean that we accept the null hypothesis
· Distinguish between 'statistical significance' and 'clinical significance'

Page 5

14th March Risk

At the end of this session students should:
· Know about different measures of risk
· Be able to explain:
  o Risk
  o Odds and odds ratios
  o Absolute risk reduction/excess
· Be familiar with the concept of risk ladders

10th April SPSS 1

17th April Analysis of categorical data

At the end of this session students should:
· Be able to recognise categorical data
· Be able to compare:
  o a single proportion to some pre-specified value
  o two proportions
· Know how to analyse data expressed in frequency tables
  o 2x2 tables

24th April Analysis of continuous data

At the end of this session students should:
· Know the difference between parametric and non-parametric tests
· Be aware that data are not non-parametric, it is the test that is
· Be able to carry out simple statistical tests:
  o Paired and unpaired t-test
  o Sign test
  o Wilcoxon signed rank test
  o Mann-Whitney U test

1st May Regression and correlation

At the end of this session students should:
· Display bivariate qualitative data graphically or in table form
· Construct and interpret scatterplots for bivariate quantitative data
· Recognise the appropriate uses of correlation and regression
· Interpret correlation coefficients and regression equations

8th May SPSS 2

15th May How to mislead with statistics

22nd May Poster Presentation

Page 6 LECTURE HANDOUTS

Introduction to study design

Study Design

At the end of the session, you should know about:
• Types of study design commonly used in quantitative research

Dr Jenny Freeman, Lecturer in Medical Statistics

At the end of the session, you should be able to:
• Distinguish between different types of quantitative study design and know when they are appropriate
• Distinguish between the strength of evidence provided by different study designs

Quantitative Research Process
• Have an idea
• Formulate a hypothesis
• Design study
• Collect data
• Analyse data
• Draw conclusions
• Disseminate results

Main aim of design
• Most studies try to relate an input to an output
  – Do mobile phones cause brain tumours?
  – Do statins reduce heart attacks?
• They try to establish the relationship and to quantify it
• A good design will have maximum precision, minimum bias and use the fewest resources

[Scatter diagrams illustrating: bias present, low precision; bias present, high precision; no bias present, low precision; no bias present, high precision]

Categories of Research Design
Research study design can be classified in several ways, for example:
• Observational or experimental
• Prospective or retrospective
• Longitudinal or cross-sectional

Page 7

Observational or experimental? Observational or experimental?

Observational
• Researcher collects information on the attributes or measurements of interest but does not influence events. Studies of this type include surveys, case-control studies and cohort studies
• Observational studies may also be comparative, but they are most commonly descriptive

Experimental
• Researcher deliberately influences events and investigates the effects of the intervention. Studies of this type include randomised controlled trials and many laboratory and animal studies
• Generally, stronger inferences can be made from experimental studies
• Experimental studies are usually carried out to make comparisons between groups

Prospective or retrospective?

Prospective
• Data are collected forwards in time from the start of the study
• Examples include randomised controlled trials and some observational studies

Prospective (continued)
• Historical controls
  – Eg compare survival of patients who have had a heart transplant with similar patients before heart transplantation became available
• Before-and-after studies
  – Eg Mills et al evaluated whether a Government education campaign had increased public knowledge of AIDS: questionnaires sent to a random sample of the population before and after the campaign
• Quasi-experimental studies
  – To compare groups, some of whom got an intervention and others not, perhaps for administrative convenience

Prospective or retrospective?

Retrospective
• Data refer to past events and may be acquired from existing sources, such as case notes or interview
• Examples include case-control studies and some observational studies

Cross-sectional Survey

Advantages
• Cheaper & less time consuming than an experimental design
• Can control for variables during data analysis
• Can collect data on a large scale

Page 8 Longitudinal or cross-sectional?

Longitudinal
• Studies which investigate changes over time, possibly in relation to an intervention
• Observations are taken on more than one occasion
• Examples include randomised controlled trials and cohort studies
• May be either prospective or retrospective

Cross-sectional
• Studies in which individuals are observed only once
• Examples include most surveys and some observational studies
• Can be used for:
  – Disease description
  – Diagnosis and staging
  – Method comparison

Types of study
• Cross-sectional survey
• Case-control study
• Cohort study
• Experimental study/randomised controlled trial
• Systematic review/meta-analysis

Cross-sectional studies
• Surveys
• Questionnaires
• Studies of …

Cross-sectional Survey
• Studies in which individuals are observed only once
• Examples include most surveys, some observational studies and censuses
• Usually descriptive, though some cross-sectional studies are carried out to investigate associations between a disease and possible risk factors
• Advantages include there is no loss to follow-up and no recall bias
• Relatively cheap and easy to carry out
• Disadvantages include sample selection, response rates
• Not usually possible to distinguish between cause and effect

Cross-sectional Survey
• Face to face interview
• Postal survey
• Telephone survey
• Exploratory survey – describes an unfamiliar area
• Descriptive survey – doesn't involve the testing of a hypothesis
• Comparative / correlational survey – describes an association between a set of variables

Page 9 Cross-sectional Survey

Disadvantages
• Dependent upon the reliability and validity of the questionnaire
• Dependent on the skill of the interviewer
• Dependent on the honesty of the respondent
• Weaker evidence of a cause and effect relationship
• Dependent upon external validity

Ecological study
• Studies in which information on the characteristics and/or exposures of individual members of the population groups is generally not obtained
• Existing statistics are used to compare the mortality or morbidity experience of one or more populations with some overall index of exposure
• Care is needed to avoid the 'ecological fallacy': the assumption that an observed relationship in aggregated data will hold at the individual level
• Advantages of using routine data are their low cost, ready acceptability and (where official sources are used) authoritative nature
• A major disadvantage is that often the data are not adequate for the purposes of the investigation

Conducting cross-sectional studies
• Always sample from a defined population
• Try and get a good response rate (>70%)
• Try and characterise non-responders

Sampling in cross-sectional studies
• Quota
• Convenience
• Random
• Stratified random

Biases in cross-sectional studies
• Non-response
  – Volunteers not typical
  – Population not typical
  – Can't ascribe causality
• Example: a cross-sectional study revealed older people are shorter than younger
• Possible explanations:
  – People shrink when they get older
  – Younger generations are getting taller
  – Tall people more likely to die, so not present amongst old people!

Case-control study
• Start with patients with disease
• Select control group of people who do not have disease
• Compare risk factors in the two groups

Page 10 Case-control study
• A group of subjects (cases) with the disease or condition of interest is compared to a group of subjects (controls) without the disease
• The purpose of the comparison is to determine whether, in the past, the cases have been exposed differently to a specific factor (or factors) than the controls

Progress of case-control study (diagram): subjects with disease (cases) are classified as exposed or not exposed; subjects without disease (controls) are classified as exposed or not exposed

Selection of controls
• Main principle – controls must be potential cases
• Unmatched – select a random sample of non-cases (eg from same clinic but with different disease)
• Matched – select controls with similar known prognostic factors as cases

Case-control study: Advantages
• Inclusion of a control group means that one can control for confounding
• Relatively simple
• Usually less costly and time consuming than RCT or cohort study
• Useful when the condition of interest is rare

Case-control study: Disadvantages
• Selection bias – especially in relation to the control group
• Researcher bias – especially in relation to data collection
• Ascertainment bias – historical data may be poorly recorded/illegible
• Recall bias – cases more likely to recall exposure
• Cannot influence the duration/nature of exposure to the intervention
• Confounding – other factors related to exposure and outcome

Example: matched case-control study
• All cases of testicular cancer in a defined area over a defined time
• Control: men in same hospital as cases, within 2 years of age, same ethnic group, with malignancy other than testicular cancer
• Exposure: undescended testes at birth
• Conclusion: men with undescended testes at birth have a higher risk of testicular cancer (Brown et al, 1987)

Page 11 Cohort study
• Subjects are identified and grouped according to whether or not they have been exposed to a specific factor
• The groups are followed up over time to determine whether the incidence of a particular disease is any greater (or less) in the exposed group than in the non-exposed group
• Both exposure and disease may have occurred at the time of the study (retrospective cohort study)

Cohort study
• Because of the need to observe individuals over a period of time, cohort studies can be expensive and take a long time to complete
• Usually unsuitable for studying rare outcomes
• Other problems include loss to follow-up and the selection of subjects to follow up

Cohort studies
• Identify population who are exposed to risk factor, and do not have disease
• Follow up over time

Cohort study
Factors to consider in a cohort study:
• Definition of the groups to be followed
• Method of follow-up
• Changing exposure over time
• Definition of disease
• Duration of follow-up
• Losses to follow-up
• Sample size
• Cost, difficulty, time to complete study

Structure of a cohort study (diagram): people without disease are divided into exposed and not exposed; each group is followed up to see who does and does not get the disease

Examples
• Doll and Hill – smoking and lung cancer
• Barker – birth weight and blood pressure in middle age (retrospective cohort)

Page 12 Case-control vs cohort
• Case-control: cheap and quick
• Cohort: expensive and slow
• Case-control: recall bias
• Cohort studies: drop-out bias
• Cohort studies preferred and enable direct evaluation of relative risk

Randomised controlled trial
• Experiments on human beings
• Control group, who do not receive the intervention
• Placebo – inert treatment
• Hawthorne effect – experience of being in a study can influence behaviour/attitudes
• Aim to quantify the effect of an intervention
• Best method of evaluating therapy, as all others have greater potential biases

Randomised controlled trial
• Designed experiments in which the subjects are randomly allocated to treatment or intervention groups, so that the allocation cannot be predicted in advance
• Randomisation tends to produce groups comparable in unknown as well as known factors likely to influence outcome apart from the treatment itself

Randomised controlled trial
• To obtain an objective view of the efficacy of the treatment it is desirable to have the patient 'blinded' to the treatment they are receiving
• To further increase objectivity it is also desirable to have the physician 'blinded' to the treatment
• Studies in which both the patient and the physician are blinded are called 'double-blind'

ABC of trial design

A: Allocation at random
This must be done properly, eg shuffled envelopes, coin tossing or using a computer. Use of date of birth or order of arrival at clinic is not random.

B: Blindness
If possible subjects should be blind to the treatment they are receiving. It may be possible to use a placebo, which is identical in appearance to the active treatment. If blinding is impossible, it may be possible to blind evaluators.

Page 13 ABC of trial design

C: Control group
Vitally important to have a contemporaneous control group. Often this is 'usual care'.

Types of randomised controlled trial: Parallel design
• Patients diagnosed and consent to enter study
• Patients randomised to different treatments using computer generated randomisation
• Patients followed up over time

Parallel group randomised controlled trial (diagram): population → eligible, willing → assessment → randomisation → intervention or controls → follow-up

Types of randomised controlled trial: cross-over trial
• Example: 2x2 cross-over trial
• Patients are randomised to one treatment, and evaluated (period 1)
• There follows a wash-out period
• Then randomised to the other treatment and evaluated (period 2)

Two-period cross-over design (diagram): eligible, willing patients are assessed and randomised to control then intervention, or intervention then control, with a washout period between the two treatment periods and assessment after each

Parallel group vs cross-over
Cross-over only suitable:
– If there is a reasonably short period between treatment and outcome
– Condition is stable and reversible (i.e. treatment not expected to cure)
• Cross-over trials used in chronic conditions (eg asthma, arthritis)
• Cross-over trials can be more powerful than parallel groups since the patient acts as their own control
• Cross-over trials may suffer from the carry-over effect when treatment in period 1 affects outcome in period 2

Page 14 Pragmatic/explanatory trials
• An explanatory trial is one which seeks to answer the question 'Does this treatment work (in the best circumstances)?' (i.e. everyone well diagnosed, everyone takes treatment, no drop-outs)
• A pragmatic trial seeks to answer the question 'Does this treatment work in practice?' (i.e. some people with wrong diagnosis, some people take wrong treatment)
• A pragmatic trial leads to an 'Intention to Treat' policy, i.e. analyse by what a patient is randomised to, not what they actually take

Types of randomisation
• Block randomisation
  – Patients randomised within blocks
• Stratified randomisation
  – Patients first split into major prognostic groups and then randomised
• Cluster randomisation
  – Patients randomised in groups
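To make block randomisation concrete, here is a minimal Python sketch (not part of the original handout; the block size of four and the two arm labels are illustrative assumptions). Patients are allocated in shuffled blocks so the numbers in each arm stay balanced throughout recruitment.

```python
import random

def block_randomise(n_patients, block_size=4, arms=("A", "B")):
    """Allocate patients to arms in shuffled blocks so numbers stay balanced."""
    if block_size % len(arms) != 0:
        raise ValueError("block size must be a multiple of the number of arms")
    allocations = []
    while len(allocations) < n_patients:
        block = list(arms) * (block_size // len(arms))  # e.g. ['A', 'B', 'A', 'B']
        random.shuffle(block)                           # random order within the block
        allocations.extend(block)
    return allocations[:n_patients]

random.seed(42)              # fixed seed so the example is reproducible
print(block_randomise(10))   # after every complete block the two arms are equal in size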

Problems with randomised trials
• In small studies may get imbalances 'by chance'
  – Solution: stratify by important factors
• Not a random sample from the population, so problems of generalisability
  – Subjects are volunteers
• Ethics: 
  – If treatment already believed to work
  – If being in trial may harm the participant, but be for 'greater good'
  – Stopping early

Why randomisation?
• Guarantees in the long run that patients in different groups are similar in both known and unknown prognostic factors
• Ensures it is impossible to predict in advance who will get what treatment

Problems with non-randomised studies
Cannot be sure change is due to intervention.
Example, Christie (1979): A consecutive series of patients admitted with stroke to a hospital in 1974 was followed up for survival. After a CT scanner was installed in 1978, an equal number of patients were followed up, matched by age, diagnosis and level of consciousness. Some patients, although eligible, did not get a scan in 1978. The results are given in the following table.

Christie's results
                                    CT scan in 1978   No CT scan in 1978
Pairs where 1978 better than 1974   19 (31%)          34 (38%)
Pairs where 1978 worse than 1974     2 (7%)           17 (19%)
Total                               29 (100%)         89 (100%)

Page 15 Interpretation
• If we only looked at those who got a scan – very impressive
• However those who didn't get a scan in 1978 also did better
• Explanation?
  i) Treatment generally getting better
  ii) Perhaps diagnosis different in 1978 to 1974

Systematic review/meta-analysis
• Review of the research evidence for a particular treatment/condition/exposure prepared by applying a 'systematic' approach to the review process
• Systematic reviews establish whether scientific findings are consistent and can be generalised across populations, settings and treatment variations, or whether findings vary significantly by particular subsets

Systematic review/meta-analysis
• Strict methods are used to limit bias and improve reliability and accuracy of conclusions
• Meta-analysis, where the results of individual studies are amalgamated (quantitatively synthesised), can increase the power and precision of estimates of treatment and exposure effects

Choosing a study design
• The choice of appropriate design is not easy and there are many things to consider
• Often determined by resources, costs and the nature of the disease/exposure
• If possible, both ethically and logistically, it is preferable to carry out an experiment
• The evaluation of alternative treatments is best addressed by a randomised controlled trial
• RCTs are the most rigorous (single study) design used in experimental research – especially if you want to examine cause and effect

Summary
• Design is more important than analysis (you can always re-do an analysis, but it is hard to re-collect your data)
• Important when reading a paper to decide on the design at the outset

You should now know about:
• Types of study design commonly used in quantitative research
You should now be able to:
• Distinguish between different types of quantitative study design and know when they are appropriate
• Distinguish between the strength of evidence provided by different study designs

Page 16 References:
• Introduction to Epidemiology. Bailey et al. Open University Press
• Design of Studies for Medical Research. D Machin & MJ Campbell. Wiley 2005

Questions?

Page 17 Data display and summary

Displaying and summarising data
Dr Jenny Freeman, Lecturer in Medical Statistics

What is Data?
At the end of the session you should be able to:
• Understand how to appropriately display data using a variety of charts, such as stem & leaf plots, histograms, bar charts, box & whisker plots and dot plots
• Understand when it is appropriate to use particular summary measures: mean, median, mode, range, interquartile range, standard deviation
• Understand elementary properties of the Normal distribution
• Distinguish between positive and negative skew

Types of Data

Categorical (Qualitative)
• Nominal (no natural ordering)
  – Haemoglobin types
  – Sex
• Ordered categorical (ordinal)
  – Anaemic / borderline / not anaemic
  – Grades of breast cancer

Quantitative (numerical)
• Count (can only take certain values)
  – Number of positive tests for anaemia
  – Number of children in a family
• Continuous (limited only by accuracy of instrument)
  – Haemoglobin concentration (g/dl)
  – Height

Exercise 1
Classify the following data as quantitative (discrete / continuous) or qualitative (nominal / ordinal / binary):
• Age
• Marital status
• Blood pressure
• Number of visits to GP per year
• Number of decayed, missing or filled teeth
• Cholesterol level
• Ethnic group

Example
• RCT of cost effectiveness of community leg ulcer clinics
• 233 patients with venous leg ulcers randomly allocated to either usual care at home by district nursing team (control group, n=113) or weekly treatment with four layer bandaging in a specialist leg ulcer clinic (intervention group, n=120)
• Outcomes of interest include relative costs for each group, time to complete ulcer healing, patient health status, recurrence of ulcers, satisfaction with care and use of services

Morrell C J, Walters S J, Dixon S, Collins K A, Brereton L M L, Peters J, Brooker C G D. (1998) Cost effectiveness of community leg ulcer clinics: randomised controlled trial. British Medical Journal 316: 1487-1491.

Displaying categorical data
• For categorical variables such as sex and blood group it is straightforward to present the number in each category or express it as a percentage of the total number of patients
• Can use either a bar chart or a pie chart to display these data graphically
• Always give sample sizes
• Avoid 3-D charts
• Only use pie charts when the number of categories is low (< 5)

Page 18 Bar chart of marital status for the leg ulcer patients (n=233) Example of 2-D versus 3-D bar chart

[Bar charts of percentage of patients by marital status (Missing, Married, Single, Div/Sep, Widowed). Figure 1: 2-D bar chart (recommended); Figure 2: 3-D bar chart (not recommended).]

[Example of 3-D bar chart with patterns (definitely not recommended), and one final modification: order categories by largest first. Both charts show percentage of patients by marital status.]

Stacked bar chart showing relationship between maternal age and breastfeeding.
Taken from: O'Cathain A, Walters S J, Nicholl J P, Thomas K J, Kirkham M. Use of evidence based leaflets to promote informed choice in maternity care: randomised controlled trial in everyday practice. BMJ 2002;324:643-647.

Pie chart showing marital status for the leg ulcer patients (n=233).

Page 19 Displaying quantitative data
• Stem & leaf plots
• Dot plots
• Histograms
• Box & whisker plots
• Scatterplots

Heights of male leg ulcer patients (n=76), in cm:
187 177 193 172 185 175 177 177 170 165 172 177 177 177 177 172 177 177
190 177 172 182 185 182 172 182 177 182 170 157 172 172 175 182 175 185
187 187 187 172 172 177 187 180 167 170 170 182 170 162 162 185 177 177
180 180 177 172 180 180 177 180 175 177 177 167 182 165 187 180 177 172
172 175 170 180

Stem & leaf plot for heights of male leg ulcer patients

Frequency   Stem & Leaf
 1.00       Extremes (=<1.57)
 3.00       16 . 222
 4.00       16 . 5577
18.00       17 . 000000222222222222
24.00       17 . 555557777777777777777777
15.00       18 . 000000002222222
10.00       18 . 5555777777
 1.00       19 . 0
 1.00       Extremes (>=1.93)

Stem width: 0.10   Each leaf: 1 case(s)

[Dot plot of height for leg ulcer patients]

Histogram of height for the leg ulcer patients, sexes combined (n=218), and histograms of height by sex (men n=76, women n=142)
• No spaces between bars – distinguish from bar-chart
• Use equal sized intervals
• Number of intervals (bins) should be between 5 and 15, so the plot can display 'shape' without 'noise'
• Always give sample size

[Histograms of frequency against height in metres; see the plotting sketch below]
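As an illustration of these plotting guidelines, the following Python/matplotlib sketch (not part of the original handout) draws a histogram and a box & whisker plot for a height variable; the short list of values is only a stand-in for the full set of 76 male heights listed above.

```python
import matplotlib.pyplot as plt

# Stand-in for the 76 male heights (in metres) listed earlier -- only a few shown here
heights = [1.87, 1.77, 1.93, 1.72, 1.85, 1.75, 1.77, 1.70, 1.65, 1.72]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.hist(heights, bins=8, edgecolor="black")   # 5-15 equal-width bins, no gaps between bars
ax1.set_xlabel("Height in metres")
ax1.set_ylabel("Frequency")
ax2.boxplot(heights)                           # box = IQR, line = median, whiskers to non-outliers
ax2.set_ylabel("Height in metres")
plt.show()
```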

Page 20 Table of systolic blood pressure levels for 16 patients from the obesity study, before and after one of the study exercise sessions:

Subject   BP before (mmHg)   BP after (mmHg)   Difference (after - before)
 1        148                152                 4
 2        142                152                10
 3        136                134                -2
 4        134                148                14
 5        138                144                 6
 6        140                136                -4
 7        132                144                12
 8        144                150                 6
 9        128                146                18
10        170                174                 4
11        162                162                 0
12        150                162                12
13        138                146                 8
14        154                156                 2
15        126                132                 6
16        116                126                10

Exercise 2
Display the systolic blood pressure of the sample of 16 study participants before exercise using:
(a) Stem & leaf plot
(b) Dot plot

Example of CONSORT style flow diagram of patient numbers
Taken from: 'Costs and effectiveness of community postnatal support workers: randomised controlled trial.' Morrell et al, BMJ 2000; 321:593-598.

Summarising Numerical Data
• Graphs are a useful starting point as they give us a 'feel' for the data and show how the data are distributed
• Need to summarise data and we are often interested in:
  – What's the average value?
  – What's the spread of the data?
• A measure of location (average) and variability (spread) provides an informative but brief summary of a set of observations

Measures of location
• Need a numerical way of summarising the location / average value in a dataset
• There are three main measures:
  – Mode
  – Median
  – Mean

Measures of location
Mode: Most common observation
Median: Middle observation, when the data are arranged in order of increasing value. If we have an even number of observations, e.g. if we take 50 results, the midpoint falls between the 25th and 26th, and the median is calculated as the average of the two middle observations.
Mean: Sum of all observations divided by the number of observations

Page 21 Measures of location: Mode
• The simplest measure of location is the mode, which is simply the most common value observed:
  – e.g. for the BP data before exercise the mode = 138 mmHg

Measures of location: Median
• Order the observations – the median is the middle observation
• Odd numbers of observations will have a unique median
• When there is an even number of observations there is strictly no middle observation – take the mean of the two middle observations

Calculating the median for the blood pressure data before exercise

BP (mmHg)   Rank
116          1
126          2
128          3
132          4
134          5
136          6
138          7
138          8
140          9
142         10
144         11
148         12
150         13
154         14
162         15
170         16

• As the number of observations is even (n=16), the median is the average of the two central values (the 8th and 9th)
• So the median blood pressure before exercise is (138+140)/2 = 139 mmHg

Measures of location: Mean

Given n observations x₁, x₂, ..., xₙ, the mean is

  x̄ = (x₁ + x₂ + ... + xₙ)/n = (Σᵢ₌₁ⁿ xᵢ)/n

For the blood pressure data before exercise:

  x̄ = (148 + 142 + 136 + ... + 116)/16 = 2258/16 = 141.1 mmHg
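The same calculations can be checked in a few lines of Python; this sketch (added for illustration, not from the handout) reproduces the mean of 141.1 mmHg and the median of 139 mmHg for the before-exercise blood pressures.

```python
# Before-exercise systolic blood pressures from the table above (mmHg)
bp_before = [148, 142, 136, 134, 138, 140, 132, 144,
             128, 170, 162, 150, 138, 154, 126, 116]

n = len(bp_before)                              # 16 observations
mean_bp = sum(bp_before) / n                    # 2258 / 16 = 141.1 mmHg
ordered = sorted(bp_before)
median_bp = (ordered[n // 2 - 1] + ordered[n // 2]) / 2   # average of 8th and 9th = 139 mmHg

print(round(mean_bp, 1), median_bp)
```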

Pros and cons of mean / median / mode
• Median robust to outliers
• Median/mode reflects what 'most' people experience
• Mean uses all the data (more 'efficient')
• Mean is the 'expected' value
• Mean more common with statistical tests
• Mode useful for grouped or categorical data

Exercise 3
Calculate the following measures of location for the difference (after - before) blood pressure data:
(a) mean
(b) median
(c) mode

Page 22 Measures of spread
• Need a numerical way of summarising the amount of variability in a data set
• Three main approaches to quantifying the variability:
  – Range
  – Inter quartile range
  – Standard deviation

Quantifying variability: measures of spread
• Range: minimum observation to maximum observation
• Interquartile range: the observation below which the bottom 25% of data lie and the observation above which the top 25% of data lie
  NB: If the value falls between two observations, eg if the 25th centile falls between the 5th and 6th observations, then the value is calculated as the average of the two observations (this is the same principle as for the median)
• Standard deviation (SD): average distance of the observations from the mean value (NB: Variance = SD squared)

Range
• Simplest way to describe the spread of a data set is to quote the minimum (lowest) and maximum (highest) value
• e.g. the range for the BP data was 116 to 170 mmHg, or as a single number 54 mmHg
• Affected by extreme values at each end of the data

Inter quartile range
• Split the data set into four equal parts – quartiles:
  – Lower quartile (25th centile or Q1)
  – Median (50th centile or Q2)
  – Upper quartile (75th centile or Q3)
• Inter quartile range (IQR) tells you where the middle 50% of your data lies
• IQR = upper quartile - lower quartile
• A graphical way of summarising data using percentiles is the box & whisker plot

Box and whisker plot of height for the leg ulcer patients
• The box illustrates the interquartile range and thus contains the middle 50% of the data
• The median is shown by the horizontal line across the box
• The whiskers extend to the largest & smallest values excluding the outlying values. The outlying values are those values more than 1.5 box lengths from the upper or lower edges. Those observations between 1.5 and 3 box lengths from the upper or lower edges of the box are outliers, whilst those more than 3 box lengths away are called extreme values
• Very useful when comparing several sets of data

Variance
• Based on the idea of averaging the distance each value is away from the mean x̄
• For an individual with an observed value xᵢ the distance from the mean is xᵢ - x̄
• With n observations we have a set of n such differences, one for each individual
• The sum of these distances, Σ(xᵢ - x̄), is always zero; however, if we square the distances before we sum them we get a positive quantity
• The average of these squared differences thus gives a measure of deviation from the mean
• This quantity is called the variance, and is defined as:

  variance = Σᵢ₌₁ⁿ (xᵢ - x̄)² / (n - 1)

Page 23 Standard deviation
• The variance is not a suitable measure for describing variability because it is not in the same units as the raw data (we don't want the variability of our set of blood pressure measurements expressed in square mmHg)
• The solution is to take the square root of the variance – the standard deviation (usually abbreviated to SD, s or σ), defined as:

  s = √[ Σᵢ₌₁ⁿ (xᵢ - x̄)² / (n - 1) ]

Calculating the inter quartile range for the blood pressure data before exercise
• When the quartile lies between 2 observations the easiest option is to take the mean (there are more complicated methods)
• Lower quartile is 133 mmHg
• Upper quartile is 149 mmHg
• IQR is 133 to 149 mmHg, or as a single number 16 mmHg
(Using the ranks 1-16 of the ordered BP values, 116 to 170 mmHg, as in the earlier table)

Calculation of variance and standard deviation for blood pressure data before exercise

Subject   Blood pressure (mmHg)   Difference from mean   Squared difference from mean
 1        148                       6.88                   47.27
 2        142                       0.88                    0.77
 3        136                      -5.13                   26.27
 4        134                      -7.13                   50.77
 5        138                      -3.13                    9.77
 6        140                      -1.13                    1.27
 7        132                      -9.13                   83.27
 8        144                       2.88                    8.27
 9        128                     -13.13                  172.27
10        170                      28.88                  833.77
11        162                      20.88                  435.77
12        150                       8.88                   78.77
13        138                      -3.13                    9.77
14        154                      12.88                  165.77
15        126                     -15.13                  228.77
16        116                     -25.13                  631.27
Totals (Sum)   2258                0                      2783.75

Mean = 141.13 mmHg, Variance = 185.58 mmHg², Standard Deviation = 13.62 mmHg

Exercise 4
Calculate the following measures of spread for the difference (after - before) blood pressure data (using the full sample of 16 subjects):
(a) range
(b) inter quartile range

Using only the first five subjects (1 to 5), calculate the standard deviation.
Note: you will need to recalculate the mean blood pressure after exercise for these five subjects.

Why use mean and SD?
• For many variables in health sciences:
  – mean ± SD covers 68% of the distribution
  – mean ± 2 SDs covers about 95% of the distribution
• The mean ± 2 SDs is called the 'normal range' or the reference range

The Normal distribution
• Bell shaped and symmetrical
• 68% of the observations lie within 1 SD of the mean
• About 95% of the observations lie within 2 SDs of the mean
• Mean and median will coincide
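For completeness, a small Python sketch (illustrative, not from the handout) that reproduces the variance and standard deviation worked out in the table above (185.58 mmHg² and 13.62 mmHg):

```python
import statistics

# Before-exercise systolic blood pressures (mmHg), as in the table above
bp_before = [148, 142, 136, 134, 138, 140, 132, 144,
             128, 170, 162, 150, 138, 154, 126, 116]

variance = statistics.variance(bp_before)   # sample variance, divisor n-1: ~185.58 mmHg^2
sd = statistics.stdev(bp_before)            # square root of the variance: ~13.62 mmHg

print(round(variance, 2), round(sd, 2))
```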

Page 24 The Normal distribution example
[Histogram of heights for 100 randomly chosen men, with Normal curve added]

Positively skewed distribution
[Sketch with the mode, median and mean marked: the long tail is to the right, with the mean pulled towards the tail]

Negatively skewed distribution
[Sketch with the mean, median and mode marked: the long tail is to the left]

[Histogram of age in years (n=233), showing negative skew, and histogram of weight in kilograms (n=218), showing positive skew]

Choosing appropriate summary measures
• Choosing the most appropriate summary measures for a set of observations depends on the shape of their distribution
• If symmetrical, use the mean and standard deviation
• If skewed, the median and inter quartile range are more appropriate, as they are less influenced by the extreme values

Reference (normal) Ranges
• Reference range gives limits within which we would expect the majority of data for the population to lie
  – Often used in biochemistry
  – 5% of population outside normal range
  – Need 'normal' population
  – Child Growth reference charts are an example
• For Normally distributed data the reference range is given as the mean +/- 1.96 SD and we would expect approximately 95% of the data to lie between these two values

Page 25 Reference ranges for non-Normally distributed data
• If data are not Normally distributed then we can base the reference range on the observed percentiles of the sample (empirical reference range), i.e. 95% of the observed data lie between the 2.5 and 97.5 percentiles
• Most clinical reference ranges are based on samples larger than 500 people, and usually on healthy subjects

Weight chart for breast-fed infants
Cole, Paul & Whitehead. Weight reference charts for British long-term breast fed infants. Acta Paediatr 2002; 91: 1296-1300.
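A minimal Python sketch (illustrative only) of the two ways of constructing a reference range described above: the Normal-theory range (mean ± 1.96 SD) and the empirical range based on the observed 2.5th and 97.5th percentiles. The simulated 'healthy population' values are an assumption used purely to make the example runnable.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated haemoglobin-like measurements for a 'healthy population' (assumption)
values = rng.normal(loc=14.0, scale=1.2, size=1000)

# Normal-theory reference range: mean +/- 1.96 SD
mean, sd = values.mean(), values.std(ddof=1)
normal_range = (mean - 1.96 * sd, mean + 1.96 * sd)

# Empirical reference range for non-Normal data: observed 2.5th and 97.5th percentiles
empirical_range = tuple(np.percentile(values, [2.5, 97.5]))

print(normal_range, empirical_range)
```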

Summary
• Always start by plotting your data to get a 'feel' for your data (see the shape of the distribution of the data)
• For numerical data a measure of location (average) and variability (spread) are of interest when summarising observations
• For categorical data present the number in each category or express it as a percentage of the total number of patients

Session recap
You should now be able to:
• Understand about different types of data
• Display data using stem & leaf plots, histograms, bar charts and box & whisker plots
• Calculate the summary measures: mean, median, mode, range, interquartile range, standard deviation
• Understand elementary properties of the Normal distribution
• Distinguish between positive and negative skew

Recommended Texts

Bland M (2000). An Introduction to Medical Statistics. 3rd Ed. Oxford Medical Publications.
Hart A (2001). Making Sense of Statistics in Health Care. Radcliffe Medical Press.
Petrie A and Sabin C (2005). Medical Statistics at a Glance. 2nd Ed. Oxford: Blackwell.
Campbell M J, Machin D & Walters S J (2007). Medical Statistics: A Textbook for the Health Sciences. John Wiley.
Swinscow T D V & Campbell M J (2002). Statistics at Square One. 10th Ed. Blackwell BMJ Books.

Questions?

Page 26 Formula for the mean

The mean (x̄, 'x-bar') of n observations x₁, ..., xₙ is

  x̄ = (Σᵢ₌₁ⁿ xᵢ) / n

where Σ is the Greek capital letter sigma, the summation symbol (sum the values from i = 1 to n), xᵢ is the i-th observation and n is the number of observations.

Formula for the variance

The variance (usually abbreviated to var, s² or σ²) is defined as:

  var = s² = Σᵢ₌₁ⁿ (xᵢ - x̄)² / (n - 1)

The units of variance are the original units squared, e.g. g/dl² for haemoglobin. Therefore we usually use the standard deviation.

Formula for the standard deviation

The standard deviation (usually abbreviated to SD, s or σ) is defined as the square root of the variance:

  s = √[ Σᵢ₌₁ⁿ (xᵢ - x̄)² / (n - 1) ]

Page 27 Sampling with confidence

SAMPLING WITH CONFIDENCE
Dr Jenny Freeman

Last week we looked at:
• Displaying data using bar charts, pie charts, dot plots, stem & leaf plots, histograms and box & whisker plots
• The summary measures: mean, median, mode, range, interquartile range, standard deviation
• Elementary properties of the Normal distribution
However, displaying data is not everything; you will also need to consider what it is that you are comparing.

• During the last session we were comparing a single observation to a reference group
• This session we will be looking at samples and will want to see how they compare to a reference group

Sampling With Confidence
At the end of the session you should be able to:
• Distinguish between a population and a sample
• Define different methods of sampling
• Calculate and understand what is meant by the term standard error (SE)
• Understand the concept of repeated sampling and its applicability to a single sample
• Understand what is meant by the term confidence interval and how to interpret one

• It is rare that we look at the whole population (census etc.)
• More usually we have samples, and from these we calculate certain quantities, such as the mean and standard deviation, which we then use to make inferences about the population of interest
• Quantities calculated for samples are known as sample estimates, and they are used to estimate population quantities (parameters)
  – Mean HbA1c in people with diabetes
  – Proportion of people with diabetes

Population – all individuals in which we are interested
Sample – group of individuals drawn from our population, which we study in order to learn about the population
The results from our sample are used as the best estimate of what's true for the population (but the sample needs to be representative of the population)

Page 28 Population and Sample (diagram): a sample is drawn from the population by some sampling mechanism (random sample or convenience sample); the sample estimate of the population parameter, with its confidence interval, is used to make inferences about the population parameter

Methods of Sampling
• Convenience sampling
  – All patients available at a particular point in time assumed random
  – Eg last 20 patients to consult about diabetes
• Random sample
  – All members of the population are equally likely to be picked, independently of each other
  – E.g. patients picked at random from a list of all patients on the GP register with diabetes
• Stratified random sample
  – Population is divided into groups beforehand, then randomly sampled within those groups
  – E.g. males and females, or by age
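The sampling methods above can be illustrated with a short Python sketch (not from the handout; the register of 300 patient IDs and the alternating sex labels are made-up assumptions), showing a simple random sample and a stratified random sample.

```python
import random

random.seed(1)

# Hypothetical register of 300 patient IDs on a GP diabetes register (assumption)
register = list(range(1, 301))

# Simple random sample: every patient equally likely to be chosen
simple_sample = random.sample(register, k=20)

# Stratified random sample: split into groups first (here a made-up sex label),
# then draw a random sample within each group
sex = {pid: ("F" if pid % 2 else "M") for pid in register}
females = [pid for pid in register if sex[pid] == "F"]
males = [pid for pid in register if sex[pid] == "M"]
stratified_sample = random.sample(females, k=10) + random.sample(males, k=10)

print(sorted(simple_sample))
print(sorted(stratified_sample))
```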

Population parameter vs sample statistic
                      Population parameter   Sample statistic
Mean                  μ                      x̄
Standard deviation    σ                      s
Proportion            π                      p

But what we want to know is: how good is the sample mean as an estimate of the true population mean?

The Clinical Problem
• Our doctor needs to do an audit of how the practice is performing with regard to diabetes care
• There are 300 patients with Type II Diabetes at the practice. With this many patients it would be time consuming to examine the notes for all of them
• So does our doctor need to measure all of the patients? ... of course not!

The Clinical Problem cont…
• We looked at the blood glucose results from all 300 patients with diabetes for the purposes of this example
• We then took some random samples of the results
• One sample (A) was just 20 of the 300 patients
• Another (B) was 50 patients
• And a third (C) was 100 patients

Random samples
• Sampling can give an estimate of the true values within a group
• The larger the sample, the better the estimate
• Using confidence intervals, derived from standard errors, we can quantify how good an estimate a sample result is likely to be

Page 29 Random samples
• The average HbA1c (as a percentage of total Hb) values for the samples were:
  – Sample A (20 patients) 7.4%
  – Sample B (50 patients) 6.48%
  – Sample C (100 patients) 6.65%
• But which of these best estimates the mean value for all 300 patients?
• How certain – or uncertain – are we?

So how good is the sample mean as an estimate of the true population mean?
To answer this we need to assess the uncertainty of our sample mean:
• Different samples can give different estimates of the population mean
• If we take repeated samples (of the same size) we get a spread of sample means, which we can display visually in a dot plot (if the number is small enough), boxplot or histogram
• The variability (spread) of these sample means gives us an indication of the uncertainty of our single sample mean

Repeated random samples
Let's redo each sample 50 times and see what results we get.
[Dot plot of mean HbA1c for 50 repeated samples of size 20, 50 and 100 per sample]

Properties of the distribution of the sample means
• The mean of all the sample means will be the same as the population mean
• The standard deviation of all the sample means (not individual values!) is known as the STANDARD ERROR
• The distribution of sample means will be roughly Normal regardless of the distribution of the variable, given a large enough sample size (Central Limit Theorem)
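The repeated-sampling idea can be demonstrated with a small simulation. This Python sketch (illustrative; the simulated population of 300 HbA1c values is an assumption, not the practice data) draws 50 samples of each size and shows that the spread of the sample means roughly agrees with SD/√n.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated 'population' of 300 HbA1c values (an assumption for illustration only)
population = rng.normal(loc=6.7, scale=1.3, size=300)

for n in (20, 50, 100):
    # Draw 50 repeated samples of size n and record each sample mean
    sample_means = [rng.choice(population, size=n, replace=False).mean()
                    for _ in range(50)]
    observed_spread = np.std(sample_means, ddof=1)        # spread of the sample means
    theoretical_se = population.std(ddof=1) / np.sqrt(n)  # SD / sqrt(n)
    print(f"n={n}: spread of means {observed_spread:.3f}, SD/sqrt(n) {theoretical_se:.3f}")
```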

What is the Standard Error?
• In practice we cannot repeatedly sample from the population
• What we want to know is how likely we are (in our single sample) to have captured what is going on in the population
• And so, when we have only one sample we calculate the STANDARD ERROR
• The standard error (SE) is an estimate of the precision of the population parameter estimate that doesn't require lots of repeated samples. It is used to help determine how far from the true value (population parameter) the sample estimate is likely to be.

What is the Standard Error?
• As we saw in the previous figure: as the sample size increases, the approximation of the sample to the population improves, i.e. the spread of the sample means gets smaller
• Thus, all other things being equal, we would expect estimates to get more precise and the value of the SE to decrease as the sample size increases

  standard error of the mean = s / √n

Page 30 Standard Error (of the difference between two sample means)
Comparison of two independent samples is common, and for this we need to know the standard error of the difference between the two sample means:

  SE(x̄₁ - x̄₂) = √( s₁²/n₁ + s₂²/n₂ )

Standard Error (of the difference between two proportions)
If we have two observed proportions, p₁ and p₂, from two independent samples, then the SE of their difference, p₁ - p₂, is given by:

  SE(p₁ - p₂) = √( p₁(1-p₁)/n₁ + p₂(1-p₂)/n₂ )

Random samples
            Sample A   Sample B   Sample C
Size        20         50         100
Mean        7.4        6.48       6.65
SD          1.30       1.24       1.32
SE          0.29       0.18       0.13

Standard Deviation versus Standard Error?
• The standard deviation quantifies the spread of individuals
• The standard error quantifies the spread of the mean
  – The standard error is sometimes called the standard deviation of the mean
• Standard deviation is for description and describes the variability of the data
• Standard error is for estimation and describes the precision of the mean
Standard DEVIATION – DESCRIBING
Standard ERROR – ESTIMATING
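The standard error formulas above translate directly into code. A minimal Python sketch (added for illustration), checked against the diabetic/non-diabetic blood pressure example that appears later in this handout (SDs 18.5 and 16.8 with 100 in each group):

```python
from math import sqrt

def se_diff_means(s1, n1, s2, n2):
    """Standard error of the difference between two independent sample means."""
    return sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

def se_diff_props(p1, n1, p2, n2):
    """Standard error of the difference between two independent proportions."""
    return sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

print(round(se_diff_means(18.5, 100, 16.8, 100), 2))   # about 2.50 mmHg
```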

Confidence Intervals
• The sample mean is the best estimate we have of the true population mean
• However, we need to assess how good the sample mean is as an estimate of the true population mean
  ⇒ Standard Error (measure of the precision of an estimate of a population parameter)
• The distribution of sample means of many samples of the same size will be approximately Normally distributed regardless of the distribution of the variable – the Central Limit Theorem
• From this we can use the sample mean and its standard error to construct a confidence interval – a range of values within which the population mean is likely to lie

Properties of the Normal Distribution
• Bell shaped and symmetrical (total area under the curve = 1)
• Any position along the horizontal axis can be expressed as a number of SDs away from the mean
• The mean and median will coincide
• About 68% of the observations will lie within 1 SD of the mean
• More importantly, about 95% of the observations will lie within approximately 2 SDs of the mean (and conversely 5% of observations lie more than 2 SDs away from the mean)

Page 31 5% of observations lie outside the mean ± 1.96SD or p = 0.05 Confidence Intervals

• Confidence intervals give limits in which we are confident (in terms of probability) that the true population parameter lies

• Describe the variability surrounding the sample point estimate

• In general, they depend upon making assumptions about the data

Confidence Intervals
A range of values which will include the true population mean with probability 0.95.

95% Confidence Interval for the mean:
  x̄ ± 1.96 × SE(of the mean)

95% Confidence Interval for the difference in two means:
  (x̄₁ - x̄₂) ± 1.96 × SE(of the difference in the means)

(n.b. 1.96 is often rounded to 2)

95% Confidence Interval
• For example a 95% CI means that if you could sample an infinite number of times:
  – 95% of the time the CI would contain the true population parameter
  – 5% of the time the CI would fail to contain the true population parameter
• Alternatively: it gives a range of values that will include the true population value for 95% of all possible samples

Example: CI for the difference between two population means
Blood pressure levels were measured in 100 diabetic and 100 non-diabetic men aged 40-49 years. The mean systolic blood pressures were 146.4 mmHg (SD 18.5) among the diabetics and 140.4 mmHg (SD 16.8) among the non-diabetics.

The difference between the two independent sample means is:
  (mean diabetics - mean non-diabetics) = 146.4 - 140.4 = 6.0 mmHg

The standard error of the difference between the two independent sample means is:
  SE(x̄₁ - x̄₂) = √( s₁²/n₁ + s₂²/n₂ )
  SE(diff) = √{ (18.5²/100) + (16.8²/100) } = 2.50 mmHg

Page 32 Example: CI for the difference between two population means
The 95% CI for the population difference in the two population means is then given by:
  6.0 - (1.96 × 2.50) to 6.0 + (1.96 × 2.50)
  = 1.1 to 10.9 mmHg
Therefore we are 95% confident that the true population mean difference in systolic blood pressure between diabetics and non-diabetics lies somewhere between 1.1 and 10.9 mmHg, but our best estimate is 6.0 mmHg.

Example: Confidence intervals for a difference between two proportions
Response to treatment was assessed among 160 patients randomised to either treatment A or treatment B; the results are shown below:

                   Treatment A   Treatment B
Improvement        61            45
No improvement     19            35
Total              80            80

The proportion of patients who improved was:
  On treatment A: pA = 61/80 = 0.76
  On treatment B: pB = 45/80 = 0.56

Example: Confidence intervals for a difference between two proportions
The difference in the proportion of patients who improved on treatments A & B was:
  pA - pB = 0.76 - 0.56 = 0.20

The standard error (of the difference in two proportions) was:
  SE(pA - pB) = √( 0.76(1-0.76)/80 + 0.56(1-0.56)/80 ) = 0.073

The 95% CI for the difference between two population proportions is given by:
  0.20 - (1.96 × 0.073) to 0.20 + (1.96 × 0.073)
  = 0.06 to 0.34
Therefore we are 95% confident that the true population difference in the proportions who improve on treatments A and B lies somewhere between 0.06 and 0.34, but our best estimate of this difference is 0.2.
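Both worked examples can be reproduced with a few lines of Python (a sketch added for illustration, using the figures quoted above):

```python
from math import sqrt

# 95% CI for the difference in mean systolic BP (diabetics - non-diabetics)
diff_means = 146.4 - 140.4                               # 6.0 mmHg
se_means = sqrt(18.5 ** 2 / 100 + 16.8 ** 2 / 100)       # about 2.50 mmHg
ci_means = (diff_means - 1.96 * se_means, diff_means + 1.96 * se_means)
print(ci_means)                                          # roughly (1.1, 10.9) mmHg

# 95% CI for the difference in proportions improving on treatments A and B
p_a, p_b, n = 61 / 80, 45 / 80, 80
se_props = sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)   # about 0.073
ci_props = ((p_a - p_b) - 1.96 * se_props, (p_a - p_b) + 1.96 * se_props)
print(ci_props)                                          # roughly (0.06, 0.34)
```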

Confidence Intervals for blood pressure constructed from 100 random samples of size 16
Seven do not include 141.1 mmHg – we would expect that the 95% CI will not include the true population mean 5% of the time.

Confidence Intervals for samples A, B, C
          Sample A        Sample B        Sample C
Size      20              50              100
Mean      7.4%            6.48%           6.65%
SD        1.30            1.24            1.32
SE        0.29            0.18            0.13
95% CI    6.82 to 7.98    6.12 to 6.84    6.39 to 6.91

Page 33 Confidence Interval or Reference Range?
• During the last session we looked at creating reference ranges from a 'population' of blood results
• This was constructed by describing a range between two SDs below the mean and two SDs above the mean (mean ± twice the SD)
• A confidence interval describes the precision of a sample estimate and is constructed for the mean by describing a range between the (mean – twice the SE) and the (mean + twice the SE)
Standard DEVIATION – DESCRIBING
Standard ERROR – ESTIMATING

[Figure: Mean HbA1c (%) values for the three groups (20, 50 and 100 per sample), together with 95% confidence interval for the mean value; the cut-off for good control and the true population mean are marked]

Confidence Intervals
• For the group of size 20, though the estimate is about 7.4, it could be as low as 6.8 or as high as 7.9
• Whereas, for the group of size 100, the limits are much closer, so that though the best estimate is 6.6, the range of plausible values is between 6.4 and 6.9, i.e. a much closer range than for the smaller sample
• So, which would we choose?
  – It's clear from the graph that the confidence interval for the sample of size 20 does not include the true mean and so this would not be a good choice (though in practice we don't know this)
  – For our purposes 50 would probably have been adequate, as the confidence interval is smaller and does not include the cut-off value of 7 (just by chance!)
  – The size of sample that is chosen depends upon many things, including the purpose of the investigation, and how accurately you want to estimate the quantity of interest

The Clinical Problem
• Our doctor needed to do an audit of how well the practice was performing with regard to diabetes care
• With 300 patients it would have been time consuming to examine the notes for all of them
• Thus, rather than look at the entire population, a sample could be taken to get an estimate of the true value in the population

The Clinical Problem
• Using confidence intervals, derived from standard errors, it is possible to quantify how good an estimate a sample result is likely to be
• The larger the sample, the better the estimate
• Examination of 20 patients' notes suggested that on average the patients with diabetes were poorly controlled
• But in fact, if she could have tested all 300, she would have realised they were actually doing very well

Page 34 Summary (1)
• Can use samples to make estimates of population quantities: the results from our sample provide the best estimate of what's true for the population
• Different methods of sampling include random sampling, stratified random sampling and cluster sampling. Best to use random sampling so as to minimise bias
• When we use the sample mean/proportion we need to consider how good an estimate it is of the true population mean/proportion

Summary (2)
• Can estimate the precision of the sample estimates using the standard error
• However we don't use the standard error on its own; we use it to construct a confidence interval
• Can use confidence intervals to say how confident we are about our sample estimate. A confidence interval is a range of values in which we are confident the true population mean/proportion will lie
• The most common confidence interval is the 95% CI. We would only expect our range not to include the true population mean/proportion 5% of the time

Sampling With Confidence
You should now be able to:
• Distinguish between a population and a sample
• Define different methods of sampling
• Calculate and understand what is meant by the term standard error (SE)
• Understand the concept of repeated sampling and its applicability to a single sample
• Understand what is meant by the term confidence interval and how to interpret one

Questions?

Page 35 Estimation and hypothesis testing

ESTIMATION & HYPOTHESIS TESTING
Dr Jenny Freeman

At the end of the session, you should know about:
• The process of setting and testing statistical hypotheses
At the end of the session, you should be able to:
• Explain:
  – Null hypothesis
  – P-value, and what different values mean
  – Type I error
  – Type II error
• Understand what is meant by the term Power
• Demonstrate awareness that the p-value does not give the probability of the null hypothesis being true
• Demonstrate awareness that p>0.05 does not mean that we accept the null hypothesis
• Distinguish between 'statistical significance' and 'clinical significance'

Statistical Analysis (1)
• Previously we discussed why we take samples rather than study the whole population
  – We examine the behaviour of a sample as it is often not feasible to look at the entire population
• From a sample we want to make inferences about the population from which it is drawn
  – We do this by a process of statistical hypothesis testing: formulating a hypothesis and testing it
• This session we will look at how you formulate and test a hypothesis
  – You are not expected to know about individual tests, but need to understand the concept of setting and testing statistical hypotheses

Statistical Analysis (2): Population and Sample (diagram)
A sample is drawn from the population by some sampling mechanism (random sample or convenience sample); the sample estimate of the population parameter, with its confidence interval, is used to make inferences about the population parameter

Statistical Analysis (3)
• The main aim of statistical analysis is to use the information gained from a sample of individuals to make inferences about the population of interest.
• There are two basic approaches to statistical analysis:
  – Estimation (confidence intervals)
  – Hypothesis testing (p-values)

Hypothesis testing: the main steps
Set null hypothesis → Set study (alternative) hypothesis → Carry out significance test → Obtain test statistic → Compare test statistic to hypothesised critical value → Obtain p-value → Make a decision

Page 36
State your hypotheses (H0 & H1)
• State your null hypothesis (H0): the statement you are looking for evidence to disprove.
• State your study (alternative) hypothesis (H1 or HA).
• Often statistical analyses involve comparisons between different treatments (e.g. standard and new); however, we assume the treatment effects are equal until proven otherwise. We test to see how likely the effect we have obtained would be if there truly is no difference between groups.
• Therefore the null hypothesis is usually the negation of the research hypothesis that the new treatment will differ in effect from the standard treatment (i.e. the research hypothesis).

Asthma Example*
• A recent randomised controlled trial of 259 patients with chronic stable asthma, treated with a high dose of inhaled corticosteroids.
• Randomised to either control (no change in therapy) or step-down group (50% reduction of therapy).
• After one year the patients were compared for differences in their asthma related events, health status and corticosteroid dosage.

NB: It is easier to disprove things than prove them.
*Taken from 'Stepping down inhaled corticosteroids in asthma: randomised controlled trial'. Hawkins et al, BMJ, 326, 1115-1120.

Asthma example
• What is the research question?
• What is the outcome variable?
• What is the null hypothesis?
• What is the alternative hypothesis?

Carry out significance test
• Calculate a test statistic using your data (reduce your data down to a single value). Exactly how this is done will vary depending upon the test you use (e.g. t-test, chi-squared test).
• Compare this test statistic to a hypothesised critical value (using the distribution we would expect if the null hypothesis is true, e.g. the t distribution or chi-squared distribution) to obtain a p-value.

Asthma Example: results (1)
Figure: results for asthma exacerbations and asthma related events data, showing the percentage of patients in the control and step-down groups with exacerbations, GP visits and GP home visits.

Asthma Example: results (2)
Figure: percentage difference between the two groups (step-down minus control), together with the confidence interval for the difference, for exacerbations, GP visits and GP home visits.

Page 37
Asthma example: results (4)
Table: Number of asthma exacerbations and asthma related events in the study groups. Values are numbers (%) of patients.
                           Control (n=129)   Stepdown (n=130)   Chi-squared test statistic*   P-value for result
Asthma exacerbations       33 (26)           40 (31)            0.624                         0.430
Asthma related events:
  Visit to GP              41 (32)           45 (35)            0.124                         0.725
  Home visit by GP         6 (5)             3 (2)              0.477                         0.490
*: continuity corrected, on 1 d.f.
A significance test of the difference between the two groups with respect to asthma exacerbations gave a p-value of 0.430, whilst for visits to the GP and home visits the p-values were 0.725 and 0.490 respectively.

Asthma example: results (5)
In addition to the asthma events data, the researchers also examined whether there were differences in the corticosteroid dosages between the two groups at the end of the study.
Table: Mean daily dose of inhaled corticosteroid (sd) by group
                                         Control (n=129)   Stepdown (n=130)   Mean difference (95% CI)
Daily inhaled corticosteroid dose (μg)   1415 (631)        1067 (518)         348 (202 to 494)
A significance test of the difference between the two groups with respect to daily dosage gave a p-value of less than 0.001.

Making a decision (1)
• When making a decision you can either decide to reject the null hypothesis or not reject the null hypothesis.
• Whatever you decide, you may have chosen correctly and:
  – rejected the null hypothesis, when in fact it is false
  – not rejected the null hypothesis, when in fact it is true
• Or you may have chosen incorrectly and:
  – rejected the null hypothesis, when in fact it is true (false positive)
  – not rejected the null hypothesis, when in fact it is false (false negative)

Making a decision (2)
                                                   The null hypothesis is actually:
                                                   False (there is a difference      True (there is no difference
                                                   in the population)                in the population)
You decide to:
Reject the null hypothesis                         Correct                           False positive / type I error / α
(conclude it is false and there is a difference)
Not reject the null hypothesis                     False negative /                  Correct
(conclude it is not false and there is             type II error / β
no difference)

The probability of rejecting the null hypothesis when it is actually false is called the POWER of the study (Power = 1 − β). It is the probability of concluding that there is a difference when a difference truly exists.
A p-value is the probability of obtaining your results, or results more extreme, if the null hypothesis is true. It is the probability of committing a false positive error, i.e. of rejecting the null hypothesis when in fact it is true.

Making a decision (3)
• Use your p-value to make a decision about whether to reject, or not reject, your null hypothesis.
• P-value small: your results are unlikely when the null hypothesis is true. P-value large: your results are likely when the null hypothesis is true.
• A p-value can range from 0 to 1. But how small is small? The significance level is usually set at 0.05, so if the p-value is less than this value we reject the null hypothesis.

Statistical significance (1)
We say that our results are statistically significant if the p-value is less than the significance level (α), set at 5%.
• P ≤ 0.05: result is statistically significant. Decide that there is sufficient evidence to reject the null hypothesis and accept the alternative hypothesis.
• P > 0.05: result is not statistically significant. Decide that there is insufficient evidence to reject the null hypothesis.
We cannot say that the null hypothesis is true, only that there is not enough evidence to reject it.

Page 38
Statistical significance (2)
• The significance level is usually set at 5%.
• The level is conventional rather than fixed.
• Sometimes, for stronger proof, we require a significance level of 1% (or P<0.01).

Misinterpretation of P-values (1)
• A common misinterpretation of the P-value is that it is:
  – the probability of the data having arisen by chance
  – the probability that the observed effect is not a real one
• The distinction between this incorrect definition and the true definition is the absence of the phrase 'when the null hypothesis is true'.

Misinterpretation of P-values (2)
• The omission of 'when the null hypothesis is true' leads to the incorrect belief that it is possible to evaluate the probability of the observed effect being a real one.
• The observed effect in the sample is genuine, but we do not know what is true in the population.
• All we can do with this approach to statistical analysis is to calculate the probability of observing our data (or data more extreme) when the null hypothesis is true.

Asthma Example: making a decision
• A p-value is the probability of obtaining your results, or results more extreme, if the null hypothesis is true.
• For the asthma related events data the p-values were as follows: asthma exacerbation 0.354; visits to GP 0.629; GP home visits 0.304.
• We conclude that there is insufficient evidence to reject the null hypothesis of no difference between the groups. The results are not statistically significant at the 5% level.
• We do not conclude that the null hypothesis is true, only that there is insufficient evidence to reject it.

Asthma Example: making a decision
• For the daily dosage data the p-value was as follows: daily corticosteroid dosage < 0.001.
• We reject the null hypothesis and conclude that there is a difference between the groups with respect to daily dosage. The result is statistically significant at the 5% level.

Recap: making a decision
Set null hypothesis → Set study (alternative) hypothesis → Carry out significance test → Obtain test statistic → Compare test statistic to hypothesised critical value → Obtain p-value → Make a decision

Page 39
Limitations of a hypothesis test
• All that we know from a hypothesis test is how likely the difference we observed is, given that the null hypothesis is true.
• The results of a significance test do not tell us what the difference is or how large the difference is.
• To answer this we need to supplement the hypothesis test with a confidence interval, which will give us a range of values in which we are confident the true population mean difference will lie.

Statistical & Clinical Significance (1)
• A clinically significant difference is one that is big enough to make a worthwhile difference.
• Statistical significance does not necessarily mean the result is clinically significant.
• Supplementing the hypothesis test with an estimate of the effect and a confidence interval will indicate the magnitude of the result. This will help the investigators to decide whether the difference is of interest clinically.

Statistical & Clinical Significance (2) and (3)
Figure: five studies (A to E) plotted against the null difference and the clinically important difference; in (2) only the p-values are given, in (3) the 95% confidence intervals are added.
• Study A: not statistically significant (p ≥ 0.05) and not clinically significant
• Study B: statistically significant (p < 0.05) but not clinically significant
• Study C: not statistically significant (p ≥ 0.05) but (possibly) clinically significant
• Study D: statistically significant (p < 0.05) and (possibly) clinically significant
• Study E: statistically significant (p < 0.05) and clinically significant

Statistical & Clinical Significance (4)
• With a large enough sample the smallest of changes may be statistically significant but not clinically important.
• If the sample size of the study is too small and has low power, a clinically significant result may not be regarded as statistically significant.
• Therefore it is important that the size of the sample is adequate to detect the clinically significant result, at the 5% significance level, with at least 80% power (something to look for in the methods section when reading the literature).

Relationship between confidence intervals and statistical significance (1)
• There is a close relationship between hypothesis testing and confidence intervals.
• If the 95% CI does not include zero (or, more generally, the value specified in the null hypothesis) then a hypothesis test will return a statistically significant result.
• If the 95% CI does include zero then the hypothesis test will return a non-significant result.

Page 40
Relationship between confidence intervals and statistical significance (2)
• We are 95% certain that the CI includes the true value.
  – Thus there is a 5% probability that the true value lies outside the CI.
  – If the CI does not include zero there is a less than 5% probability that the true value is zero.
• The p-value represents the probability that you conclude there is a difference when in fact there is no difference.
  – Thus when p=0.05 there is a 5% probability that we conclude there is a difference when in fact there is no difference, i.e. there is a 5% probability that the true value is zero.

Relationship between confidence intervals and statistical significance (3)
• The CI shows the most likely size of the difference, given the data, and the uncertainty or lack of precision around this difference. The p-value alone tells you nothing about the size of the difference nor its precision. Thus the CI conveys more useful information than the p-value alone.
  – e.g. whether a clinician will use a new treatment that reduces blood pressure will depend on the amount of that reduction and how consistent the effect is.
• So the presentation of both the p-value and the confidence interval is desirable.

Summary
• Research questions need to be turned into a statement for which we can find evidence to disprove: the null hypothesis.
• The study data are reduced down to a single probability: the probability of observing our result, or one more extreme, if the null hypothesis is true (P-value).
• We use this P-value to decide whether to reject or not reject the null hypothesis.
• But we need to remember that 'statistical significance' does not necessarily mean 'clinical significance'.
• Confidence intervals should always be quoted with a hypothesis test to give the magnitude and precision of the difference.

You should now know about:
• The process of setting and testing statistical hypotheses
You should now be able to:
• Explain: the null hypothesis; the P-value; Type I error; Type II error; power
• Demonstrate awareness that the p-value does not give the probability of the null hypothesis being true
• Demonstrate awareness that p>0.05 does not mean that we accept the null hypothesis
• Distinguish between 'statistical significance' and 'clinical significance'

One-sided vs two-sided significance testing
• Two-sided: does not specify the direction of any effect
  – There is a difference between treatment A and treatment B
• One-sided: specifies the direction of the effect
  – Treatment A is better than treatment B

Questions?

Page 41

One-sided significance testing
• One-sided tests are rarely appropriate, even when there is a strong prior belief as to the direction of the effect, as by doing a one-sided test you do not allow for the possibility of finding an effect in the opposite direction to the one you are testing
• This is similar to history taking, when it is important not to ask leading questions in case you miss the correct diagnosis

• The decision to do one-sided tests must be made before the data are analysed; it must not depend on the outcome of the study

• An example of when a one-sided test might be appropriate is in clinical trials looking at bioequivalence

Page 42 Living with risk

LIVING WITH RISK
Dr Jenny Freeman

At the end of the session, you should know about:
• Measures of risk
• Different ways to describe risk
At the end of the session, you should be able to:
• Explain: risk; relative risk; odds and odds ratios; absolute risk reduction and risk excess; number needed to treat
• Understand that odds ratios are sometimes used to describe risk
• Be familiar with the concept of risk ladders
• Demonstrate awareness that presenting the same risk in different ways may affect how patients (and doctors) perceive risk

• There is now a 'language of risk' for both health professionals and patients.
• You need to be educated about risk and understand what is meant by different measures of risk.
• You will need to be able to communicate the concept of risk effectively to patients.

Terminology
• There are several terms used to describe 'risk': risk / probability / chance.
• These are the same, but each has an implied meaning, e.g. we talk about the chance of winning the lottery, not the risk of winning, but what we really mean is the probability of winning.

Risk
The risk of an event is the probability that an event will occur within a stated time period (P). This is sometimes referred to as the absolute risk.
Examples
The risk of developing anaemia during pregnancy for a particular group of pregnant women would be the number of women who develop anaemia during pregnancy divided by the total number of pregnant women in the group.
The risk of a further stroke occurring in the year following an initial stroke would be the number who have another stroke within a year divided by the total number of stroke patients being followed up.

The clinical problem
Details of the fax:
"This fax is to inform you of the withdrawal yesterday of Cerivastatin (Baycol/Lipobay) by Bayer. You will be receiving more information directly from Bayer in the near future."
• Bayer announced the withdrawal of all dosages of the preparation with immediate effect.
• The FDA reported that the rate of fatal rhabdomyolysis was 16-80 times more frequent for Baycol compared to any other statin.

Page 43
The clinical problem
"The reason for this voluntary action lies in increasing reports of side effects involving muscular weakness (rhabdomyolysis), especially in patients who have been treated concurrently with the active substance gemfibrozil, despite a contraindication and warnings contained in the product information."
Quote taken from the Bayer website: www.news.bayer.com/news/news.nsf/ID/01-0219

About statins
• Cerivastatin was one of the more commonly used anti-cholesterol drugs.
• Many trials showed a link between high cholesterol and increased risk of heart attacks.
• Lowering cholesterol can lower the risk of heart attacks…

Absolute Risk
The risk of an event is the probability that an event will occur within a stated time period (P). This is sometimes referred to as the absolute risk.

Relative Risk (RR)
Ratio of the risk in the exposed group to the risk in the not exposed group (P exposed / P unexposed).
Pregnancy example: Relative risk of anaemia for pregnant women compared to non-pregnant women of a similar age = risk of developing anaemia for pregnant women divided by the risk of developing anaemia for non-pregnant women of a similar age.
Stroke example: Relative risk of further stroke for patients who have had a stroke compared to patients who have not had a stroke = risk of a stroke within one year post stroke divided by the risk of having a stroke in a year for a similar group of patients who have not had a stroke.

Example 1
Below are the results of a clinical trial examining whether patients with chronic fatigue syndrome (CFS) improved six weeks after treatment with intramuscular magnesium. The group who received the magnesium were compared to a group who received a placebo, and the outcome was feeling better.
                        Magnesium   Placebo
Felt better             12          3
Did not feel better     3           14
Total                   15          17
'Risk' of improvement on magnesium = 12/15 = 0.80
'Risk' of improvement on placebo = 3/17 = 0.18
Relative risk = 0.80/0.18 = 4.5 (of improvement on magnesium therapy compared to placebo)
Thus patients on magnesium therapy are about four and a half times more likely to feel better than patients on placebo.

Example 2
A group of subjects are followed up over time. Some are exposed to a hazard, some are not. In both groups some will develop a disease (an event), some will not.
                                    Exposed   Not exposed
Number who develop disease          a         b
Number who do not develop disease   c         d
Total                               a+c       b+d
Risk of developing disease for the exposed = a / (a+c)
Risk of developing disease for the unexposed = b / (b+d)
Relative risk of developing disease for the exposed compared to the unexposed = {a/(a+c)} / {b/(b+d)} = a(b+d) / b(a+c)
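As an illustration (not part of the original handout), these risks and the relative risk can be computed directly from the trial counts in Python:

# Risk and relative risk for the magnesium example above.
felt_better_mag, total_mag = 12, 15
felt_better_plac, total_plac = 3, 17

risk_mag = felt_better_mag / total_mag      # 0.80
risk_plac = felt_better_plac / total_plac   # approx 0.18
relative_risk = risk_mag / risk_plac        # approx 4.5

print(round(risk_mag, 2), round(risk_plac, 2), round(relative_risk, 1))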

Page 44
Clinical problem
For our example, if we imagine 1,000,000 prescriptions for Baycol and 10,000,000 prescriptions for other statins:
                                         Baycol       Other statins
Number who die from rhabdomyolysis       2            1
Number alive or died from other causes   999,998      9,999,999
Total                                    1,000,000    10,000,000
'Risk' of dying on Baycol = 2 / 1,000,000
'Risk' of dying on other statins = 1 / 10,000,000
Relative risk of dying from rhabdomyolysis for the Baycol patients compared to other patients on statins = (2/1,000,000) / (1/10,000,000) = 20
i.e. a patient on Baycol is 20 times more likely to die from rhabdomyolysis compared to a patient on other statins.

The clinical problem
In the same report, the FDA said:
"The reporting rate for fatal rhabdomyolysis with Baycol monotherapy (1.9 deaths per million prescriptions) was 10 to 50 times higher than for other statins."
• So we're talking about a risk of 1.9 deaths per MILLION PRESCRIPTIONS…
• Other statins have a risk of about a 50th of this, i.e. 4 per 100,000,000 prescriptions.

Points to consider
• As with many estimated quantities, it is possible to calculate a confidence interval for the relative risk.
• For the Baycol example the 95% confidence interval for the relative risk is (16 to 80). This means that although we estimate that patients are 20 times more likely to die of rhabdomyolysis on Baycol than on other statins, it is possible that this relative risk could be as low as 16 times or as high as 80 times, with 95% certainty.

The clinical problem
So this quote:
"The FDA reports that the rate of fatal rhabdomyolysis is 16-80 times more frequent for Baycol as compared to any other statin."
is referring to a relative risk comparing Baycol to other similar drugs.
But why is it written as a range?

Points to consider
• What does a relative risk of 1 mean?
  – That there is no difference in risk in the two groups.
  – For our magnesium example, if the relative risk was 1, it would mean that patients are as likely to feel better on magnesium as on placebo.
  – If there was no difference between the groups the confidence interval would include 1.

Issues with RR: defining success
                 Treatment A   Treatment B
Success          0.96          0.99
Failure          0.04          0.01
If the outcome of interest is success then RR = 0.96/0.99 = 0.97
If the outcome of interest is failure then RR = 0.04/0.01 = 4
Always consider all the risks.

Page 45
• There are several ways of comparing risk.
• So far we have talked about absolute and relative risk, i.e. the risk of an event occurring in one group relative to the risk of it occurring in another.
• When considering a particular 'risk' it is important to know whether relative or absolute risk is being presented, as this influences the way in which it is interpreted.

Absolute risk reduction / excess
It is the absolute additional risk of an event due to an exposure: the risk in the exposed group minus the risk in the unexposed (or differently exposed) group.
Absolute risk reduction (ARR) = P exposed − P unexposed
If the absolute risk is increased by an exposure we sometimes use the term absolute risk excess (ARE).
So the absolute risk (of dying from rhabdomyolysis) in patients on Baycol was 2 in 1,000,000. This is 0.000002.
In patients using other statins the absolute risk (of dying from rhabdomyolysis, and not something else) was 1 in 10,000,000. This is 0.0000001.
Thus the ARE is 0.000002 − 0.0000001 = 0.0000019.

0.0000019
Since this is such a small number and hard to deal with, we would often multiply it by a larger number (say a million) and present it as the number per million. In this case 0.0000019 is the same as saying an absolute risk excess of 1.9 deaths per million (prescriptions of statins) for Baycol compared to other statins. Thus in absolute terms this risk is very small.

Example
From the previous example comparing magnesium therapy and placebo:
                        Magnesium   Placebo
Felt better             12          3
Did not feel better     3           14
Total                   15          17
'Risk' of improvement on magnesium = 12/15 = 0.80
'Risk' of improvement on placebo = 3/17 = 0.18
Absolute risk reduction = 0.80 − 0.18 = 0.62

Number Needed to Treat (to benefit) / Number Needed to Treat to Harm
• This is the additional number of people you would need to give a new treatment to in order to cure one extra person compared to the old treatment.
• Alternatively, for a harmful exposure, the number needed to treat becomes the number needed to treat to harm: it is the additional number of individuals who need to be exposed to the risk in order to have one extra person develop the disease, compared to the unexposed group.
Number needed to treat = 1 / ARR
Number needed to harm = 1 / ARR, ignoring the negative sign.

Example
From the previous example comparing magnesium therapy and placebo:
                        Magnesium   Placebo
Felt better             12          3
Did not feel better     3           14
Total                   15          17
'Risk' of improvement on magnesium = 12/15 = 0.80
'Risk' of improvement on placebo = 3/17 = 0.18
Absolute risk reduction = 0.80 − 0.18 = 0.62
Number needed to treat (to benefit) = 1/0.62 = 1.6, which rounds up to 2.
Thus on average one would have to give magnesium to 2 patients in order to expect one extra patient (compared to placebo) to feel better.
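A small Python sketch (added for illustration, not part of the original handout) of the ARR and NNT calculation for this example:

# Absolute risk reduction and number needed to treat for the magnesium example.
import math

risk_mag = 12 / 15      # 0.80
risk_plac = 3 / 17      # approx 0.18

arr = risk_mag - risk_plac     # approx 0.62
nnt = math.ceil(1 / arr)       # 1/0.62 is approx 1.6; round up to whole patients

print(round(arr, 2), nnt)      # 0.62 and 2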

Page 46
0.0000019
• An absolute risk excess of 1.9 deaths per million (prescriptions of statins) for Baycol compared to other statins.
• The number needed to treat to harm would be 1 / 0.0000019 = 526,315.
• So to cause one additional death from rhabdomyolysis with Baycol we would need to make out over half a million prescriptions for Baycol.

Issues with NNT: always consider all the risks
P A      P B      P B − P A   NNT   RR
0.0001   0.1001   0.10        10    1001
0.0010   0.1010   0.10        10    101
0.0100   0.1100   0.10        10    11
0.0500   0.1500   0.10        10    3
0.1000   0.2000   0.10        10    2
0.2500   0.3500   0.10        10    1.4
0.5000   0.6000   0.10        10    1.2
Note that the NNT stays the same while the relative risk varies greatly with the baseline risk.

Odds and Odds Ratio (1)
ODDS
The odds of an event is the ratio of the probability of occurrence of the event to the probability of non-occurrence:
    Odds = P / (1 − P)
ODDS RATIO (OR)
Ratio of the odds for the exposed group to the odds for the not exposed group:
    OR = {P exposed / (1 − P exposed)} / {P unexposed / (1 − P unexposed)}

Odds and Odds Ratio (2)
                            Exposed   Not exposed
Number with disease         a         b
Number without disease      c         d
Total                       a+c       b+d
Odds of developing disease for the exposed = {a/(a+c)} / {c/(a+c)} = a/c
Odds of developing disease for the not exposed = {b/(b+d)} / {d/(b+d)} = b/d
Odds ratio of developing disease for the exposed compared to the unexposed = (a/c) / (b/d) = ad / bc

Example
From the previous example comparing magnesium therapy and placebo:
                        Magnesium   Placebo
Felt better             12          3
Did not feel better     3           14
Total                   15          17
Odds of improvement on magnesium = 0.80 / (1 − 0.80) = 4.0
Odds of improvement on placebo = 0.18 / (1 − 0.18) = 0.21
Odds ratio = 4.0 / 0.21 = 19.0 (of magnesium compared to placebo)
Thus, for those who improved, they were 19 times more likely to be in the magnesium group than the placebo group (95% CI: 3.2 to 110.3). Compare this with the relative risk.

Relative Risk and Odds Ratio
• The odds ratio can be interpreted as a relative risk when an event is rare, and the two are often quoted interchangeably.
  Relative risk = a(b+d) / b(a+c)
  Odds ratio = ad / bc
• This is because when the event is rare, (b+d) → d and (a+c) → c, thus the relative risk → odds ratio.
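For illustration (not part of the original handout), the odds and odds ratio can be computed either from the proportions or directly from the 2x2 cell counts:

# Odds and odds ratio for the magnesium example.
p_mag, p_plac = 12 / 15, 3 / 17

odds_mag = p_mag / (1 - p_mag)          # 4.0
odds_plac = p_plac / (1 - p_plac)       # approx 0.21
odds_ratio = odds_mag / odds_plac       # approx 18.7 (the handout rounds to 19)

# equivalently ad/bc from the table: a=12, b=3, c=3, d=14
odds_ratio_cells = (12 * 14) / (3 * 3)  # approx 18.7

print(round(odds_ratio, 1), round(odds_ratio_cells, 1))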

Page 47
Relative Risk and Odds Ratio
• For case-control studies it is not possible to calculate the RR, and thus the odds ratio is used.
• For cross-sectional studies both can be derived, and if it is not clear which is the causal variable and which is the outcome, you should use the odds ratio as it is symmetrical, in that it gives the same answer if the causal and outcome variables are swapped.
• Odds ratios have mathematical properties which make them more often quoted for formal statistical analyses.

The clinical problem
• Although Baycol seemed to be more 'risky' than other statins, it doesn't seem very risky!
• How does a risk of 1 in 500,000 compare with other risks?

Calman Chart (BMJ, 1996): Risk per year
Term used    Risk range                Example                                                       Actual risk
High         >1:100                    Transmission to susceptible household contacts of measles*   1:1 to 1:2
Moderate     1:100-1:1,000             Death because of smoking 10 cigarettes per day                1:200
Low          1:1,000-1:10,000          Death from road accident                                      1:8,000
Very low     1:10,000-1:100,000        Death from leukaemia                                          1:12,000
                                       Death from accident at home                                   1:26,000
Minimal      1:100,000-1:1,000,000     Death from rail accident                                      1:500,000   (Baycol sits here)
Negligible   <1:1,000,000              Death from lightning strike                                   1:10,000,000
                                       Death from radiation from nuclear power station               1:10,000,000
* risk of transmission after contact (not death per year)
…it probably won't be you!

You should now know about:
• Measures of risk
• Different ways to describe risk
You should now be able to:
• Explain: risk; relative risk; odds and odds ratios; absolute risk reduction and risk excess; number needed to treat
• Understand that odds ratios are sometimes used to describe risk
• Be familiar with the concept of risk ladders
• Demonstrate awareness that presenting the same risk in different ways may affect how patients (and doctors) perceive risk

Questions?

Page 48

Categorical data

CATEGORICAL DATA
Dr Jenny Freeman

At the end of the session, you should know about:
• Approaches to analysis for bivariate categorical data
At the end of the session, you should be able to:
• Recognise categorical data
• Construct a simple frequency table
• Compare a single proportion to some pre-specified value, and compare two proportions
• Know how to analyse data expressed in frequency tables: 2x2 tables, 2xk tables, large tables with ordered categories

Statistical Analysis
• The main aim of statistical analysis is to use the information gained from a sample of individuals to make inferences about the population of interest.
• There are two basic approaches to statistical analysis:
  – Estimation (confidence intervals)
  – Hypothesis testing (P-values)

Statistical analysis: what type of test?
Depends on:
• Aims and objectives
• Hypothesis to be tested
• Type of outcome data
• Distribution of outcome data
• Summary measure for outcome data

Statistical analysis: what type of test
• What was the main purpose of my study? What is the main question I want answering?
• What were the interventions, populations, and outcomes?
• Keep the analysis simple… but be aware of bias and confounding!
• Statistical analysis depends on the type of outcome data you have collected, e.g. categorical or numerical
• And your pre-determined study hypotheses!

Different approaches
• There are often several different approaches to even a simple problem.
• The methods described here and recommended for particular types of question may not be the only methods, and may not be universally agreed as the best method.
• However, these would usually be considered valid and satisfactory methods for the purposes for which they are suggested here.

Page 49

Preparing to analyse data
• Data collection
• Data transfer
• Data checking & cleaning
  – Logical checks
  – Range checks
• Missing data: why are data missing?

HYPOTHESIS TESTING: Main Steps
1. State your null hypothesis (H0) (the statement you are looking for evidence to disprove). State your alternative hypothesis (HA).
2. Obtain the probability of observing your results, or results more extreme, if the null hypothesis is true (P-value).
3. Use your P-value to make a decision about whether to reject, or not reject, your null hypothesis.

Categorical (Qualitative) Data
• Dichotomous (binary): two categories
  – Sex: male/female
  – Diabetic/non-diabetic
• Nominal (no natural ordering)
  – Blood groups: A, B, O, AB
  – Marital status: married/single/divorced/widowed
• Ordered categorical (ordinal)
  – Pain severity: mild, moderate, severe
  – Stage of disease

What do we mean when we talk about bivariate categorical data?
• Categorical: data divided into categories, which may or may not have a natural ordering
• Bivariate: two variables, e.g. blood group & eye colour, or gender & pain score

Example dataset
• Randomised controlled trial comparing a new treatment regime for leg ulcers with usual care.
• 233 patients with venous leg ulcers randomly allocated to intervention (120) or control (113) group.
• Weekly treatment with four-layer bandaging in a leg ulcer clinic (clinic group) or usual care at home by the district nursing service (control group).
• (Morrell CJ, Walters SJ & Dixon S et al. Cost effectiveness of community leg clinics: RCT. BMJ 1998; 316: 1487-91)

Comparing two proportions
Is there a difference in the leg ulcer healing rates at 12 weeks between the intervention (clinic) and control (home care) groups?

Page 50
Example 1: Data
                  Treatment
                  Clinic         Home           Total
Outcome:
Healed            22 (18%)       17 (15%)       39
Not healed        98 (82%)       96 (85%)       194
Total             120 (100%)     113 (100%)     233

Is there a difference in the leg ulcer healing rates at 12 weeks between the intervention (clinic) and control (home care) groups? There are several different approaches to analysing these data:
• Compare the proportions in the two groups (technically speaking, using the Normal approximation to the Binomial distribution)
• Chi-squared test (with or without continuity correction)
• Fisher's exact test

Example 1: Data considerations
• Type of data? Ulcer healed is a dichotomous categorical outcome; two independent groups or samples.
• Best summary measure for the data? The proportion in each sample whose leg ulcer has healed.
• Best comparative summary measure? The difference in proportions healed.
• Most appropriate hypothesis test? Comparison of two proportions / chi-squared test / Fisher's exact test.

Example 1: Null & alternative hypothesis
State the null and alternative hypothesis:
H0: No difference in outcome (proportion of patients with healed leg ulcers) at 12 weeks between the Clinic & Home treated groups, i.e. p Clinic − p Home = 0.0. More generally: there is no association between the row and column variables in the r x c contingency table, i.e. they are independent.
HA: There is a difference in outcome (proportion of patients with healed leg ulcers) at 12 weeks between the Clinic & Home groups, i.e. p Clinic − p Home ≠ 0.0. More generally: there is an association between the row and column variables in the r x c contingency table, i.e. they are not independent or unrelated.

Example 1: Data
                  Treatment
                  Clinic         Home           Total
Outcome:
Healed            22 (18%)       17 (15%)       39
Not healed        98 (82%)       96 (85%)       194
Total             120 (100%)     113 (100%)     233

Consider the more general form of the table:
                  Gp 1           Gp 2           Total
Out 1             n1 p1          n2 p2          n p
Out 2             n1(1 − p1)     n2(1 − p2)     n(1 − p)
Total             n1             n2             n

where p1 is the proportion in group 1 with the outcome of interest, p2 is the proportion in group 2 with the outcome of interest, and p is the proportion in total with the outcome of interest.

Page 51
Difference in proportions: hypothesis test
The hypothesis test assumes that there is a common proportion, p, estimated by:
    p = (n1 p1 + n2 p2) / (n1 + n2)
and the standard error for the difference in proportions is estimated by:
    SE(p1 − p2) = √[ p(1 − p)(1/n1 + 1/n2) ]
The test statistic is:
    z = (p1 − p2) / SE(p1 − p2)
We can then compare this value to what would be expected under the null hypothesis of no difference, in order to get a P-value.

Calculations for hypothesis test
n1 = 120, p1 = 22/120 = 0.183
n2 = 113, p2 = 17/113 = 0.150
p1 − p2 = 0.033
p = (n1 p1 + n2 p2) / (n1 + n2) = ((120 × 0.183) + (113 × 0.150)) / (120 + 113) = 38.91 / 233 = 0.167
SE(p1 − p2) = √[ 0.167 × (1 − 0.167) × (1/120 + 1/113) ] = 0.049
From this we can compute the test statistic z:
    z = (p1 − p2) / SE(p1 − p2) = 0.033 / 0.049 = 0.673
This is then compared with the Normal distribution. When you do this you find that the P-value is 0.502 (i.e. the probability of observing z = 0.67, or a value more extreme, if the null hypothesis is true is 0.502).

Calculations for hypothesis test (continued)
• This gives us the probability of observing the test statistic z, or one more extreme, under the null hypothesis. [Figure: Normal distribution showing that the probability of observing the test statistic z = 0.67, or more extreme, under the null hypothesis is 0.502.]
• So we cannot reject the null hypothesis, i.e. there is insufficient evidence of a difference in 12-week ulcer healing rates between the groups.

Example 1: results
What does P = 0.502 mean? Your results are likely when the null hypothesis is true.
Is this result statistically significant? The result is not statistically significant because the P-value is greater than the significance level (α) set at 5% or 0.05.
You decide? That there is insufficient evidence to reject the null hypothesis. Therefore there is no reliable evidence of a difference in leg ulcer healing rates at 12 weeks between the Clinic and Home treated patient groups.
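For illustration (not part of the original handout), the two-proportion z test above can be reproduced in Python; scipy is assumed to be available for the Normal distribution:

# Two-proportion z test for the leg ulcer data.
import math
from scipy.stats import norm

n1, x1 = 120, 22   # clinic: total, healed
n2, x2 = 113, 17   # home: total, healed
p1, p2 = x1 / n1, x2 / n2

p_common = (x1 + x2) / (n1 + n2)
se = math.sqrt(p_common * (1 - p_common) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se                 # approx 0.67

p_value = 2 * norm.sf(abs(z))      # two-sided, approx 0.502
print(round(z, 2), round(p_value, 3))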

Difference in proportions: 95% confidence interval
The 95% confidence interval for the difference in proportions is:
    (p1 − p2) ± [1.96 × SE(p1 − p2)]
For the calculation of the confidence interval we do not need to make any assumptions about there being a common proportion p, and we use the following formula for the standard error:
    SE(p1 − p2) = √[ p1(1 − p1)/n1 + p2(1 − p2)/n2 ]
For the leg ulcer data:
    (0.033) − [1.96 × 0.049]  to  (0.033) + [1.96 × 0.049]
    = −0.063 to 0.129
Therefore we are 95% confident that the true population difference in the proportion of leg ulcers healed at 12 weeks between the Clinic and Home treated patients lies somewhere between −6.3% and 12.9%, but our best estimate is 3.3%.
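A short Python check (added for illustration; not part of the original handout) of this interval, using the unpooled standard error described above:

# 95% CI for the difference in proportions (leg ulcer data).
import math

n1, p1 = 120, 22 / 120
n2, p2 = 113, 17 / 113

se_diff = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
diff = p1 - p2
lower, upper = diff - 1.96 * se_diff, diff + 1.96 * se_diff

# approx -0.063 to 0.128; the handout, using the pooled SE of 0.049, gives -0.063 to 0.129
print(round(lower, 3), round(upper, 3))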

Page 52
• This approach is only valid when the sample is large enough for the Normal approximation to apply: as a rule of thumb, np and n(1−p) should both exceed 5 in both samples, where n = total number of individuals, p = proportion of individuals with the condition (irrespective of group), and 1−p = proportion of individuals without the condition (irrespective of group).
• In addition, thinking of it as a difference only makes sense for 2x2 tables, i.e. where there are only two groups and two outcomes.

Alternative approach: Chi-squared test (χ2 test)
Situation:
• Two independent unordered categorical variables that form an r x c contingency table (NB the current example is a 2x2 contingency table)
• At least 80% of expected cell counts > 5
• All expected cell counts ≥ 1

Alternative approach
• Under the null hypothesis, both treatments will have the same effect and the same proportion of leg ulcers healed at 12 weeks.
• We can calculate what proportion this would be, and the expected cell counts based on the expected proportion of leg ulcers healed at 12 weeks.
• We base them on the overall counts we have observed in the study.

                  Treatment
                  Clinic         Home           Total
Outcome:
Healed            22 (18%)       17 (15%)       39
Not healed        98 (82%)       96 (85%)       194
Total             120 (100%)     113 (100%)     233

• The best estimate of the common 'ulcer healing' rate at 12 weeks is the one for the total study, 39/233 = 16.7%.
• We can use this to calculate the expected number healed for each group, and the expected number not healed for each group.
• E.g. with 120 in the Clinic group, we expect 16.7% of their leg ulcers to have healed by 12 weeks, which is 20.1.

There is an equation based on this idea that helps us calculate the expected frequency:
    Expected frequency = (row total × column total) / N

Page 53

Filling in the expected (E) counts alongside the observed (O) counts:
                           Clinic              Home
Result                     O        E          O        E          Total
Leg ulcer healed           22       20.1       17       18.9       39
at 12 weeks
Leg ulcer not healed       98       99.9       96       94.1       194
at 12 weeks
Total                      120                 113                 233

Overall the total proportion with leg ulcers healed is 39/233 = 0.167 (16.7%), and the proportion not healed is 194/233 = 0.833 (83.3%). Assuming that there is no difference between the two groups:
• we would expect 16.7% of people in the clinic group to have healed leg ulcers, i.e. 120 × (39/233) = 20.1
• we would expect 16.7% of people in the home group to have healed leg ulcers, i.e. 113 × (39/233) = 18.9
• we would expect 83.3% of people in the clinic group to have unhealed leg ulcers, i.e. 120 × (194/233) = 99.9
• we would expect 83.3% of people in the home group to have unhealed leg ulcers, i.e. 113 × (194/233) = 94.1

• We can then compare what we have observed with what we would have expected under the null hypothesis of no difference.
• If what we have observed is very different from what we would have expected, then we reject the null hypothesis.

Page 54
Steps:
• First calculate the expected value for each cell (E).
• Then calculate the difference between the observed value and the expected value for each cell.
• Square each difference and divide the resultant quantity by the expected value for that cell.
• Sum all of these to get a single number, the χ2 (chi-squared) statistic.
• Compare this number with tables of the chi-squared distribution with the following degrees of freedom: (no. of rows − 1) × (no. of columns − 1).

Formula for the χ2 statistic:
    χ2 = Σ (O − E)2 / E, summed over all r × c cells of the table
Compare this number with tables of the chi-squared distribution with (no. of rows − 1) × (no. of columns − 1) degrees of freedom.
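For illustration (not part of the original handout), these steps can be carried out in a few lines of Python, with the uncorrected Pearson value from scipy shown for comparison:

# Chi-squared statistic for the leg ulcer 2x2 table, following the steps above.
from scipy.stats import chi2, chi2_contingency

observed = [[22, 17],   # healed: clinic, home
            [98, 96]]   # not healed: clinic, home

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n   # expected = row total x column total / N
        chi_sq += (o - e) ** 2 / e

p_value = chi2.sf(chi_sq, df=1)                 # (2-1) x (2-1) = 1 degree of freedom
print(round(chi_sq, 3), round(p_value, 3))      # approx 0.452 and 0.502

stat, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(stat, 3), round(p, 3))              # should agree with the values above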

Chi-squared test

                    O      E      O − E    (O − E)2   (O − E)2/E
Healed/clinic       22     20.1   1.9      3.61       0.180
Not healed/clinic   98     99.9   −1.9     3.61       0.036
Healed/home         17     18.9   −1.9     3.61       0.191
Not healed/home     96     94.1   1.9      3.61       0.038
Total               233    233    0                   0.445

• This value can be compared with tables for the chi-squared distribution on 1 df.
• Under the null hypothesis of no association, the probability of observing this value of the test statistic X2, or more, is about 0.502.

SPSS Output
Group * Leg ulcer healed at 12 weeks Crosstabulation
                          Not healed   Healed   Total
Clinic   Count            98           22       120
         Expected Count   99.9         20.1     120.0
Home     Count            96           17       113
         Expected Count   94.1         18.9     113.0
Total    Count            194          39       233
         Expected Count   194.0        39.0     233.0

Chi-Square Tests
                               Value   df   Asymp. Sig. (2-sided)   Exact Sig. (2-sided)   Exact Sig. (1-sided)
Pearson Chi-Square             .452b   1    .502
Continuity Correction a        .247    1    .620
Likelihood Ratio               .453    1    .501
Fisher's Exact Test                                                 .599                   .310
Linear-by-Linear Association   .450    1    .502
N of Valid Cases               233
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 18.91.

Example 1: Answers
What does P = 0.502 mean? Your results are likely when the null hypothesis is true.
Is this result statistically significant? The result is not statistically significant because the P-value is greater than the significance level (α) set at 5% or 0.05.
You decide? That there is insufficient evidence to reject the null hypothesis. Therefore there is no reliable evidence of a difference in leg ulcer healing rates at 12 weeks between the Clinic and Home treated patient groups.

Comparing the two approaches
• Thinking of it as a difference in proportions only makes sense for 2 x 2 tables.
• Both methods give identical p-values.
• If you used the two different approaches on the same 2 x 2 table, you would find that the χ2 value on 1 df was exactly the square of the z test statistic.
• This is not an accident: the χ2 distribution was defined this way.

Page 55
Problems with small numbers
• If more than 20% of expected cell counts are less than 5, then the test statistic does not approximate a chi-squared distribution.
• If any expected cell counts are <1 then we cannot use the chi-squared distribution.
• In large tables we may have to combine categories to make bigger numbers (providing it's meaningful).

• In 2 x 2 tables, even when expected cell counts are bigger than 5, the mathematical approximations are not that great.
• We will reject the null hypothesis too often on average.
• We can use Yates' continuity correction.
• Altman (1991) recommends the use of Yates' correction for all chi-squared tests on 2 x 2 tables.
Yates' continuity-corrected statistic:
    χ2c = Σ (|O − E| − 0.5)2 / E

Chi-squared test with continuity correction
                    O      E      |O − E| − 0.5   (|O − E| − 0.5)2   (|O − E| − 0.5)2/E
Healed/clinic       22     20.1   1.4             1.96               0.098
Not healed/clinic   98     99.9   1.4             1.96               0.020
Healed/home         17     18.9   1.4             1.96               0.104
Not healed/home     96     94.1   1.4             1.96               0.021
Total               233    233                                       0.243

• Again this value can be compared with tables for the chi-squared distribution on 1 df.
• Under the null hypothesis of no association, the probability of observing this value of the test statistic X2cc, or more, is about 0.620.

Fisher's exact test
• In a 2 x 2 table, when expected cell counts are smaller than 5, or any are <1, even Yates' correction does not work.
• We go back to definitions of basic probability and estimate the probability of falsely rejecting the null hypothesis directly, based on all the possible tables you could have observed.
• Very time-consuming by hand!
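For illustration (not part of the original handout), Fisher's exact test for the same table can be obtained from scipy; the two-sided p-value should be close to the 0.599 reported in the SPSS output:

# Fisher's exact test for the leg ulcer 2x2 table.
from scipy.stats import fisher_exact

table = [[22, 17],
         [98, 96]]

odds_ratio, p_two_sided = fisher_exact(table, alternative='two-sided')
print(round(odds_ratio, 2), round(p_two_sided, 3))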

SPSS Output: the same Crosstabulation and Chi-Square Tests tables as shown above. The continuity-corrected result appears on the 'Continuity Correction' row (0.247 on 1 df, p = 0.620), and Fisher's exact test on the 'Fisher's Exact Test' row (0.599 two-sided, 0.310 one-sided).

Chi-squared test for trend
• In a 2 x 3+ table, when the variable with 3+ categories is ordered, we can (and should) take the ordering into account.
• In a similar manner to the ANOVA, the variation can be split into that attributable to linear trend and the rest.
• In the 2 x 3+ case, this test can be retrieved from SPSS output on the line marked 'Linear-by-Linear Association'.

Page 56
Summary (1)
• Research questions need to be turned into a statement for which we can find evidence to disprove: the null hypothesis.
• The study data are reduced down to a single probability: the probability of observing our result, or one more extreme, if the null hypothesis is true (P-value).
• We use this P-value to decide whether to reject or not reject the null hypothesis.
• But we need to remember that 'statistical significance' does not necessarily mean 'clinical significance'.
• Confidence intervals should always be quoted with a hypothesis test to give the magnitude and precision of the difference.

Summary (2)
• If we can think of the data as two proportions, e.g. the proportion in each group with a healed leg ulcer, then we can use the Normal approximation to the Binomial (i.e. assuming that the proportion is continuous, even though the underlying measure, healed/not healed, is binary).
• An alternative approach is to use the chi-squared test, either with or without a continuity correction. N.B. Altman recommends always using the continuity correction.
• If the assumptions underlying the chi-squared test are not met, then we can use Fisher's exact test.

You should now know about:
• Approaches to analysis for bivariate categorical data
You should now be able to:
• Recognise categorical data
• Construct a simple frequency table
• Compare a single proportion to some pre-specified value, and compare two proportions
• Know how to analyse data expressed in frequency tables: 2x2 tables, 2xk tables, large tables with ordered categories

Page 57

Simple tests for continuous data handout

SIMPLE STATISTICAL TESTS
Dr Jenny Freeman (acknowledgements to Stephen Walters)

Recap: so far we have covered:
• Types of data and how to describe and display data
• Principles of sampling, the standard error and confidence intervals
• The process of setting and testing a statistical hypothesis & p-values
• Analysis of bivariate categorical data

Types of Data
Categorical (qualitative):
• Nominal (no natural ordering): haemoglobin types; sex
• Ordered categorical (ordinal): anaemic / borderline / not anaemic; grades of breast cancer
Quantitative (numerical):
• Count (can only take certain values): number of positive tests for anaemia; number of children in a family
• Continuous (limited only by accuracy of instrument): haemoglobin concentration (g/dl); height

What is the Standard Error?
• What we want to know is how likely we are (in our single sample) to have captured what is going on in the population.
• And so, we calculate the STANDARD ERROR.
• The standard error (se) is used to help determine how far from the true value (population parameter) the sample estimate is likely to be.

What is the Standard Error? (continued)
• As the sample size increases, the approximation of the sample to the population improves.
• Thus, all other things being equal, we would expect estimates to get more precise and the value of the se to decrease as the sample size increases.
    standard error = S / √n

Confidence Intervals
• Confidence intervals give limits in which we are confident (in terms of probability) that the true population parameter lies.
• They describe the variability surrounding the sample point estimate.
• In general, they depend upon making assumptions about the data.
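For illustration (not part of the original handout), a small Python sketch of the standard error and an approximate 95% confidence interval; the haemoglobin values below are made up for the example:

# Standard error of the mean and an approximate 95% CI for a small sample.
import math
import statistics

sample = [11.2, 12.8, 13.5, 10.9, 12.1, 13.0, 11.7, 12.4]   # illustrative values (g/dl)

mean = statistics.mean(sample)
sd = statistics.stdev(sample)             # sample standard deviation, S
se = sd / math.sqrt(len(sample))          # standard error = S / sqrt(n)

lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(round(mean, 2), round(se, 2), round(lower, 2), round(upper, 2))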

Page 58
Confidence Intervals
• For example, a 95% CI means that if you could sample an infinite number of times:
  – 95% of the time the CI would contain the true population parameter
  – 5% of the time the CI would fail to contain the true population parameter
• Alternatively: it gives a range of values that will include the true population value for 95% of all possible samples.

Hypothesis testing: the main steps
Set null hypothesis → Set study (alternative) hypothesis → Carry out significance test → Obtain test statistic → Compare test statistic to hypothesised critical value → Obtain p-value → Make a decision

P-values
• A p-value is the probability of obtaining your results, or results more extreme, if the null hypothesis is true.
• It is used to make a decision about whether to reject, or not reject, the null hypothesis.
• P-value small: the results are unlikely when the null hypothesis is true. P-value large: the results are likely when the null hypothesis is true.
• But how small is small? The significance level is usually set at 0.05. Thus if the p-value is less than this value we reject the null hypothesis.

We say that our results are statistically significant if the p-value is less than the significance level (α), set at 5%.
• P ≤ 0.05: result is statistically significant. Decide that there is sufficient evidence to reject the null hypothesis and accept the alternative hypothesis.
• P > 0.05: result is not statistically significant. Decide that there is insufficient evidence to reject the null hypothesis.
We cannot say that the null hypothesis is true, only that there is not enough evidence to reject it.

Today
In this session we will now be putting some of the theory into practice and looking at some of the more basic simple statistical tests that you will come across in the literature and in your own research.

Aims & Objectives
At the end of this session students should know about:
• Selecting appropriate hypothesis tests
• The difference between parametric and non-parametric tests
At the end of this session students should be able to:
• Understand which simple statistical tests to use & when:
  – Paired & unpaired t-test
  – Wilcoxon signed rank test (matched pairs test) & Mann-Whitney U test
  – 1-way analysis of variance (ANOVA) & Kruskal-Wallis test

Page 59
Statistical Analysis (2)
• The main aim of statistical analysis is to use the information gained from a sample of individuals to make inferences about the population of interest.
• There are two basic approaches to statistical analysis:
  – Estimation (confidence intervals)
  – Hypothesis testing (p-values)

Choosing the statistical method
• The choice of method of analysis for a problem depends on the comparison to be made and the data to be used.
• Today we will look at some of the basic methods appropriate for comparing two groups with a continuous outcome.

Two common problems
1. Comparison of paired data, e.g. the response of one group under different conditions as in a cross-over trial, or of matched pairs of subjects
2. Comparison of two independent groups, e.g. groups of patients given different treatments

Different approaches
• There are often several different approaches to even a simple problem.
• The methods described here and recommended for particular types of question may not be the only methods, and may not be universally agreed as the best method.
• However, these would usually be considered as valid and satisfactory methods for the purposes for which they are suggested here.

Statistical analysis: what type of test?
Depends upon:
• Aims and objectives: what is the main purpose of my study? What is the main question I want answering?
• Hypothesis to be tested
• What are the interventions, populations?
• The type of outcome data you have collected, e.g. categorical or numerical
• Distribution of the outcome data
• Summary measure for the outcome data
• Keep the analysis simple… but be aware of bias and confounding!

Parametric or non-parametric?
Typically statistical tests fall into two types:
• Parametric
• Non-parametric

Page 60
Parametric methods
• Assume data are distributed according to a particular distribution, e.g. the Normal distribution.
• More powerful than non-parametric tests when the assumptions about the distribution of the data are true.
• Examples include the t-test, analysis of variance, linear regression techniques.

Non-parametric methods
• Non-parametric methods provide alternative data analysis techniques without assuming anything about the shape of the data, i.e. they do not assume an underlying distribution for the data. Hence non-parametric methods are often referred to as 'distribution free' methods (e.g. data may be skewed, ranked or ordinal).
• Non-parametric techniques are usually based on ranks or signs.
• Applicable in small sample cases.
• Robust in the presence of potential outliers.
NB: it is the test that is non-parametric, NOT the data.

Non-parametric methods (cont.)
• Non-parametric methods should not be considered as an alternative way to find significant P-values!
• Non-parametric methods are based on ranks of the data and not the actual data.
• Non-parametric methods are used when:
  – the data do not seem to follow any particular shape or distribution (e.g. Normal)
  – the assumptions underlying a parametric test are not met
  – a plot of the data appears to be very skewed
  – there are potential outliers in the dataset

Paired or unpaired data
It is vital to distinguish between paired data and independent (unpaired) groups:
• Paired data: the same individuals studied at two different times (or individually matched)
• Independent: data collected from two separate groups

Example dataset
• Randomised controlled trial comparing a new treatment regime for leg ulcers with usual care.
• 233 patients with venous leg ulcers randomly allocated to intervention (120) or control (113) group.
• Weekly treatment with four-layer bandaging in a leg ulcer clinic (clinic group) or usual care at home by the district nursing service (control group).
• (Morrell CJ, Walters SJ & Dixon S et al. Cost effectiveness of community leg clinics: RCT. BMJ 1998; 316: 1487-91)

Main outcome measures:
• Time to complete ulcer healing
• Ulcer status at 3 and 12 months
• Ulcer free weeks
• Recurrence of ulcers
• Patient health-related quality of life (HRQoL) at baseline, 3 months and 12 months

Page 61
Two examples today
Example 1: comparison of the response of one group under different conditions, as in a cross-over trial, or of matched pairs of subjects: continuous outcome
Example 2: comparison of two independent groups, e.g. groups of patients given different treatments: continuous outcome

Main outcome measures (as above): time to complete ulcer healing; ulcer status at 3 and 12 months; ulcer free weeks; recurrence of ulcers; patient health-related quality of life (HRQoL) at baseline, 3 months and 12 months

HRQoL
• HRQoL was measured using the general health dimension of the SF-36.
• This outcome is scored on a 0 (poor health) to 100 (good health) scale.

Example 1: comparison of the response of one group under different conditions as in a cross-over trial, or of matched pairs of subjects
Is there a change in HRQoL between baseline and three months follow-up in those patients whose leg ulcer had healed at three months (irrespective of treatment group)?

Example 1: data considerations
• Type of data? HRQoL at baseline and 3 months are both continuous, and paired.
• Best summary measure for the data? The mean.
• Best comparative summary measure? The mean of the paired differences.
• Distribution of data? Normal? (distribution of the paired differences)
• Most appropriate hypothesis test? Paired t-test.

Example 1: Null & alternative hypothesis
State the null and alternative hypothesis:
H0: No difference (or change) in mean HRQoL between baseline and 3 months follow-up in patients whose leg ulcer had healed by 3 months, i.e. μ 3-month follow-up − μ baseline = 0.0
HA: There is a difference (or change) in mean HRQoL between baseline and 3 months follow-up in patients whose leg ulcer had healed by 3 months, i.e. μ 3-month follow-up − μ baseline ≠ 0.0
Note: in this case the two groups are paired and not independent (measurements are made on the same individuals). Therefore we are interested in the mean of the differences, not the difference between the means of two independent groups.

Page 62
Paired t-test
Two groups of paired observations, x11, x12, …, x1n in group 1 and x21, x22, …, x2n in group 2, such that x1i is paired with x2i and the difference between x1i and x2i is di.

Steps:
1. Calculate the differences di = x1i − x2i, i = 1 to n
2. Calculate the mean d̄ and standard deviation, sd, of the differences di
3. Calculate the standard error of the mean difference: SE(d̄) = sd / √n
4. Calculate the test statistic t = d̄ / SE(d̄)
5. Under the null hypothesis, t is distributed as Student's t with n − 1 degrees of freedom

Assumptions:
• The di's are plausibly Normally distributed (note it is not essential for the original observations to be Normally distributed)
• The di's are independent of each other
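As an illustration only (not part of the original handout), the following Python sketch carries out these five steps on the first seven pairs from the worked example below and checks the result against scipy's built-in paired t-test; numpy and scipy are assumed to be available.

# Paired t-test sketch (first seven pairs from the worked example; illustrative only)
import numpy as np
from scipy import stats

baseline = np.array([87, 97, 10, 87, 47, 87, 65])       # QoL at baseline
three_months = np.array([87, 67, 10, 87, 42, 82, 30])   # QoL at 3 months

d = baseline - three_months                # step 1: paired differences
d_bar = d.mean()                           # step 2: mean of the differences
s_d = d.std(ddof=1)                        #          and their standard deviation
se = s_d / np.sqrt(len(d))                 # step 3: standard error of the mean difference
t = d_bar / se                             # step 4: test statistic
p = 2 * stats.t.sf(abs(t), df=len(d) - 1)  # step 5: two-sided P-value on n - 1 df

t_check, p_check = stats.ttest_rel(baseline, three_months)  # same result in one call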

The data

Individual | QoL at baseline | QoL at 3 months | Difference (di) | di − mean | (di − mean)²
1 | 87 | 87 | 0 | −7.33 | 53.78
2 | 97 | 67 | 30 | 22.67 | 513.78
3 | 10 | 10 | 0 | −7.33 | 53.78
4 | 87 | 87 | 0 | −7.33 | 53.78
5 | 47 | 42 | 5 | −2.33 | 5.44
6 | 87 | 82 | 5 | −2.33 | 5.44
7 | 65 | 30 | 35 | 27.67 | 765.44
… | … | … | … | … | …
36 | 50 | 60 | −10 | −17.33 | 300.44
Total | 2386 | 2122 | 264 | 0.00 | 9,533

Mean difference = 264/36 = 7.33
Standard deviation of the differences, sd = √(9,533/35) = 16.5

SPSS output: Paired Samples Statistics
Pair 1 | Mean | N | Std. Deviation | Std. Error Mean
SF-36 General health dimension: baseline | 66.3 | 36 | 18.8 | 3.1
SF-36 General health dimension: 12 weeks follow-up | 58.9 | 36 | 22.0 | 3.7
NB: The mean difference = 7.3

Calculations
d̄ = 7.3
sd = 16.5
SE of difference = SE(d̄) = sd / √n = 16.5 / √36 = 2.8
t = 7.3 / 2.8 = 2.66 on 36 − 1 = 35 df
The probability of observing this test statistic, or one more extreme, under the null hypothesis is 0.012 using the t distribution on 35 df.

SPSS Output: Paired Samples Test
Pair 1: SF-36 General health dimension: baseline − SF-36 General health dimension: 12 weeks follow-up
Paired differences: Mean 7.3 | Std. Deviation 16.5 | Std. Error Mean 2.8 | 95% CI of the Difference 1.7 to 12.9 | t 2.661 | df 35 | Sig. (2-tailed) .012

Page 63
Example 1: results
What does P = 0.012 mean?
–This result is unlikely when the null hypothesis is true
Is this result statistically significant?
–The result is statistically significant because the P-value is less than the significance level (α) set at 5% or 0.05
You decide?
–That there is sufficient evidence to reject the null hypothesis and accept the alternative hypothesis that there is a difference or change in mean HRQoL between baseline and 3 months follow-up in patients whose leg ulcer had healed by 3 months

Confidence interval for mean difference
The 100(1 − α)% confidence interval for the true mean difference in the population is
d̄ − [t1−α/2 × SE(d̄)] to d̄ + [t1−α/2 × SE(d̄)]
where d̄ = Σ(x1i − x2i)/n
and t1−α/2 is taken from the t distribution with n − 1 degrees of freedom

Example 1: CI for a mean
Given that α = 0.05 and t1−α/2 = 2.03 on 35 df, the 95% CI for the mean difference is given by:
7.3 − (2.030 × 2.8) to 7.3 + (2.030 × 2.8)
= 1.7 to 12.9
Therefore we are 95% confident that the true population mean difference or change in HRQoL between baseline and 3 months, in patients whose leg ulcer had healed by then, lies somewhere between 1.7 and 12.9, but our best estimate is 7.3

Wilcoxon (Matched Pairs) Signed Rank Test
• Non-parametric equivalent of the paired t-test
• Used when the assumptions underlying the paired t-test are not valid
• It is a test of the null hypothesis that there is no tendency for the outcome under one set of conditions (in this current example - at the start of the study) to be higher or lower than under the comparison set of conditions (in this current example - after 3 months follow-up)
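For comparison, a minimal sketch (scipy assumed; the data are again the first seven pairs from the worked example, so this is purely illustrative) of running the Wilcoxon signed rank test in Python:

# Wilcoxon (matched pairs) signed rank test sketch (illustrative only)
from scipy import stats

baseline = [87, 97, 10, 87, 47, 87, 65]       # QoL at baseline (first seven pairs)
three_months = [87, 67, 10, 87, 42, 82, 30]   # QoL at 3 months

# Null hypothesis: no tendency for scores under one condition to be higher or
# lower than under the other; pairs with a zero difference are dropped by default.
result = stats.wilcoxon(baseline, three_months)
print(result.statistic, result.pvalue)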

SPSS output: Ranks
SF-36 GENERAL HEALTH DIMENSION: 3 months − SF-36 GENERAL HEALTH DIMENSION: baseline
Negative Ranks | N 22 | Mean Rank 16.11 | Sum of Ranks 354.50 (3 months < baseline)
Positive Ranks | N 8 | Mean Rank 13.81 | Sum of Ranks 110.50 (3 months > baseline)
Ties | N 6 (3 months = baseline)
Total | N 36

Test Statistics (Wilcoxon Signed Ranks Test, based on positive ranks)
Z = −2.511, Asymp. Sig. (2-tailed) = .012
P value: the probability of observing the test statistic, or one more extreme, under the null hypothesis.

Example 1: results
What does P = 0.012 mean?
–This result is unlikely when the null hypothesis is true
Is this result statistically significant?
–The result is statistically significant because the P-value is less than the significance level (α) set at 5% or 0.05
You decide?
–That there is sufficient evidence to reject the null hypothesis and accept the alternative hypothesis that there is a tendency for HRQoL to differ between baseline and 3 months follow-up in patients whose leg ulcer had healed by 3 months

Page 64
Example 2: comparison of two independent groups, e.g. groups of patients given different treatments: continuous outcome
Is there a difference in ulcer free weeks between the Intervention & Control Groups?

Example 2: Data considerations
• Type of data?
–Ulcer free weeks is continuous
–Two independent groups or samples
• Best summary measure for data?
–Mean
• Best comparative summary measure?
–Mean difference
• Distribution of data?
–Normal?
• Most appropriate hypothesis test?
–Two independent samples t-test

Example 2: Null & Alternative hypothesis
State the null and alternative hypothesis:
H0: No difference in mean ulcer free weeks between intervention & control groups, i.e. μIntervention − μControl = 0.0 weeks
HA: There is a difference in mean ulcer free weeks between intervention and control groups, i.e. μIntervention − μControl ≠ 0.0 weeks
Note that: In this case the two groups are independent (measurements are not on the same individuals). Therefore we are interested in the difference between the means of each group.

Parametric approach: Independent two-sample t-test for comparing means
Assumptions:
• Two 'independent' groups;
• Continuous outcome variable;
• Outcome data in both groups are Normally distributed;
• Outcome data in both groups have similar standard deviations.

Can check conditions are met by:
• Plotting two histograms, one for each group, to assess Normality; it doesn't have to be perfect, just roughly symmetric;
• Calculating standard deviations – as a rough estimate, one should be no more than twice the other;
• However the t-test is very robust to violations of the assumptions of Normality and equal variances, particularly for moderate to large sample sizes.

Steps:
1. First calculate the mean difference between groups
2. Calculate the pooled standard deviation
3. Then calculate the standard error of the difference between the two means
4. Calculate the test statistic t
5. Compare the test statistic with the t distribution with n1 + n2 − 2 degrees of freedom
6. This gives us the probability of observing the test statistic t, or one more extreme, under the null hypothesis

Page 65
Formulae for two sample t-test
pooled SD = √[ ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) ]
Standard Error (SE) of difference = pooled SD × √(1/n1 + 1/n2)
t = observed difference in means (d) / standard error of the difference (in means) = (x̄1 − x̄2) / SE(x̄1 − x̄2)
where n1 = number of subjects in 1st sample, s1 = standard deviation of 1st sample, n2 = number of subjects in 2nd sample, s2 = standard deviation of 2nd sample

Degrees of freedom
• The test statistic t = (x̄1 − x̄2) / SE(x̄1 − x̄2)
• is compared with the t distribution with n1 + n2 − 2 degrees of freedom
• This gives us the probability of observing the test statistic t, or one more extreme, under the null hypothesis

The data: Group Statistics
Leg ulcer free time (weeks) | Group | N | Mean | Std. Deviation | Std. Error Mean
 | Clinic | 120 | 20.1 | 18.5 | 1.7
 | Home | 113 | 14.2 | 17.6 | 1.7

Calculations
where n1 = number of subjects in 1st sample = 120, s1 = standard deviation of 1st sample = 18.5, n2 = number of subjects in 2nd sample = 113, s2 = standard deviation of 2nd sample = 17.6
pooled SD = √[ ((120 − 1)18.5² + (113 − 1)17.6²) / (120 + 113 − 2) ] = 18.07
SE of difference = SE(d) = 18.07 × √(1/120 + 1/113) = 2.37
Mean difference = 20.1 − 14.2 = 5.9 weeks
t = (20.1 − 14.2) / 2.37 = 2.49 on 120 + 113 − 2 = 231 df
The probability of observing this test statistic, or one more extreme, under the null hypothesis is 0.014 using the t distribution on 231 df.
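The worked calculation above can be reproduced from the summary statistics alone. A minimal Python sketch (scipy assumed; not part of the original handout) implementing the pooled SD, SE and t formulae:

# Two independent samples t-test from summary statistics (illustrative sketch)
import math
from scipy import stats

n1, mean1, s1 = 120, 20.1, 18.5   # clinic group
n2, mean2, s2 = 113, 14.2, 17.6   # home (control) group

pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))  # about 18.07
se_diff = pooled_sd * math.sqrt(1 / n1 + 1 / n2)                              # about 2.37
t = (mean1 - mean2) / se_diff                                                 # about 2.49
df = n1 + n2 - 2                                                              # 231
p = 2 * stats.t.sf(abs(t), df)                                                # about 0.014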

SPSS Output
Group Statistics
Leg ulcer free time (weeks) | Group | N | Mean | Std. Deviation | Std. Error Mean
 | Clinic | 120 | 20.1 | 18.5 | 1.7
 | Home | 113 | 14.2 | 17.6 | 1.7

Independent Samples Test: Leg ulcer free time (weeks)
Levene's Test for Equality of Variances: F = 1.632, Sig. = .203
t-test for Equality of Means:
Equal variances assumed | t 2.485 | df 231 | Sig. (2-tailed) .014 | Mean Difference 5.9 | Std. Error Difference 2.4 | 95% CI of the Difference 1.2 to 10.5
Equal variances not assumed | t 2.489 | df 230.982 | Sig. (2-tailed) .014 | Mean Difference 5.9 | Std. Error Difference 2.4 | 95% CI of the Difference 1.2 to 10.5

Example 2: Answers
What does P = 0.014 mean?
–Your results are unlikely when the null hypothesis is true.
Is this result statistically significant?
–The result is statistically significant because the P-value is less than the significance level (α) set at 5% or 0.05.
You decide?
–That there is sufficient evidence to reject the null hypothesis and accept the alternative hypothesis that there is a difference in mean ulcer free weeks between the Intervention and Control Groups

Page 66
Confidence interval for difference between means
• The 100(1 − α)% confidence interval for the difference in the two population means is
d − [t1−α/2 × SE(d)] to d + [t1−α/2 × SE(d)]
where d = x̄1 − x̄2
• and t1−α/2 is taken from the t distribution with n1 + n2 − 2 degrees of freedom,
e.g. 233 − 2 = 231 df, so for a 95% confidence interval t0.025 = 1.970

Example 2: CI for the difference between two population means
The 95% CI for the difference in the two population means is then given by:
5.9 − (1.970 × 2.37) to 5.9 + (1.970 × 2.37)
= 1.2 to 10.5 weeks
Therefore we are 95% confident that the true population mean difference in ulcer free weeks between Clinic and Home treated patients lies somewhere between 1.2 and 10.5 weeks, but our best estimate is 5.9 weeks.
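Continuing the sketch above, the same 95% confidence interval can be obtained by looking up t0.975 on 231 df with scipy rather than from tables (illustrative only):

# 95% CI for the difference in means (continues the sketch above)
from scipy import stats

diff, se_diff, df = 5.9, 2.37, 231
t_crit = stats.t.ppf(0.975, df)      # about 1.970
lower = diff - t_crit * se_diff      # about 1.2 weeks
upper = diff + t_crit * se_diff      # about 10.5 weeks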

Histograms for ulcer free time by study group
Figure: Histograms of leg ulcer free time (weeks) for the Clinic group (n=120) and the Control group (n=113); frequency is plotted against leg ulcer free time (weeks) for each group.

Non-parametric methods: comparison of 2 independent groups
• If the assumptions underlying the unpaired t-test do not hold then we can use a non-parametric test
• The non-parametric equivalent of the unpaired t-test is the Mann-Whitney U test
• More general test of the null hypothesis that the distribution of the outcome variable in the two groups is the same

Null & Alternative hypothesis for the Mann-Whitney test
State the null and alternative hypothesis:
H0: No tendency for patients in one group to exceed patients in the other group with respect to the number of ulcer free weeks
HA: There is a tendency for patients in the intervention group to exceed patients in the control group with respect to the number of ulcer free weeks

Steps:
1. First arrange all the data in increasing order (smallest observation to the largest)
2. Choosing one group, for each observation in that group count how many observations in the other group lie below it
3. Add all of these numbers up to get the U-statistic
4. Compare the U test statistic with its theoretical distribution under the null hypothesis (that the samples come from the same population)
5. From this we can find out the probability of observing the test statistic U, or a value more extreme, under the null hypothesis
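The counting procedure described in the steps above can be written out directly. A short Python sketch (toy data invented for illustration; scipy assumed) shows the hand count alongside scipy's built-in version:

# Mann-Whitney U: hand count versus scipy (toy data, illustrative only)
from scipy import stats

group_a = [3, 7, 14, 22, 30]   # e.g. ulcer free weeks in one group (hypothetical)
group_b = [1, 2, 6, 10, 18]    # e.g. ulcer free weeks in the other group (hypothetical)

# For each observation in group A, count how many observations in group B lie below it
U = sum(sum(b < a for b in group_b) for a in group_a)

# result.statistic is the U for the first group in recent scipy versions
result = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')
print(U, result.statistic, result.pvalue)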

Page 67
SPSS output:
Ranks
Leg ulcer free time (weeks) | Group | N | Mean Rank | Sum of Ranks
 | Clinic | 120 | 126.87 | 15224.00
 | Home | 113 | 106.52 | 12037.00
 | Total | 233 | |

Test Statistics (Grouping Variable: Group)
Mann-Whitney U 5596.000 | Wilcoxon W 12037.000 | Z −2.388 | Asymp. Sig. (2-tailed) .017
P value: as the value of 0.017 is less than the significance level (α) set at 0.05 or 5%, this means that the result obtained is unlikely when the null hypothesis is true. Thus there is sufficient evidence to reject the null hypothesis and accept the alternative hypothesis that there is a difference in ulcer free weeks between the clinic and control groups.

What happens when there are more than 2 independent groups?
• The Analysis of variance technique (ANOVA)
–Parametric test
–Similar to the t-test but extended for more than two groups
• Kruskal-Wallis Test
–Non-parametric equivalent of analysis of variance
–Similar to the Mann-Whitney U test but extended for more than two groups
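A minimal sketch of the two extensions mentioned above (toy data invented for illustration; scipy assumed):

# One-way ANOVA and Kruskal-Wallis test for three groups (toy data, illustrative only)
from scipy import stats

group1 = [20, 25, 30, 28]
group2 = [14, 18, 22, 16]
group3 = [10, 12, 15, 11]

f_stat, p_anova = stats.f_oneway(group1, group2, group3)   # parametric: analysis of variance
h_stat, p_kw = stats.kruskal(group1, group2, group3)       # non-parametric: Kruskal-Wallis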

Summary
• Research questions need to be turned into a statement for which we can find evidence to disprove - the null hypothesis
• The study data is reduced down to a single probability - the probability of observing our result, or one more extreme, if the null hypothesis is true (the P-value)
• We use this P-value to decide whether or not to reject the null hypothesis
• But we need to remember that 'statistical significance' does not necessarily mean 'clinical significance'
• Confidence intervals should always be quoted with a hypothesis test to give the magnitude and precision of the difference

Summary
You should now know about:
• Basic methods appropriate for comparing two groups
–Paired data
–Unpaired data
• The difference between parametric and non-parametric tests
You should now be able to:
• Understand and apply the following statistical tests:
–Paired t-test & Wilcoxon signed rank test (matched pairs test)
–Unpaired t-test & Mann-Whitney U test

Questions?

Page 68 Correlation and Regression

Correlation and regression: When two variables meet
Dr Jenny Freeman

This week we will look at how to:
Investigate the relationship between two continuous variables measured on the same sample of subjects

At the end of the session, you should know about:
• Approaches to analysis for simple continuous bivariate data
At the end of the session, you should be able to:
• Construct and interpret scatterplots for quantitative bivariate data
• Identify when to use correlation
• Interpret the results of correlation coefficients
• Identify when to use linear regression
• Interpret the results for linear regression

Recap: Types of Data
Categorical (Qualitative)
• Nominal (no natural ordering)
–Haemoglobin types
–Sex
• Ordered categorical (ordinal)
–Anaemic / borderline / not anaemic
–Grades of breast cancer
Quantitative (numerical)
• Count (can only take certain values)
–Number of positive tests for anaemia
–Number of children in a family
• Continuous (limited only by accuracy of instrument)
–Haemoglobin concentration (g/dl)
–Height

What do we mean when we talk about bivariate data?
• Data where there are two variables
• The two variables can be either categorical or numerical
• This session we are dealing with continuous bivariate data, i.e. both variables are continuous
• We have also looked at categorical bivariate data…

…categorical bivariate data: example from the Risk lecture
 | Baycol | Other statins
Number who die from rhabdomyolysis | 2 | 1
Number alive or died from other causes | 999,998 | 9,999,999
Total | 1,000,000 | 10,000,000
• There are two binary (categorical) variables
–Type of statin (Baycol / other)
–Whether died of rhabdomyolysis or not
• From these data we examined the risk of death from rhabdomyolysis on Baycol compared to other statins

Page 69
Association between two variables: correlation or regression?
There are two basic situations:
1. There is no distinction between the two variables. No causation is implied, simply association:
− use correlation
2. One variable Y is a response to another variable X. You could use the value of X to predict what Y would be:
− use regression

Correlation: are two variables associated?
When examining the relationship between two continuous variables ALWAYS look at the scatterplot, as you will be able to see visually the pattern of the relationship between them

Teenage pregnancy example

Figure: Relationship between teenage pregnancy and adult smoking rates, East Midlands local authorities, 2001 (pregnancy rate per 1000 women aged 15-17 plotted against estimated smoking prevalence (%))

Teenage pregnancy example
• There appears to be a linear relationship between adult smoking rates and teenage pregnancy
• So, now what do you do….? ….. could calculate the correlation coefficient
• This is a measure of the linear association between two variables
• Used when you are not interested in predicting the value of one variable for a given value of the other variable
• Any relationship is not assumed to be a causal one – it may be caused by other factors

Teenage pregnancy example

Figure: Relationship between teenage pregnancy and adult smoking rates, East Midlands local authorities, 2001; correlation coefficient = 0.94 (pregnancy rate per 1000 women aged 15-17 plotted against estimated smoking prevalence (%))

Properties of Pearson's correlation coefficient (r)
• r must be between −1 and +1
• +1 = perfect positive linear association
• −1 = perfect negative linear association
• 0 = no linear relation at all

Page 70 Consider the following graphs, what do you think their value for r could be?

(Five scatterplots, labelled A to E, are shown on the slides.)
Answers: A = 1.0, B = 0.8, C = 0.0, D = −0.8, E = −1.0

Page 71 Confidence interval for the correlation coefficient • Complicated to calculate by hand, but useful

Hypothesis tests
• Can be done; the null hypothesis is that the population correlation r = 0, but this is not very useful as an estimate of the strength of an association, because it is influenced by the number of observations (see next slide)…..

Sample size | Value at which the correlation coefficient becomes significant at the 5% level
10 | 0.63
20 | 0.44
50 | 0.28
100 | 0.20
150 | 0.16

And so what do correlations of 0.63 and 0.16 look like?
Figure: scatterplot with Correlation = 0.63, p = 0.048 (n = 10)
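The values in this table follow from the t test for a correlation coefficient given at the end of this handout: on n − 2 degrees of freedom, r becomes significant at the 5% level once it exceeds t/√(t² + n − 2). A short Python sketch (scipy assumed; not part of the original slides) reproduces the table:

# Critical value of r at the 5% significance level for various sample sizes
import math
from scipy import stats

for n in [10, 20, 50, 100, 150]:
    t_crit = stats.t.ppf(0.975, n - 2)               # two-sided 5% point of t on n - 2 df
    r_crit = t_crit / math.sqrt(t_crit**2 + n - 2)   # invert t = r * sqrt(n - 2) / sqrt(1 - r^2)
    print(n, round(r_crit, 2))                       # 0.63, 0.44, 0.28, 0.20, 0.16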

And so what do correlations of 0.63 and 0.16 look like?
Figure: two scatterplots – Correlation = 0.63, p = 0.048 (n = 10) and Correlation = 0.16, P = 0.04 (n = 150)

Teenage pregnancy example
Figure: Relationship between teenage pregnancy and adult smoking rates, East Midlands local authorities, 2001 (pregnancy rate per 1000 women aged 15-17 plotted against estimated smoking prevalence (%))

Page 72 Teenage pregnancy example

Teenage pregnancy example: null & alternative hypothesis
State the null and alternative hypothesis:
H0: No relationship or correlation between adult smoking and teenage pregnancy rates, i.e. population correlation coefficient (r) = 0.0
HA: There is a relationship or correlation between adult smoking and teenage pregnancy rates, i.e. population correlation coefficient (r) ≠ 0.0

Teenage pregnancy example
Figure: Relationship between teenage pregnancy and adult smoking rates, East Midlands local authorities, 2001; correlation coefficient = 0.94 (pregnancy rate per 1000 women aged 15-17 plotted against estimated smoking prevalence (%))

Example: Answers
The correlation coefficient is 0.94 (p < 0.001)
What does P < 0.001 mean?
–Your results are unlikely when the null hypothesis is true
Is this result statistically significant?
–The result is statistically significant at the 5% level because the P-value is less than the significance level (α) set at 5% or 0.05
You decide?
–That there is sufficient evidence to reject the null hypothesis and therefore you accept the alternative hypothesis that there is a correlation between adult smoking and teenage pregnancy rates

Points to note
• Do not assume causality - a different variable could have caused both to change together
• Be careful comparing r from different studies with different n
• Do not assume the scatterplot looks the same outside the range of the axes
• Avoid multiple testing
• Always examine the scatterplot!

Teenage pregnancy example
Figures: Relationship between teenage pregnancy and adult smoking rates, East Midlands local authorities, 2001 (pregnancy rate per 1000 women aged 15-17 plotted against estimated smoking prevalence (%)); two panels annotated with correlation coefficients: 0.94 in the first, and 0.52 together with 0.94 in the second.

Page 73
Association between two variables: Correlation or regression?
There are two basic situations:
1. There is no distinction between the two variables. No causation is implied, simply association:
− use correlation
2. One variable Y is a response to another variable X. You could use the value of X to predict what Y would be:
− use regression

Regression: quantifying the relationship between two continuous variables
Teenage pregnancy example:
If you believe that the relationship is causal, i.e. that the level of smoking in an area affects the teenage pregnancy rate for that area, you may want to:
• Quantify the relationship between smoking and the teenage pregnancy rate
• Predict on average what the pregnancy rate would be, given a particular level of smoking

Regression: quantifying the relationship between two continuous variables
Teenage pregnancy example:
However, in this case it would not be sensible, as both are mediated by deprivation. So let's look at the rates of teenage pregnancy by area deprivation. If we believe that deprivation is causally linked with teenage pregnancy we could:
• Quantify the relationship between deprivation and the teenage pregnancy rate
• Predict on average what the pregnancy rate would be, given a particular level of deprivation

Teenage pregnancy example:
Figure: Relationship between teenage pregnancy rates and a composite deprivation score - Local Authorities 1999-2001 (rate per 1000 women aged 15-17 plotted against composite deprivation score; higher values indicate increased deprivation)
There appears to be a linear relationship between deprivation and teenage pregnancy……
Ref: www.empho.org.uk/whatsnew/teenage-pregnancy-presentation.ppt

Teenage pregnancy example
Figure: Relationship between teenage pregnancy rates and a composite deprivation score – Local Authorities 1999-2001. Y axis: response variable (dependent variable) – rate per 1000 women aged 15-17; X axis: predictor / explanatory variable (independent variable) – composite deprivation score (higher values indicate increasing deprivation)

• Always plot the graph this way round, with the explanatory (independent) variable on the horizontal axis and the dependent variable on the vertical axis
• We try to fit the "best" straight line
• If the relationship is linear, this should give the best prediction of Y for any value of X

Page 74 Teenage pregnancy example

Figure: Relationship between teenage pregnancy rates and a composite deprivation score – Local Authorities 1999-2001. Y axis: response variable (dependent variable); X axis: predictor / explanatory variable (independent variable) – composite deprivation score (higher values indicate increasing deprivation)

Estimating the best fitting line
• The standard way to do this is using a method called least squares, using a computer.
• The method chooses a line so that the square of the vertical distances between the line and the points (averaged over all points) is minimised.

Estimating the best fitting line
The line can be represented numerically by an equation (the regression equation), which includes two coefficients: one for the intercept (the value of the dependent variable when the independent variable is equal to zero) and one for the slope (the average change in the dependent variable for a unit change in the x variable):
y = a + b x
(dependent variable = intercept + slope × independent variable)

Equation of the line
Y = a + bX
• b is the slope or gradient of the line: the amount of change in Y for a one unit change in X
• a is the intercept: the value of Y when X is zero

Teenage pregnancy example

Figure: Relationship between teenage pregnancy rates and a composite deprivation score – Local Authorities 1999-2001, with the fitted regression line (rate per 1000 women aged 15-17 against composite deprivation score, higher values indicating increasing deprivation). The slope is the average change in the y variable for a change of one unit in the x variable; the intercept is where the line crosses the y axis.

Assumptions for the linear regression model to be valid:
• The residuals are Normally distributed for each value of X (the predictor variable).
• The variance of Y is the same at each value of X.
• The relationship between the two variables is linear.
• You do not have to have X random or X Normally distributed.

Page 75
Residuals
• Residuals are the observed value minus the fitted value: Yobs − Yfit
i.e. the dashed lines on the previous slide
• Plots involving residuals can be very informative. They can:
− help assess if assumptions are valid
− help assess if other variables need to be taken into account

Teenage pregnancy example equation
Example: Pregnancy rate = 13.04 + 0.006 × deprivation score
− here, a = 13.04 (intercept)
− b = 0.006 (slope)
i.e. for every unit increase in deprivation score there are an additional 0.006 pregnancies per 1000 women aged 15-17 (or an extra 6 per million women aged 15-17)
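As a small illustration of using the fitted equation (the deprivation score of 3000 is an arbitrary example value, not taken from the slides):

# Using the fitted regression equation to predict (illustrative only)
a, b = 13.04, 0.006    # intercept and slope from the example above

def predicted_rate(deprivation_score):
    """Predicted teenage pregnancy rate per 1000 women aged 15-17."""
    return a + b * deprivation_score

print(predicted_rate(3000))    # 13.04 + 0.006 * 3000 = 31.04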

Prediction
• Regression slopes can be used to predict the value of the dependent variable for a particular value of the predictor / explanatory / independent variable
• The slope, b, indicates the strength of the relationship between x and y
• We are often interested in how likely we are to obtain our value of b if there is actually no relationship between x and y in the population
–One way to do this is to do a test of significance for the slope (b)

Teenage pregnancy example equation
• Often in papers, when presenting the results of a regression, you will see a quantity known as r² quoted
• This is the proportion of variance explained by the predictor variable and is a measure of the fit of the model to the data. It can be expressed as a percentage
• For our example the r² value is 0.646, thus 64.6% of the variability in the teenage pregnancy rate is explained by variation in the deprivation score
NB: This is the square of the correlation coefficient: 0.804² = 0.646

Teenage pregnancy example

Figure: Relationship between teenage pregnancy rates and a composite deprivation score - Local Authorities 1999-2001, with the regression line: pregnancy rate = 13.04 + 0.006 × deprivation score (rate per 1000 women aged 15-17 against composite deprivation score, higher values indicating increased deprivation)
Ref: www.empho.org.uk/whatsnew/teenage-pregnancy-presentation.ppt

Caveats
• Do not use the graph or regression model to predict outside of the range of observations
• Do not assume that just because you have an equation, X causes Y

Page 76 Teenage pregnancy example

Figure: Relationship between teenage pregnancy rates and a composite deprivation score - Local Authorities 1999-2001, showing two regression lines: pregnancy rate = 26.4 + 0.003 × deprivation score and pregnancy rate = 13.04 + 0.006 × deprivation score (rate per 1000 women aged 15-17 against composite deprivation score, higher values indicating increased deprivation)
Ref: www.empho.org.uk/whatsnew/teenage-pregnancy-presentation.ppt

Association between two variables: Correlation or regression? (1)
We have now learned that there are two basic situations:
1. There is no distinction between the two variables. No causation is implied, simply association:
− use correlation
2. One variable Y is a response to another variable X. You could use the value of X to predict what Y would be:
− use regression

Association between two variables: Correlation or regression? (2)
• Correlation is used to denote association between two quantitative variables. The degree of association is estimated using the correlation coefficient. It measures the level of linear association between the two variables
• Regression quantifies the relationship between two quantitative variables. It involves estimating the best straight line with which to summarise the association. The relationship is represented by an equation, the regression equation. It is useful when we want to describe the relationship between the variables, or even predict a value of one variable for a given value of the other

You should now know about:
• Approaches to analysis for simple continuous bivariate data – correlation and regression
You should now be able to:
• Construct and interpret scatterplots for quantitative bivariate data
• Identify when it is appropriate to use correlation
• Interpret the results of correlation coefficients
• Identify when it is appropriate to use linear regression
• Interpret the results of a linear regression

Formula for Pearson’s r

• Given a set of n pairs of observations (x1,y1), (x2,y2), …, (xn,yn), the Pearson correlation coefficient r is given by:
r = Σ(xi − x̄)(yi − ȳ) / √[ Σ(xi − x̄)² × Σ(yi − ȳ)² ]
(with the sums running from i = 1 to n)
• For this equation to work X and Y must both be continuous variables (and Normally distributed if the CI and hypothesis test are to be valid).
• It is easier to do it on a computer!

Questions?

Page 77
Hypothesis test for r
• To test whether the population correlation coefficient, r, is significantly different from zero, calculate:
SE(r) = √[ (1 − r²) / (n − 2) ]
and the test statistic t = r / SE(r)
• Compare the test statistic with the t distribution with n − 2 degrees of freedom.

Confidence interval for r
• A 100(1 − α)% CI for the population correlation coefficient, r, is:
r − t1−α/2 × SE(r) to r + t1−α/2 × SE(r)
• where t1−α/2 is taken from tables of the t distribution with n − 2 degrees of freedom.
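A minimal Python sketch of these formulae (the x and y values are invented for illustration; numpy and scipy assumed), with scipy's built-in function as a check:

# Pearson's r, its standard error and test statistic (hypothetical data, illustrative only)
import numpy as np
from scipy import stats

x = np.array([17.0, 20.0, 24.0, 28.0, 33.0])   # e.g. smoking prevalence (%), hypothetical
y = np.array([25.0, 32.0, 41.0, 47.0, 60.0])   # e.g. pregnancy rate per 1000, hypothetical
n = len(x)

r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

se_r = np.sqrt((1 - r**2) / (n - 2))           # SE(r)
t = r / se_r                                   # compare with t on n - 2 df
p = 2 * stats.t.sf(abs(t), n - 2)

r_check, p_check = stats.pearsonr(x, y)        # should agree with r and p above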

Formula for estimating a and b
• Given a set of n pairs of observations (x1,y1), (x2,y2), …, (xn,yn)
• The regression coefficient b of y given x is:
b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
• The intercept a is estimated as a = ȳ − b x̄

Significance test and CI for b
• To test whether b is significantly different from zero, calculate:
Exy = Σ(yi − ȳ)² − b² Σ(xi − x̄)²
Exx = (n − 2) Σ(xi − x̄)²
and SE(b) = √(Exy / Exx)
• Compare t = b / SE(b) with a t distribution with n − 2 degrees of freedom.
• A 100(1 − α)% CI for the population slope, b, with n − 2 degrees of freedom, is given by:
b − t1−α/2 × SE(b) to b + t1−α/2 × SE(b)
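The same quantities can be computed directly from these formulae. The sketch below (same invented data as the correlation sketch above; numpy and scipy assumed) also checks the slope and intercept against scipy.stats.linregress:

# Least squares slope, intercept and SE(b) from the formulae (illustrative only)
import numpy as np
from scipy import stats

x = np.array([17.0, 20.0, 24.0, 28.0, 33.0])
y = np.array([25.0, 32.0, 41.0, 47.0, 60.0])
n = len(x)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

E_xy = np.sum((y - y.mean()) ** 2) - b**2 * np.sum((x - x.mean()) ** 2)
E_xx = (n - 2) * np.sum((x - x.mean()) ** 2)
se_b = np.sqrt(E_xy / E_xx)

t = b / se_b                         # compare with t on n - 2 df
check = stats.linregress(x, y)       # check.slope, check.intercept, check.stderr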

Page 78 Appendix

Introduction to SPSS for Windows

1.1 Introduction

SPSS (Statistical Package for the Social Sciences) is a comprehensive and flexible statistical analysis and data management system, first introduced in 1970. SPSS can take data from almost any type of file and use them to generate tabulated reports, charts and plots of distributions and trends, descriptive statistics and complex statistical analyses.

SPSS for Windows brings the full power of the mainframe version of SPSS to the personal computer environment. It will enable you to perform many analyses on your PC that were once possible only on much larger machines. You can now analyse large data files with thousands of variables. The algorithms are identical to those used in the SPSS software on mainframe computers, and the statistical results will be as precise as those computed on a mainframe.

SPSS for Windows provides a user interface that makes statistical analysis more accessible for the casual user and more convenient for the experienced user. Simple menus and dialogue box selections make it possible to perform complex analyses without typing a single line of command syntax.

1.2 Opening SPSS

SPSS for Windows is a Windows package so you need to be in the Windows environment. Depending on how your computer is set up you can start SPSS in several ways. If the SPSS icon is already displayed on your desktop, then simply position the mouse cursor over the icon and double click the left mouse button.

Figure 1.1a

Page 79 Alternatively, if the SPSS icon is not shown on the computers Desktop or Taskbar, you can launch SPSS via the Start button, Programs, SPSS for Windows menus.

Figure 1.1b

Once SPSS is opened this should bring up a screen similar to the one below.

Figure 1.2

The SPSS Data Editor window has two sub-windows, a data window and a variable view window. They are labelled Data View and Variable View. You can switch between the

Page 80 Windows by clicking on the Data View and Variable View tabs in the bottom left corner of the window

Data View: Contains a data spreadsheet that works specifically with SPSS and only one can be open at a time. Variable View: Contains a data spreadsheet that has the variable definitions for each of the variables in the SPSS data file.

A third window, which is not shown, is the SPSS output window.

Output1: Has the output from any procedure you run. You can have one or more output windows open but output will only go in one at a time.

As SPSS is a Windows package there is a menu bar across the top; the menu features File, Edit, Window and Help are similar to those in other Windows applications. If you click on Window you will see that you can choose which window to bring to the front.

2. Entering Data into SPSS

Data are stored in the Data View window as variables in distinct columns. On first entering SPSS, all columns are empty and labelled in light grey ‘var’. Row numbers are given down the left side of the window. Data from only one person or unit should be stored in each row.

There are two stages to entering data into SPSS. The first is to define what the variables look like. The second is to enter the data.

2.1 Defining a Variable

Define a variable by clicking on the Variable View tab option on the bottom left of the SPSS Data Editor window to get the following screen:

Figure 2.1

Page 81

On the first row in the Name column type the variable name of your choice, for example patid (beginning with a letter and no longer than 8 characters, with no spaces) and press return.

The other columns (Type, Width, Decimals, Label, Values, Missing, Columns, Align etc) will now have their default values.

Figure 2.2

If you click on the button labelled Type... you will get the dialogue box below:

Figure 2.3

Page 82 The top four formats are numeric formats with various appearances. Date format is not just one set format but also 27 that appear in a scroll list when the date button is selected. Dollar deals with American Currency (SPSS is American) and custom currency you are unlikely to use. String deals with all alphanumerics that do not fit into any of the above criteria. Try to choose as high up the list as possible for any variable as flexibility increases the higher up the list you go.

Width and Decimals specify the maximum number of characters allowed and the number of decimal places. For Numeric variables the default is eight numbers with two decimal places.

You can also specify Labels. These allow SPSS to print out more information about what a variable is. For example the variable ‘weight’ can have the label ‘Weight in Kilograms’, or the variable patid could have the Label “Unique Patient Identification Number”.

If you click on the Values cell you will get the dialogue box below:

Figure 2.4

Most analyses in SPSS cannot handle string variables. To overcome this SPSS has Value Labels. These allow you to substitute numbers for the strings and then attach a label to each value of the variable. For example enter ‘0’ or ‘1’ instead of M or F for Male/Female.

Missing values are where a unit’s value is not known for some reason, e.g. the variable is inapplicable or the variable was not measured. You can in SPSS for Windows have several different missing values for a variable. SPSS needs to know what values are assigned to missing values in a variable. If SPSS is not given this information separately it will treat the missing value as a valid value.

If you click on the Missing cell in the Variable View dialogue box you will get the dialogue box below:

Page 83 Figure 2.5

You can have discrete (whole integer) missing values, e.g. "999", or a range of missing values.

You can repeat the process with all the other variables in your dataset.

When the definitions of all the variables in your dataset are complete, you are now ready to enter the actual data so click on the Data View tab (in the bottom left of the screen) to return to the data entry window.

2.2 Entering Data

Once the variables have been defined, entering data in SPSS is similar to entering data into a spreadsheet. Make sure your active screen is the Data View screen then place the cursor in the cell where you want to enter data. Type the value you want and press Enter. This will enter the value in the cell and take you down to the cell below. You can move from cell to cell using the arrow keys.

2.3 Saving Data

From the data window, data is saved as an SPSS data file by using the menu option File, followed by Save or if you want to change the name or format of the data Save As....

SPSS data files are saved with the .sav extension.

SPSS output files are saved with the .spo extension.

SPSS syntax files are saved with the .sps extension.

Page 84 Figure 2.6

Data may be retrieved from an SPSS data file. This is done by again using the File menu option from the data window, then Open. The Data... option will then prompt you for a filename.

Data stored in formats other than SPSS for Windows may be retrieved by changing the Files of type option at the bottom of the Open file window.

Figure 2.7

3. Transforming Variables

We often transform data by taking the logarithm, square root, reciprocal, or other such function of the data. We then analyse the transformed data rather than the untransformed or raw data. We

Page 85 do this because many statistical techniques, such as t tests, regression, and analysis of variance, require that data follow a distribution of a particular kind. The observations themselves must come from a population that follows a normal distribution, and different groups of observations must come from populations that have the same variance or standard deviation.

Many biological variables do follow a normal distribution with uniform variance. Many of those which do not can be made to do so by a suitable transformation. Fortunately, a transformation which makes the data follow a normal distribution often makes the variance uniform as well, and vice versa.

The Transform menu allows you to alter variables in SPSS.

Figure 3.1: The Transform menu. The callouts on the figure indicate the options for calculations, for grouping, for counting occurrences, for ranks and quartiles, and for a method of converting strings to numbers and values.

If you select the Compute... option you will get the following dialogue box:

Page 86 Figure 3.2

Enter formula for the calculation in this box

List of functions available in SPSS

Click on this button

Variable that will hold the outcome of the formula. Click here if the calculation is only for certain cases.

It is possible to do a vast range of calculations using this screen, not just simple mathematical formulae (e.g. logical calculations and standard string edits).

It would be possible to group a continuous variable into categories using repeated transforms or complicated formulae; however, SPSS has a much neater way of doing this: Recode.

For example, suppose we wish to recode the continuous variable age into several ordered categories, i.e. aged 0 - 9 = 1; 10 - 19 = 2; 20 - 29 = 3 etc.

To get to recode you need to select the Recode option from the Transform menu. You will then be offered to recode Into Same Variables... or Into Different Variables.... Always select Into Different Variables... unless you are VERY CONFIDENT that what you recode is going to be permanent. If you recode Into Same Variables... you cannot undo the recode.
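If the same data were later exported to Python, an equivalent grouping could be done with pandas (a sketch under that assumption; the ages shown are invented):

# Grouping a continuous age variable into ten-year bands (equivalent logic, illustrative only)
import pandas as pd

ages = pd.Series([3, 12, 25, 31, 47, 68])   # hypothetical ages
age_group = pd.cut(ages,
                   bins=[0, 10, 20, 30, 40, 50, 60, 70],
                   labels=[1, 2, 3, 4, 5, 6, 7],
                   right=False)   # right=False gives 0-9 = 1, 10-19 = 2, 20-29 = 3, and so on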

Page 87 Figure 3.3

You will then get the following dialogue box:

Figure 3.4

Enter name of variable that will contain the group values and press change

Select the variables with values that require grouping. Press this button to set the groupings of values, then click OK.

When you click on the Old and New Values... box you will get the following dialogue box:

Page 88 Figure 3.5

Into this you need to enter the values that are being grouped into the new groups. Press Continue when you have finished.

4. Graphs

When analysing data it is important to look at the data graphically, as well as carrying out statistical tests. SPSS has a wide range of graphs available.

4.1 Graph Menu

Most of the graph procedures are available from the Graphs menu. There are 18 types of graph available from the Graphs menu, including bar charts, line charts, pie charts, boxplots, scatter plots, histograms and normal plots.

Figure 4.1

Page 89 4.2 Bar Charts

For categorical variables (such as sex, group, age category, BMI category etc.), it is straightforward to present the number in each category, usually indicating the frequency or percentage of the total number of subjects. When shown graphically this is called a bar diagram or a bar chart.

By selecting Bar... you can create a bar chart. You will get something similar to the following dialogue box:

Figure 4.2

Click on the Simple button icon on the left and then the Define button on the right. You should then get the Define Simple Bar: Summaries for Groups of Cases dialogue box.

Figure 4.3

Page 90 You can make the bars represent other summary functions such as the percentage of cases (% of cases) or the cumulative number of cases (Cum. n of cases) rather than the default number of cases (N of cases).

You may also want to give your graph a title, so click on the Titles… button.

4.3 Histograms

Bar charts are inappropriate for continuous variables (such as age, height, weight, BMI etc.). The equivalent graph for a continuous variable is a histogram. In a histogram the number (or frequency) of observations is plotted for different values or groups of values. Select Histogram... from the Graphs menu and the dialogue box below will come up:

Figure 4.4

4.4 Box plots

Box plots or box-and-whisker plots are a graphical method of displaying the important characteristics of a set of continuous observations. The display is based on a five-number summary of the data, with the thick black horizontal line in the box indicating the median value, with the ‘box’ part covering the inter-quartile range, and the ‘whisker’ extending to include all but outside observations, these being indicated separately.

SPSS box plots include two categories of cases with outlying values. Cases with values more than 3 box-lengths from the upper or lower edge of the box are called extreme values (marked with a *). Cases with values that are between 1.5 and 3 box lengths from the upper or lower edge of the box are called outliers (and designated with O).

For a single set of data a histogram is more informative, but several sets of data can be summarised economically using the box plot. (For example boxplots of age by sex).

To obtain a box plot select Boxplot... off the Graphs menu to obtain the following dialogue box:

Page 91 Figure 4.5

Click on the Simple button icon on the left and then the Define button on the right. You should then get the Define Simple Boxplot: Summaries for Groups of Cases dialogue box.

Figure 4.6

You could select age as the Variable and sex for the Category Axis.

4.5 Scatter diagram or Scatterplots

The relationship between two continuous variables (such as height and weight) may be shown graphically in a scatter diagram or scatterplot. This is a simple graph in which the values of one variable are plotted against those of another.

To do this you need to select Scatter... from the Graphs menu to obtain the following dialogue box:

Page 92 Figure 4.7

Most of the time a simple scatterplot is all we need. So click on the Simple button icon on the left and then the Define button on the right. You should then get the Simple Scatterplot dialogue box.

Figure 4.8

The Y Axis is the vertical axis and the X Axis is the horizontal axis. You can also have different markers for different groups of subjects (e.g. males and females) by selecting sex as the Set Markers by: variable.

5. Statistics

If you are using SPSS you are almost certainly going to want to carry out some sort of statistical analysis. To do this you have to use the Analyze menu.

Page 93 The type of statistical analysis fundamentally depends on what was the main purpose of the study. In particular, what are the main questions we want answering? We should try and keep the statistical analysis as simple as possible, but we should be aware of bias and confounding! The statistical analysis depends on the type of outcome data you have collected, for example whether or not the data is categorical or numerical and your pre-determined study hypotheses.

5.1. What type of test?

The choice of statistical test depends on the answers to the five key questions described below:
1. Aims and objectives
2. Hypothesis to be tested
3. Type of outcome data
4. Distribution of the outcome data
5. Summary measure for outcome data

Given the answers to all these five questions we should now be able to proceed with an appropriate statistical analysis of the data we have collected.

5.2 Preparing to analyse data

Before analysing a set of data it is important to check as far as possible that the data seem correct. Errors can be made when measurements are taken (data collection), when the data are originally recorded, when they are transcribed from the original source (such as from hospital notes), or when being typed into a computer (data transfer). We cannot know what is correct, so we restrict our attention to making sure that the recorded values are plausible. This process is called data checking (or data cleaning). We cannot expect to spot all transcription and data entry errors, but we hope to find the major errors. Since the data is being analysed on a computer, the checking should take place after the data have been entered on the computer.

Data checking & cleaning
Errors in recorded data are common. Data checking aims to identify and, if possible, rectify errors in the data. Clearly errors in the original data cannot be rectified, but errors introduced at a later stage can be put right if the original record is consulted. Checking the data is likely to reveal some observations that, while plausible, are distant from the main body of the data. It is also likely to reveal that a number of intended observations are missing.

Logical and Range checks For categorical data it is simple to check that all recorded data values are plausible because there are a fixed number of pre-specified values. For example, if we have two codes for gender, as follows: 1 = female and 2 = male, then we expect to find only values 1 and 2 in the data, except for any subjects with missing information. If missing values are coded as 9, then we know that any gender coded as 0, 3, 4, 5, 6, 7 or 8 is clearly wrong.

For continuous measurements we cannot usually identify precisely which values are plausible and which are not, nor is it usually necessary to do so. However, it should always be possible to specify lower and upper limits on what is reasonable for the variable concerned. For example, with age we may use limits of 0 and, say, 120. Age values above 100 should be checked since, although these are possible, they are unlikely. Values remaining outside the pre-specified range must be left as they are, or recorded as 'missing' if they are felt to be impossible rather than just unlikely.

Missing data

Page 94 Another by-product of checking your data is that any missing observations will be identified. A common device is use codes such as 9, 99, 999 or 99.9, according to the nature of the variable, although some computer programs, such as SPSS allow . to indicate a missing observation. If a numeric missing value is used in SPSS it is essential to identify the value as a ‘user defined’ missing value before analysing the data. It is easy to forget that one or two values are missing, perhaps coded as 999, when carrying out an analysis. The effects on the analysis can be severe. The advantage of using . or a system missing value is that there is no danger that subsequent analyses will treat the missing value code as a real observation.

Why are data missing? It is worth thinking about why the data are missing. In particular we ought to know if there is a reason related to the nature of the study. As with impossible values, it may be possible to check with the original source of the information that missing observations are really missing. Frequently values are missing essentially at random, for reasons not related to the study.

5.3 Statistics or Analyze menu

The statistics or Analyze menu appears below. There are 17 options on it. Each of these leads to further sub-menus with various statistical techniques.

Figure 5.1

A brief selection of the various techniques and tests are described below.

5.4 Describing data - Frequencies

Frequencies… in SPSS is the technique commonly used for getting a basic description of the data. It will not only produce a frequency count but will also calculate a wide range of statistics. These statistics include percentile values and measures of central tendency (e.g. mean, median and mode), dispersion (standard deviation, variance, range, minimum and maximum) and distribution (skewness and kurtosis), and it can also produce bar charts and histograms.

Page 95 Get to the frequencies dialogue box by clicking on Analyze and selecting Descriptive Statistics from the options. Then select the top option that says Frequencies.... This will bring up the following dialogue box: Figure 5.2

Some other descriptive statistics options available under the Descriptive Statistics sub menu include Descriptives... and Explore....

Figure 5.3

Descriptives... produces summary statistics but not charts and the frequency count. Explore... produces box plots, stem and leaf graphs, histograms and normality plots with tests in addition to descriptive statistics, all by separate groups if required.

Page 96 5.5 Describing data - Tables

For most studies and for RCTs in particular, it is good practice to produce a table or tables that describe the characteristics of your sample. So one of the first analyses that should be carried out with the data from a two group RCT is to summarize the entry or baseline characteristics of the two groups. It is important to show that the groups are similar with respect to variables that may affect the patient’s response.

Continuous variables (such as age, area of ulcer and maximum duration of current ulcer) can be summarised using the mean and standard deviation (SD) or the median and a percentile range, say the interquartile range (25th to 75th percentile). The latter approach is preferable when continuous measurements have an asymmetrical distribution. The standard error (SE) is not appropriate for describing variability. For categorical data (such as gender) and ordered categorical data (such as stages of disease I to IV) the calculation of means and standard deviations is incorrect: instead proportions should be reported. When percentages are given the denominator should always be made clear.

It is much easier to scan numerical results down columns rather than across rows, and so it is better to have different types of information (such as means and standard deviations) in separate columns. The number of observations should be stated for each result in a table.

For continuous variables such as age, area of ulcer and maximum duration of current ulcer it is easier to use the Analyze > Tables > Basic Tables… option to produce a table of summary descriptive statistics such as means, medians, standard deviations and ranges, separately for each group.

Figure 5.4: The Tables menu in SPSS

Page 97 Figure 5.5: Continuous data - The Basic Tables… menu in SPSS

Conversely for categorical variables such as sex, marital status and mobility it is easier to use the Analyze > Tables > General Tables… option to produce a table of summary descriptive statistics for categorical outcomes such the number and proportion in the sample with the value, separately for each group.

Figure 5.6: Categorical data - The General Tables… menu in SPSS

Page 98 Examples of tables in SPSS

Baseline characteristics of the patients recruited to the leg ulcer trial

 | Clinic N | Clinic Mean (SD) | Home N | Home Mean (SD)
Age (years) | 120 | 73.8 (10.9) | 113 | 73.2 (11.6)
Baseline ulcer area (cm²) | 117 | 16.2 (28.9) | 100 | 16.9 (40.8)
Maximum duration of current ulcer (months) | 118 | 27.5 (53.8) | 111 | 29.7 (82.3)
First ulcer began (years ago) | 113 | 14.2 (14.9) | 108 | 12.9 (15.8)

Baseline characteristics of patients recruited to the leg ulcer trial

 | Clinic | Home
Sex: male | 43 (35.8%) | 35 (31.0%)
Sex: female | 77 (64.2%) | 78 (69.0%)
Sex: Total | 120 (100.0%) | 113 (100.0%)
Marital status: Married | 54 (46.6%) | 50 (45.5%)
Marital status: Single | 14 (12.1%) | 11 (10.0%)
Marital status: Div/Sep | 7 (6.0%) | 4 (3.6%)
Marital status: Widowed | 41 (35.3%) | 45 (40.9%)
Marital status: Total | 116 (100.0%) | 110 (100.0%)
Baseline mobility: Walk freely | 52 (43.3%) | 57 (50.4%)
Baseline mobility: Walk with aid | 66 (55.0%) | 56 (49.6%)
Baseline mobility: Chair or bedbound | 2 (1.7%) | –
Baseline mobility: Total | 120 (100.0%) | 113 (100.0%)

5.6 Comparing groups - categorical data: Crosstabs

The Crosstab command in SPSS is the technique for producing cross tabulation or frequency table of two variables. As this technique treats the values as categories it is only sensible to use this with categorical data. (E.g. a cross tabulation of group by type of leukaemia for the GvHD data or sex by age category in the cystic fibrosis data). Each cell of the frequency table corresponds to a particular combination of characteristics relating the two classifications. There is a single, general approach to the analysis of all frequency tables. However, in practice the method of analysis varies according to: the number of categories; whether the categories are ordered or not; the number of independent groups of subjects and the nature of the question being asked.

Page 99

To obtain cross-tabulation start with the Analyze menu and choose Descriptive Statistics and then select Crosstabs.... This will bring up the following dialogue box:

Figure 5.7

If you do not click on the Cells… button, you can only get cell counts in your table. If you click on the Cells... button you get the following dialogue box, in which you can select what is displayed in a cell. It is easier to see what is going on by expressing the counts as percentages of either the row or column totals.

Figure 5.8

Page 100

The analysis of frequency tables is largely based on hypothesis testing. The null hypothesis is that the two classifications (e.g. group and leukaemia type) are unrelated in the relevant population (bone marrow transplant patients). We compare the observed counts or frequencies with what we would expect if the null hypothesis were true. We base our calculation of the expected frequencies on the distribution of the variables in the whole sample, as indicated by the row and column totals.

If the null hypothesis were true and the two variables unrelated (i.e. independent), then the probability of an individual being in a particular row is independent of which column they are in. The probability of being in a particular cell of the table is thus simply the product of the probabilities of being in the row and column containing that cell. These probabilities, or expected frequencies, are estimated using the observed proportions. The expected frequency in each cell is thus the product of the relevant row and column totals divided by the sum of all the observed frequencies in the table (i.e. the sample size). The appropriate test statistic X² (called Pearson Chi-square in SPSS) is obtained by calculating the sum of the quantities (O − E)²/E for all the cells in the table, where O and E denote the observed and expected frequencies. When the null hypothesis is true the test statistic X² has a Chi squared (χ²) distribution.
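As an illustration (not part of the original notes), the sex-by-group table from the leg ulcer trial shown earlier can be tested in Python; scipy is assumed, and chi2_contingency returns the expected frequencies and the Pearson X² statistic described here.

# Pearson chi-squared test on the sex-by-group table (clinic vs home), illustrative only
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[43, 35],    # male:   clinic, home
                     [77, 78]])   # female: clinic, home

# correction=False gives the plain Pearson X2; the default applies Yates' continuity correction
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
# expected[i, j] = row total x column total / grand total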

You can choose a Chi squared test by pressing the Statistics... button and selecting Chi- square.

The use of the Chi squared distribution for the test statistic X² is based on a ‘large sample’ approximation. In order for the Chi squared test to be valid, the guidelines are that 80% of the cells in the table should have expected frequencies greater than five, and all cells should have expected frequencies greater than one.

To improve the approximation for a 2 x 2 table, Yates’ correction for continuity is sometimes applied. (This involves reducing the absolute value of each O − E difference by 0.5 before squaring.)

If we have a table with too many small expected frequencies, we should find some sensible way to combine some of the categories in the row and/or column variables. There is also a special method known as Fisher’s exact test for 2 x 2 tables with very small expected frequencies. SPSS calculates Fisher’s exact test if any expected cell value in a 2 x 2 table is less than five (e.g. group by pregnant in the GvHD data).
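For completeness, Fisher's exact test is also available outside SPSS. The sketch below is an illustration only: the counts are made up (we do not reproduce the GvHD counts here), so treat it as a template rather than an analysis of the course data.

    # Fisher's exact test for a 2 x 2 table with small expected counts.
    # The counts below are hypothetical, for illustration only.
    from scipy.stats import fisher_exact

    table = [[2, 10],    # group 1: event, no event
             [7,  5]]    # group 2: event, no event
    odds_ratio, p_value = fisher_exact(table, alternative='two-sided')
    print(odds_ratio, p_value)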

When we wish to compare frequencies among groups that have an ordering (e.g. sex by age category in the cystic fibrosis data) we should make use of the ordering to increase the power of the analysis. The chi-squared test assesses departure of the observed data from the null hypothesis that the groups are the same, but in no particular manner. When the groups are ordered we usually expect any difference among the groups to be related to the ordering. We can subdivide variation among the groups into that due to a trend in proportions across the groups and the remainder. The Mantel-Haenszel test for linear association yields a test statistic from a Chi squared distribution with one degree of freedom, and its value will always be less than the X² for the overall comparison. If most of the variation is due to a trend across the groups, then the test for linear association will yield a much smaller P value. In SPSS the Mantel-Haenszel test is labelled as “Linear-by-linear Association” in the Chi Squared Test output.
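SPSS does not show the working behind this statistic, but it can be reproduced as (N − 1)r², where r is the Pearson correlation between the row and column scores and N is the total sample size. The sketch below illustrates this on a small table of made-up counts (the counts and scores are assumptions, chosen only to show the calculation).

    # Linear-by-linear (Mantel-Haenszel trend) statistic for an ordered table.
    # Statistic = (N - 1) * r^2, where r is the Pearson correlation between the
    # row and column scores; it is referred to a chi-squared distribution on 1 df.
    import numpy as np
    from scipy.stats import pearsonr, chi2

    # Hypothetical counts: 2 groups (rows) by 3 ordered categories (columns)
    counts = np.array([[10, 20, 30],
                       [25, 20, 15]])

    # Expand the table into one (row score, column score) pair per subject
    rows, cols = counts.shape
    row_scores = np.repeat(np.arange(1, rows + 1), counts.sum(axis=1))
    col_scores = np.concatenate([np.repeat(np.arange(1, cols + 1), counts[i])
                                 for i in range(rows)])

    r, _ = pearsonr(row_scores.astype(float), col_scores.astype(float))
    n = counts.sum()
    linear_by_linear = (n - 1) * r ** 2
    p_value = chi2.sf(linear_by_linear, df=1)
    print(linear_by_linear, p_value)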

Although the Mantel-Haenszel statistic is displayed whenever chi-squared is requested, it should not be used for nominal data.

Page 101

A different approach to frequency data from two ordered groups is to treat the data as two samples of observations on an ordinal scale. We can give ranks 1, 2, 3,…, etc. to the ordered groups, and then compare the ranks for the subjects with or without the characteristic of interest using the Mann-Whitney test (described in section 5.5). There are, of course vast numbers of tied ranks in data of this type because there are few different values, so it is essential to use the version of the test with a correction for ties. In general the Mann-Whitney test gives a very similar answer to the Mantel-Haenszel test for linear association.

It is essential to realise that an observed association does not necessarily indicate a causal relation between variables. The size of X² (or P) does not indicate the strength of the association, but rather the strength of evidence against the null hypothesis of no association.

5.7 Comparing groups - continuous data: two independent samples t test

The most common statistical analyses are probably those used for comparing two independent groups of observations. The independent samples t test is used to test the null hypothesis that the two group population means are equal. The test statistic is obtained from the mean difference divided by the standard error of the difference and compared with the t distribution with the appropriate degrees of freedom.

The two independent samples t-test assumes:
• two ‘independent’ groups;
• a continuous outcome variable;
• outcome data in both groups are Normally distributed;
• outcome data in both groups have similar standard deviations.

We can check that the conditions are met by:

• Plotting two histograms, one for each group, to assess Normality – the distribution does not have to be perfect, just roughly symmetric.
• Calculating the standard deviations – one should be no more than twice the other.

However, the t-test is very robust to violations of the assumptions of Normality and equal variances, particularly for moderate to large sample sizes.

Note that you can check these assumptions in SPSS by selecting Analyze, Descriptive Statistics, Explore... and then the Plots… button choosing the Histogram and Normality Plots with tests options (Figures 5.4 and 5.5).

Page 102 Figure 5.9: Using Explore to examine the data and check the assumptions for a t-test

Figure 5.10: Checking the assumptions for a t-test with histograms and Normal plots

If we believe the assumptions have been satisfied then we can now proceed with a two independent samples t-test, to test the equality of means in the two groups.

For example, in the cystic fibrosis dataset, do the male and female patients have similar mean ages, mean weights and mean heights?

Page 103 To do this select the Compare Means from the Analyze menu. Then from the submenu that comes up select Independent-Samples T Test.... This will bring up the following dialogue box:

Figure 5.11

(In our example the grouping variable is SEX coded 0 and 1).

Once you have selected the Grouping variable you need to click on Define Groups... to bring up the following dialogue box:

Figure 5.12

Page 104

If you do not do this, SPSS does not know how the two groups are defined by the grouping variable, and thus will not be able to carry out a t test.

Figure 5.13: SPSS output from the two independent samples t-test

Group Statistics

Leg ulcer free time (weeks)

Group     N     Mean      Std. Deviation   Std. Error Mean
Clinic    120   20.0821   18.49067         1.68796
Home      113   14.2035   17.56148         1.65204

Independent Samples Test

Leg ulcer free time (weeks)

                               Levene's Test           t-test for Equality of Means
                               F       Sig.    t       df        Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% CI Lower   95% CI Upper
Equal variances assumed        1.632   .203    2.485   231       .014              5.87860           2.36555                 1.21779        10.53942
Equal variances not assumed                    2.489   230.982   .014              5.87860           2.36188                 1.22503        10.53218

Interpretation of the output: I usually ignore Levene’s Test for equality of variances and look at the p-value from Sig. (2-tailed) with equal variances assumed. This gives us the probability of observing the test statistic (t = 2.485), or one more extreme, under the null hypothesis, i.e. p = 0.014.

What does P = 0.014 mean? Your results are unlikely when the null hypothesis is true.

Is this result statistically significant? The result is statistically significant because the P-value is less than the significance level (α) set at 5% or 0.05.

You decide? - That there is sufficient evidence to reject the null hypothesis and accept the alternative hypothesis that there is a difference in mean ulcer free weeks between the Clinic (intervention) and Home (control) groups.

A better test is to assume the variances are not equal and use the unequal variances t-test. This uses a slightly different Standard Error and degrees of freedom.
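The whole of Figure 5.13 can be reproduced from the group summary statistics alone, because the t statistic is simply the mean difference divided by its standard error. The sketch below (Python with scipy, offered only as a check and not as part of the SPSS procedure) uses the numbers from the Group Statistics table above.

    # Two independent samples t-test reproduced from the group summary statistics
    # in Figure 5.13 (leg ulcer free time, Clinic vs Home).
    from scipy.stats import ttest_ind_from_stats

    # Pooled version, as in the "Equal variances assumed" row
    t_eq, p_eq = ttest_ind_from_stats(mean1=20.0821, std1=18.49067, nobs1=120,
                                      mean2=14.2035, std2=17.56148, nobs2=113,
                                      equal_var=True)

    # Welch version, as in the "Equal variances not assumed" row
    t_uneq, p_uneq = ttest_ind_from_stats(mean1=20.0821, std1=18.49067, nobs1=120,
                                          mean2=14.2035, std2=17.56148, nobs2=113,
                                          equal_var=False)
    print(t_eq, p_eq)      # compare with t = 2.485, p = .014 in the output above
    print(t_uneq, p_uneq)  # compare with t = 2.489, p = .014 in the output above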

5.8 Non-parametrics - Mann-Whitney U test

The use of the independent samples t test is based on the assumption that the data for each group have an approximately Normal distribution and similar variances. When at least one of these requirements is not met we can either try a transformation of the data (e.g. the logarithm transformation) or use a non-parametric method. Non-parametric or distribution free methods do not involve distributional assumptions.

Page 105 There is a non-parametric alternative to the t test for comparing data from two independent groups. There are two derivations of the test, one due to Wilcoxon and the other to Mann and Whitney. It is better to call the method the Mann-Whitney U test (as SPSS does) to avoid confusion with the paired test due to Wilcoxon.

The Mann-Whitney test requires all the observations to be ranked as if they were from a single sample. We can now use two alternative test statistics, U and W. The statistic W (due to Wilcoxon) is simply the sum of the ranks in the smaller group (SPSS takes the first group if they are the same size). The statistic U (due to Mann and Whitney) is more complicated. U is the number of all possible pairs of observations comprising one from each sample for which the value in the first group precedes a value in the second group.
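The definition of U is easiest to see on a tiny made-up example: count, over all pairs made of one observation from each group, how often the first-group value falls below (or above) the second-group value. The sketch below (Python with scipy; the data are invented purely for illustration) does this by brute force and then checks against the library routine.

    # Mann-Whitney U illustrated by brute force on a small made-up sample.
    from scipy.stats import mannwhitneyu

    group1 = [3, 5, 7, 9]
    group2 = [4, 6, 6, 8, 10]

    # Count pairs (x from group1, y from group2); ties count a half.
    u_first_below = sum(1.0 if x < y else 0.5 if x == y else 0.0
                        for x in group1 for y in group2)
    u_first_above = len(group1) * len(group2) - u_first_below

    # The two counts always sum to n1 * n2; one of them is the U statistic
    # reported by the software.
    u_scipy, p_value = mannwhitneyu(group1, group2, alternative='two-sided')
    print(u_first_below, u_first_above, u_scipy, p_value)

As a check on the relationship between W and U, note from Figure 5.16 below that W = 12037 is the rank sum of the Home group (n = 113), and 12037 − 113 × 114/2 = 5596, which is the Mann-Whitney U that SPSS reports.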

The Mann-Whitney U test is obtained by selecting Analyze, Nonparametric Tests, 2 Independent Samples....

Figure 5.14

Figure 5.15

Page 106

Figure 5.16: SPSS output from Mann-Whitney U test

NPar Tests Mann-Whitney Test

Ranks

Leg ulcer free time (weeks)

Group     N     Mean Rank   Sum of Ranks
Clinic    120   126.87      15224.00
Home      113   106.52      12037.00
Total     233

Test Statisticsa

                            Leg ulcer free time (weeks)
Mann-Whitney U              5596.000
Wilcoxon W                  12037.000
Z                           -2.388
Asymp. Sig. (2-tailed)      .017
a. Grouping Variable: Group

Interpretation of the output The probability of observing the test statistic (Z = -2.388) or more extreme under the null hypothesis is p = 0.017.

What does P = 0.017 mean? Your results are unlikely when the null hypothesis is true.

Is this result statistically significant? The result is statistically significant because the P-value is less than the significance level (α) set at 5% or 0.05.

You decide? - That there is sufficient evidence to reject the null hypothesis and accept the alternative hypothesis that there is a difference in ulcer free weeks between the Clinic (intervention) and Home (control) groups.

5.9 Comparison of two paired groups - continuous outcome data: paired t test

When we have more than one group of observations it is vital to distinguish the case where the data are paired from that where the groups are independent. Paired data arise when the same individuals are studied more than once, usually in different circumstances.

For example, in the leg ulcer dataset, HRQoL was assessed three times: at baseline, 3 months and 12 months follow-up.

Also, when we have two different groups of subjects who have been individually matched, for example in matched pair case-control study, then we should treat the data as paired.

Another common statistical analysis involves the comparison of the responses of one group of subjects under two different conditions. For example, from the leg ulcer trial we may be interested in answering the following research question:

Page 107

Is there a change in HRQoL between baseline and three months follow-up in those patients whose leg-ulcer had healed at three months?

Note that in this case the two groups are paired and not independent (measurements are made on the same individuals at baseline and 3 months). Therefore, we are interested in the mean of the differences not the difference between the two means (not independent groups), i.e. the mean of (ghp0 - ghp3). We call this the paired difference, di. SPSS calculates this value for us automatically when we carry out a paired t-test.

Assumptions for the paired t-test: the di’s are plausibly Normally distributed (note that it is not essential for the original observations to be Normally distributed), and the di’s are independent of each other. Again, we can check the assumption of Normality in SPSS using Analyze, Descriptive Statistics, Explore... and then the Plots… button, choosing the Histogram and Normality Plots with tests options. However, first of all, we must calculate a new variable for the differences, say dghp, using the Transform, Compute Variable menu.
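As an illustration of what the paired t-test is doing, the sketch below (Python with scipy, not part of the SPSS procedure) computes the differences and tests them against zero. The ghp0 and ghp3 values here are made-up numbers standing in for the SPSS variables of the same names.

    # Paired t-test on the differences di = ghp0 - ghp3.
    # The values below are hypothetical; in the dataset ghp0 and ghp3 are the
    # baseline and 3-month SF-36 general health scores for the same patients.
    import numpy as np
    from scipy.stats import ttest_rel, ttest_1samp

    ghp0 = np.array([70.0, 55.0, 82.0, 60.0, 45.0, 77.0])   # baseline
    ghp3 = np.array([62.0, 50.0, 80.0, 48.0, 47.0, 70.0])   # 3 months

    d = ghp0 - ghp3                                   # the paired differences di
    t_paired, p_paired = ttest_rel(ghp0, ghp3)

    # Equivalent: a one-sample t-test of the differences against zero
    t_onesample, p_onesample = ttest_1samp(d, popmean=0.0)
    print(d.mean(), t_paired, p_paired, t_onesample, p_onesample)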

There is a non-parametric alternative to the paired t test called the Wilcoxon signed rank sum test. The Wilcoxon paired test is obtained by selecting Analyze, Nonparametric Tests, 2 Related Samples....

Figure 5.17: Computing a new variable for the paired difference

Page 108 Figure 5.18: Selecting only those patients whose leg ulcer had healed by 3 months

Figure 5.19: Selecting only those patients whose leg ulcer had healed by 3 months

Page 109 Figure 5.20: Paired t-test option in SPSS

Figure 5.21: The paired t-test menu

Page 110 Figure 5.22. Output from T-Test procedure

Paired Samples Statistics

Pair 1                                        Mean      N    Std. Deviation   Std. Error Mean
SF-36 GENERAL HEALTH DIMENSION: baseline      66.2778   36   18.83251         3.13875
SF-36 GENERAL HEALTH DIMENSION: 3 months      58.9444   36   21.95052         3.65842

Paired Samples Correlations

Pair 1: SF-36 GENERAL HEALTH DIMENSION: baseline & SF-36 GENERAL HEALTH DIMENSION: 3 months
N             36
Correlation   .681
Sig.          .000

Paired Samples Test

Pair 1: SF-36 GENERAL HEALTH DIMENSION: baseline - SF-36 GENERAL HEALTH DIMENSION: 3 months

Paired Differences
Mean      Std. Deviation   Std. Error Mean   95% CI Lower   95% CI Upper   t       df   Sig. (2-tailed)
7.33333   16.53222         2.75537           1.73963        12.92703       2.661   35   .012

Interpretation of the output

What does P = 0.012 mean? Your results are unlikely when the null hypothesis is true.

Is this result statistically significant? The result is statistically significant because the P-value is less than the significance level (α) set at 5% or 0.05.

You decide? - That there is sufficient evidence to reject the null hypothesis and accept the alternative hypothesis that there is a difference or change in mean HRQoL between baseline and 3 months follow-up in patients whose leg ulcer had healed by 3 months.

5.10 Correlations

To see if two continuous variables (such as height and weight) are related or associated, a correlation coefficient is used. To obtain this in SPSS select Correlate from the Analyze menu.

Page 111 Figure 5.23

Then from the sub-menu we select Bivariate..., which will bring up the following dialogue box:

Figure 5.24

When performing a hypothesis test or constructing a confidence interval for the Pearson (r) correlation coefficient it is preferable that both variables have an approximate Normal distribution in the population. If the data do not have a Normal distribution either one or both of the variables can be transformed, or a non-parametric correlation coefficient can be calculated. Spearman’s (rs) is a non-parametric version of the Pearson (r) correlation coefficient, and Kendall’s tau-b (τb) is a non-parametric measure of association for ordinal (ordered categorical) variables.
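All three coefficients can also be obtained outside SPSS. The sketch below (Python with scipy) uses small made-up height and weight values purely to show the calls; it is an illustration, not an analysis of the course data.

    # Pearson, Spearman and Kendall correlation coefficients (illustration only;
    # the height/weight values are made up).
    from scipy.stats import pearsonr, spearmanr, kendalltau

    height = [1.62, 1.70, 1.75, 1.80, 1.68, 1.77]   # metres
    weight = [58.0, 69.0, 74.0, 82.0, 63.0, 80.0]   # kilograms

    r_pearson, p_pearson = pearsonr(height, weight)
    r_spearman, p_spearman = spearmanr(height, weight)
    tau_b, p_tau = kendalltau(height, weight)       # tau-b handles tied ranks
    print(r_pearson, r_spearman, tau_b)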

Page 112 All these correlation coefficients take values between -1 and +1, with the sign indicating the direction of the relationship and the numerical magnitude its strength. Values of -1.0 or +1.0 indicate that the sample values fall on a straight line. A value of zero indicates the lack of any linear relationship between the two variables.

It is also useful to show a scatter diagram of the data (from the Graphs menu select the option Scatter...).

Select ufw and ghp12 as the variables from the list and since the Pearson correlation coefficient is already ticked by default, click on OK. This leads to the following output.

Figure 5.25: Correlations

Correlations

                                                   SF-36 GENERAL HEALTH        Leg ulcer free
                                                   DIMENSION: 12 months        time (weeks)
SF-36 GENERAL HEALTH      Pearson Correlation      1                           .089
DIMENSION: 12 months      Sig. (2-tailed)          .                           .271
                          N                        155                         155
Leg ulcer free time       Pearson Correlation      .089                        1
(weeks)                   Sig. (2-tailed)          .271                        .
                          N                        155                         233

Interpretation

The estimated Pearson correlation coefficient, r, is 0.089.

What does P = 0.27 mean? Results like these are not unlikely when the null hypothesis is true.

Is this result statistically significant? The result is not statistically significant because the P-value is greater than the significance level (α) set at 5% or 0.05.

You decide? - That there is insufficient evidence to reject the null hypothesis; you therefore cannot conclude that there is any correlation between HRQoL at 12 month follow-up and ulcer free weeks.
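The P value that SPSS attaches to r comes from the usual t test for a correlation coefficient, t = r√(n − 2)/√(1 − r²) on n − 2 degrees of freedom. The short sketch below (Python with scipy) checks the value quoted in Figure 5.25.

    # Significance test for the Pearson correlation in Figure 5.25:
    # r = 0.089 based on n = 155 complete pairs.
    from math import sqrt
    from scipy.stats import t as t_dist

    r, n = 0.089, 155
    t_stat = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    p_value = 2 * t_dist.sf(abs(t_stat), df=n - 2)
    print(t_stat, p_value)   # p should be close to the .271 reported by SPSS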

As noted above, if the data do not have a Normal distribution a non-parametric correlation coefficient (Spearman’s rs, or Kendall’s tau-b (τb) for ordered categorical variables) can be calculated instead.

To obtain this in SPSS select from the Analyze menu Correlate, Bivariate then under the Correlation Coefficients option tick either the Kendall’s tau-b (tb) box or the Spearman box.

Page 113 5.11 Others

SPSS also has more sophisticated statistical techniques including: Analysis of Variance (ANOVA), multiple linear regression, multiple logistic regression, log linear analysis, multivariate analysis of variance, survival analysis (Kaplan-Meier, Actuarial & Cox regression), time series analysis and multivariate methods.

6 Presentation of results

Adequate description of the data should precede and complement formal statistical analysis. This can be achieved by graphical methods (such as scatter plots or histograms), or by using summary statistics. Continuous variables (such as age, weight, height) can be summarised using the mean and standard deviation (SD) or the median and a percentile range (say, the inter quartile range: 25th to 75th percentile). For ordered qualitative data (such as stage of disease I to IV) the calculation of means and standard deviations is incorrect; instead proportions should be reported.

If a t test has been used then the standard deviation (SD) of the data in each group should be given. If a paired t test is used the standard deviation of the differences between groups should be quoted. In addition it may be useful to construct one or more confidence intervals for means or differences between means.

For data analysed by a non-parametric method the median and selected centiles (e.g. 10th and 90th) should be given for each group if the raw data are not shown. For small samples the median and range can be given. For all analyses, it is good practice to quote the test statistic (t, F or χ²) as well as the P value derived from it. It should always be clear what the degrees of freedom are.

When presenting results for frequency data it is preferable to give all observed frequencies together with rounded percentages or proportions. It is useful to give percentages to allow a quick visual appraisal of variation among groups of varying size. The test statistic χ², the degrees of freedom and the P value should all be quoted when reporting Chi squared tests.

When presenting correlation results it is useful to show a scatter diagram of the data. The number of observations should be stated, the value of correlation coefficient, r should be given to two decimal places, together with the P value if a test of significance is performed.

P values are conventionally given as < 0.05, < 0.01, or < 0.001, but there is no reason other than familiarity for using these particular values. Exact P values (to no more than two significant figures) such as P = 0.18 or 0.03, are more helpful. It is unlikely to be necessary to specify levels of P lower than 0.0001. Calling any value with P > 0.05 “not significant” is not recommended, as it may obscure results that are not quite statistically significant but do suggest a real effect. P values given in tables need not be repeated in the text.

When presenting means, standard deviations, and other statistics bear in mind the precision of the original data. Means should not normally be given to more than one decimal place more than the raw data, but standard deviations or standard errors may need to be quoted to more than one extra decimal place. It is rarely necessary to quote percentages to more than one decimal place, and even one decimal place is often not needed. With samples of less than 100 the use of decimal places implies unwarranted precision and should be avoided. It is sufficient to quote values of t, F, r and χ² to two decimal places.

For the cystic fibrosis data an example of the textual presentation of the results might read:

Page 114

The mean age of the sample of 14 male cystic fibrosis patients was 15.2 years (SD 5.9) compared with a mean age of 13.5 (SD 3.8) years for the sample of 11 female cystic fibrosis patients. The difference between the sample mean ages of males and females with cystic fibrosis was 1.7 years, with a 95% confidence interval from -2.6 to 5.9 years; the t test statistic was 0.81, with 23 degrees of freedom and an associated P value of 0.43.
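These figures can be reproduced from the summary statistics alone. The sketch below (Python with scipy, offered purely as a check on the arithmetic) uses the sample sizes, means and standard deviations quoted above; small discrepancies against the quoted values arise only from rounding of the summary statistics.

    # Reproducing the cystic fibrosis age comparison from the summary statistics:
    # males n=14, mean 15.2, SD 5.9; females n=11, mean 13.5, SD 3.8.
    from math import sqrt
    from scipy.stats import t as t_dist

    n1, m1, s1 = 14, 15.2, 5.9
    n2, m2, s2 = 11, 13.5, 3.8

    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se_diff = sqrt(pooled_var) * sqrt(1 / n1 + 1 / n2)

    diff = m1 - m2
    t_stat = diff / se_diff
    ci_low = diff - t_dist.ppf(0.975, df) * se_diff
    ci_high = diff + t_dist.ppf(0.975, df) * se_diff
    p_value = 2 * t_dist.sf(abs(t_stat), df)
    # Compare with the values quoted above: difference 1.7 years,
    # 95% CI -2.6 to 5.9 years, t = 0.81 on 23 df, P = 0.43.
    print(diff, se_diff, t_stat, (ci_low, ci_high), p_value)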

6.1 Presentation of results: Numerical Precision

Spurious precision adds no value to a paper and even detracts from its readability. Results obtained from a computer should be rounded. When presenting the means, standard deviations and other statistics the author should bear in mind the precision of the original data. Means should not normally be given to more than one decimal place more than the raw data, but standard deviations or standard errors may need to be quoted to one extra decimal place. It is rarely necessary to quote percentages to more than one decimal place, and even one decimal place is often not needed. With samples of less than 100 the use of a decimal place implies unwarranted precision and should be avoided.

7 Further reading

For more detail on SPSS for Windows please consult the following software manuals:

SPSS for Windows Base System User’s Guide Release 11.0. SPSS for Windows Professional Statistics Release 11.0. SPSS for Windows Advanced Statistics Release 11.0. SPSS 11.0 Syntax Reference Guide.

The latest news on software developments from SPSS is available on the Internet at the following website: http://www.spss.com/ .

A good book specifically about SPSS:

Argyrous G. (2000) Statistics for Social & Health Research with a guide to SPSS. Sage Publications, London.

Page 115 Displaying and tabulating data lecture handout

Displaying and Tabulating Data
Dr Jenny Freeman, University of Sheffield

Aims & Objectives
• Be aware of Tufte’s principles for displaying data (Tufte, 1983)
• Understand about different types of data
• Gain knowledge of basic summary statistics
• Gain knowledge of basic graphical techniques
• Develop awareness of good practice when tabulating and graphing data and when designing PowerPoint slides

Content
• Tufte’s principles
• Types of data
• Displaying data using charts
• Summary measures
• Summarising data using tables
• Guidelines for good practice, including for PowerPoint presentations

Why is this course necessary?
• All the medical sciences (physiology, biochemistry, psychology etc.) use quantitative data. It is important to be able to manipulate and summarise these data correctly
• For research to be worthwhile, data must be presented meaningfully and correctly interpreted

Tufte’s Principles
• Above all else show the data
• Maximise the data ink ratio, within reason
• Erase non-data-ink, within reason
• Erase redundant data-ink
• Revise and edit

Dot plot of height for leg ulcer patients
[Dot plot: height in metres (1.4 to 2.0) on the vertical axis, shown separately for Men (n=77) and Women (n=145)]

Page 116

Introduction to displaying data
• When deciding how to display data ask yourself two questions:
  − What is it that I want to say?
  − Does my chosen method/chart/table actually show what I want it to show?

Introduction to displaying data
• Know your audience
• Decide what you want to present. Are you presenting data or are you presenting results?
• Tables are good for quantification
• Charts are good for illustrating specific points

Types of Data: Categorical
• Nominal (no natural ordering)
  – Blood group
  – Gender
• Ordered categorical
  – Grades of breast cancer
  – Better, same, worse

Types of Data: Quantitative/numerical
• Count (can only take certain values)
  – Number of children in family
• Continuous (limited only by accuracy of instrument)
  – Height in cm
  – Weight in kg

Exercise 1
Classify the following data as quantitative (discrete/continuous) or qualitative (nominal/ordinal/binary):
• Age
• Marital status
• Blood pressure
• Number of visits to GP per year
• Number of decayed, missing or filled teeth
• Cholesterol level
• Study group

The dataset
• RCT of cost effectiveness of community leg ulcer clinics.
• 233 patients with venous leg ulcers randomly allocated to either usual care at home by the district nursing team (control group, n=113) or weekly treatment with four layer bandaging in a specialist leg ulcer clinic (intervention group, n=120).
• Outcomes of interest include relative costs for each group, time to complete ulcer healing, patient health status, recurrence of ulcers, satisfaction with care and use of services.
Morrell CJ, Walters SJ, Dixon S, Collins KA, Brereton LML, Peters J, Brooker CGD. (1998) Cost effectiveness of community leg ulcer clinics: randomised controlled trial. British Medical Journal 316: 1487-1491.

Page 117

Displaying and Summarising Categorical Data
Suppose that we want to check that similar numbers of patients have been randomised to the control and intervention group.

Displaying categorical data
• For categorical variables such as sex and marital status it is straightforward to present the number in each category or express it as a percentage of the total number of patients.
• Can use either a bar chart or a pie chart to display these data graphically.
• Always give sample sizes
• Avoid 3-D charts
• Only use pie charts when the number of categories is low (< 5)

Summarising categorical data
The distribution of patients between the two study groups:

                 n      %
Intervention     120    51.5     (e.g. 120/233 x 100% = 51.5%)
Control          113    48.5
Total            233    100

Bar chart of marital status for the leg ulcer patients (n=233)
[Bar chart: percent on the vertical axis; categories Missing, Married, Single, Div/Sep, Widowed]

Example of 2-D versus 3-D bar chart
[Figure 1: 2-D bar chart of marital status (recommended); Figure 2: 3-D bar chart of marital status (not recommended); both show percent against the categories Missing, Married, Single, Div/Sep, Widowed]

Example of 3-D bar chart, with patterns (definitely not recommended)
[3-D patterned bar chart of the same marital status data]

Page 118

One final modification: order categories by largest first
[Bar chart of marital status with the categories ordered Married, Widowed, Single, Div/Sep, Missing]

Stacked bar chart showing relationship between maternal age and breastfeeding
[Stacked bar chart taken from: Use of evidence based leaflets to promote informed choice in maternity care: randomised controlled trial in everyday practice. A O’Cathain, S J Walters, J P Nicholl, K J Thomas, M Kirkham. BMJ 2002;324:643-647]

Pie chart showing marital status for the leg ulcer patients (n=233)
[Pie chart of marital status]

Displaying quantitative data
• Stem & leaf plots
• Dot plots
• Histograms
• Box & whisker plots
• Scatterplots

Heights of male leg ulcer patients (n=76)

187 177 193 172 185 175 177 177 170 165
172 177 177 177 177 172 177 177 190 177
172 182 185 182 172 182 177 182 170 157
172 172 175 182 175 185 187 187 187 172
172 177 187 180 167 170 170 182 170 162
162 185 177 177 180 180 177 172 180 180
177 180 175 177 177 167 182 165 187 180
177 172 172 175 170 180

Stem & leaf plot for heights of male leg ulcer patients

Frequency    Stem & Leaf
 1.00        Extremes (=<1.57)
 3.00        16 . 222
 4.00        16 . 5577
18.00        17 . 000000222222222222
24.00        17 . 555557777777777777777777
15.00        18 . 000000002222222
10.00        18 . 5555777777
 1.00        19 . 0
 1.00        Extremes (>=1.93)

Stem width: 0.10   Each leaf: 1 case(s)

Page 119

[The next three slides repeat the table of heights and build the stem & leaf plot up one observation at a time: the first height, 187, becomes leaf 7 on stem 18; the next, 177, becomes leaf 7 on stem 17; the third, 193, becomes leaf 3 on stem 19; and so on until the complete plot above is obtained.]

Page 120

Dot plot of height for leg ulcer patients
[Dot plot of the same heights, shown alongside the completed stem & leaf plot above]

Histogram of height for the leg ulcer patients, sexes combined
[Histogram: frequency against height in metres]
• No spaces between bars – distinguish from bar-chart.
• Use equal sized intervals.
• Number of intervals (bins) should be between 5 and 15, so can display ‘shape’ without ‘noise’.
• Always give sample size.

Histograms of height, by sex, for the leg ulcer patients
[Two histograms, one for men and one for women: frequency against height in metres]

Scatterplot
Figure: Relationship between teenage pregnancy rates and a composite deprivation score – Local Authorities 1999-2001
[Scatterplot: rate (per 1000 women aged 15-17) against composite deprivation score (higher values indicate increased deprivation)]
Ref: www.empho.org.uk/whatsnew/teenage-pregnancy-presentation.ppt

Table of systolic blood pressure levels for 16 patients before and after exercise session:

Subject    Systolic blood pressure (mmHg)    Difference
number     Before        After               (After - Before)
1          148           152                  4
2          142           152                 10
3          136           134                 -2
4          134           148                 14
5          138           144                  6
6          140           136                 -4
7          132           144                 12
8          144           150                  6
9          128           146                 18
10         170           174                  4
11         162           162                  0
12         150           162                 12
13         138           146                  8
14         154           156                  2
15         126           132                  6
16         116           126                 10

Page 121

Box and whisker plot of height for the leg ulcer patients
• The box illustrates the interquartile range and thus contains the middle 50% of the data.
• The median is shown by the horizontal line across the box.
• The whiskers extend to the largest & smallest values excluding the outlying values. The outlying values are those values more than 1.5 box lengths from the upper or lower edges. Those observations between 1.5 and 3 box lengths from the upper or lower edges of the box are outliers, whilst those more than 3 box lengths away are called extreme values.
• Very useful when comparing several sets of data.

Exercise 2
Display the systolic blood pressure of the sample of 16 study participants before exercise using:
(a) Stem & leaf plot
(b) Dot plot

Example of CONSORT style flow diagram of patient numbers
[Flow diagram taken from ‘Costs and effectiveness of community postnatal support workers: randomised controlled trial’. Morrell et al, BMJ 2000; 321:593-598]

Summarising Numerical Data
• Graphs are a useful starting point as they give us a ‘feel’ for the data and show how the data are distributed.
• We need to summarise the data and we are often interested in: What’s the average value? What’s the spread of the data?
• A measure of location (average) and variability (spread) provides an informative but brief summary of a set of observations.

Measures of location
Mode – the most common observation.
Median – the middle observation, when the data are arranged in order of increasing value. If there is an even number of observations (e.g. with 50 results the midpoint falls between the 25th and 26th) the median is calculated as the average of the two middle observations.
Mean – the sum of all observations divided by the number of observations.

Measures of location: Mode
• The simplest measure of location is the mode, which is simply the most common value observed:
  – e.g. for the BP data the mode = 138 mmHg.

Page 122

Measures of location: Median
• Order the observations – the median is the middle observation.
• Odd numbers of observations will have a unique median.
• When there is an even number of observations there is strictly no middle observation – take the mean of the two middle observations.

Calculating the median for the blood pressure data
• As the number of observations is even (n=16), the median is the average of the two central values (the 8th and 9th).
• So the median blood pressure before exercise is (138+140)/2 = 139 mmHg.

BP (mmHg)   Rank
116          1
126          2
128          3
132          4
134          5
136          6
138          7
138          8
140          9
142         10
144         11
148         12
150         13
154         14
162         15
170         16

Measures of location: Mean
Given n observations x1, x2, ..., xn, the mean is

    x̄ = (x1 + x2 + ... + xn) / n = (sum of the xi from i = 1 to n) / n

For the blood pressure data:

    x̄ = (148 + 142 + 136 + ... + 116) / 16 = 2258 / 16 = 141.1 mmHg

Pros and cons of mean/median/mode
• Median robust to outliers.
• Median/mode reflects what ‘most’ people experience.
• Mean uses all the data (more ‘efficient’).
• Mean is the ‘expected’ value.
• Mean more common with statistical tests.
• Mode useful for grouped or categorical data.

Quantifying variability: measures of spread
• Need a numerical way of summarising the amount of variability in a data set.
• Three main approaches to quantifying the variability:
  – Range
  – Inter quartile range
  – Standard deviation

Measures of spread
Range – minimum observation to maximum observation.
Interquartile range – the observation below which the bottom 25% of data lie and the observation above which the top 25% of data lie. NB: If the value falls between two observations, e.g. if the 25th centile falls between the 5th and 6th observations, then the value is calculated as the average of the two observations (the same principle as for the median).
Standard deviation (SD) – the average distance of the observations from the mean value (NB: Variance = SD squared).

Page 123

Range
• The simplest way to describe the spread of a data set is to quote the minimum (lowest) and maximum (highest) value.
• e.g. the range for the BP data was 116 to 170 mmHg, or as a single number 54 mmHg.
• Affected by extreme values at each end of the data.

Inter quartile range
• Split the data set into four equal parts – quartiles: lower quartile (25th centile or Q1), median (50th centile or Q2), upper quartile (75th centile or Q3).
• The inter quartile range (IQR) tells you where the middle 50% of your data lies: IQR = upper quartile - lower quartile.
• A graphical way of summarising data using percentiles is the box & whisker plot.

Variance
• Based on the idea of averaging the distance each value is away from the mean, μ.
• For an individual with an observed value xi the distance from the mean is xi - μ.
• With n observations we have a set of n such differences, one for each individual.
• The sum of these distances, Σ(xi - μ), is always zero; however, if we square the distances before we sum them we get a positive quantity.
• The average of these squared differences thus gives a measure of deviation from the mean.
• This quantity is called the variance, and is defined as:

    variance = Σ(xi - x̄)² / (n - 1)

Standard deviation
• The variance is not a suitable measure for describing variability because it is not in the same units as the raw data (we don’t want the variability of our set of blood pressure measurements expressed in square mmHg).
• The solution is to take the square root of the variance – the standard deviation (usually abbreviated to SD, s or σ), defined as:

    s = √[ Σ(xi - x̄)² / (n - 1) ]

Calculating the inter quartile range for the blood pressure data
• When the quartile lies between 2 observations the easiest option is to take the mean (there are more complicated methods).
• Lower quartile is 133 mmHg.
• Upper quartile is 149 mmHg.
• IQR is 133 to 149 mmHg, or as a single number 16 mmHg.

Calculation of variance and standard deviation for blood pressure data

Subject   Blood pressure (mmHg)   Difference from mean   Squared difference
1         148                       6.88                    47.27
2         142                       0.88                     0.77
3         136                      -5.13                    26.27
4         134                      -7.13                    50.77
5         138                      -3.13                     9.77
6         140                      -1.13                     1.27
7         132                      -9.13                    83.27
8         144                       2.88                     8.27
9         128                     -13.13                   172.27
10        170                      28.88                   833.77
11        162                      20.88                   435.77
12        150                       8.88                    78.77
13        138                      -3.13                     9.77
14        154                      12.88                   165.77
15        126                     -15.13                   228.77
16        116                     -25.13                   631.27
Total     2258                      0                      2783.75

Mean = 141.13 mmHg; Variance = 185.58 mmHg²; Standard deviation = 13.62 mmHg
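The same summaries can be checked in a few lines of code. The sketch below (Python with numpy, offered only as a check) uses the 16 before-exercise blood pressures from the table above; note that software may interpolate percentiles slightly differently from the simple ‘average the two nearest observations’ rule used on the slide.

    # Summary statistics for the 16 'before exercise' systolic blood pressures.
    import numpy as np

    bp = np.array([148, 142, 136, 134, 138, 140, 132, 144,
                   128, 170, 162, 150, 138, 154, 126, 116], dtype=float)

    mean = bp.mean()                         # about 141.1 mmHg
    median = np.median(bp)                   # 139 mmHg
    q1, q3 = np.percentile(bp, [25, 75])     # lower and upper quartiles
    data_range = bp.max() - bp.min()         # 170 - 116 = 54 mmHg
    variance = bp.var(ddof=1)                # sample variance, divisor n - 1
    sd = bp.std(ddof=1)                      # standard deviation, about 13.6 mmHg
    print(mean, median, (q1, q3), data_range, variance, sd)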

Page 124

Why use mean and SD?
• For many variables in health sciences:
  – mean ± SD covers 68% of the distribution.
  – mean ± 2 SDs covers about 95% of the distribution.
• The mean ± 2 SDs is called the ‘normal range’ or the reference range.

The Normal distribution
• Bell shaped and symmetrical.
• 68% of the observations lie within 1 SD of the mean.
• About 95% of the observations lie within 2 SDs of the mean.
• Mean and median will coincide.

Positively skewed distribution [curve with mode < median < mean]
Negatively skewed distribution [curve with mean < median < mode]

Histogram of age, showing negative skew
[Histogram: frequency against age (years)]

Histogram of weight, showing positive skew
[Histogram: frequency against weight in kilograms]

Choosing appropriate summary measures
• Choosing the most appropriate summary measures for a set of observations depends on the shape of their distribution.
• If symmetrical use the mean and standard deviation.
• If skewed the median and inter quartile range are more appropriate, as they are less influenced by the extreme values.

Page 125

And now, what of good practice? ……….

Tufte’s Principles
• Above all else show the data
• Maximise the data ink ratio, within reason
• Erase non-data-ink, within reason
• Erase redundant data-ink
• Revise and edit

Guidelines for good practice when constructing figures (1)
• The amount of information should be maximised for the minimum amount of ink
• Figures should have a title clearly explaining what is being displayed
• Axes should be clearly labelled and round numbers used effectively on the axes
• Always start at zero when graphing absolute numbers or standard bar charts
• The number of observations should be included

Guidelines for good practice when constructing figures (2)
• Limit the use of colour – this is particularly important if your original is to be photocopied
• Gridlines should be kept to a minimum, use only enough to aid interpretation
• Never use 3-D charts as these can be difficult to read
• Ask yourself would a table be better?

Example 1 / Example 1 continued
[Three versions of a chart of low contrast acuity (0.00 to 0.25) for Boys and Girls at Baseline, 30 Days and 90 Days]

Page 126

Example 2: data taken from ‘Presenting data’, Local Government unit Wales, p17 / Example 2 continued
[Graph A, Graph B and Graph C: the same annual series (thousands, 1993 to 2001) plotted in three different ways]

Guidelines for presenting numerical information (1)
• As far as possible numerical precision should be consistent throughout a paper or presentation
• In general summary statistics such as the mean should not be given to more than one decimal place more than the raw data
• This applies also to measures of variability or uncertainty such as the standard deviation or standard error, though greater precision may be warranted for further calculations
• Categorical data can be summarised as frequencies and percentages. It can be confusing giving percentages alone as the denominator may be unclear
Taken from Altman & Bland, BMJ, 312: pp 572. 1996

Guidelines for presenting numerical information (2)
• Test statistics such as values of t or χ² and correlation coefficients should be given to no more than two decimal places
• Confidence intervals are better presented as, say, 12.4 to 52.9 because the format 12.4-52.9 is confusing when one or both numbers are negative
• P-values should be given to one or two significant figures, even for non-significant results, as these may conceal important information
Taken from Altman & Bland, BMJ, 312: pp 572. 1996

Guidelines for good practice when constructing tables (1)
• The amount of information should be maximised for the minimum amount of ink – keep tables as simple as possible
• Tables should have a title clearly explaining what is being displayed
• The number of observations should be included
• Order columns and rows by size (unless there is an inherent order to the groups, e.g. grades of cancer)
• It is easier to read down columns than across and so where possible it is best to put groups to be compared in rows
• Right justify numbers in columns

Guidelines for good practice when constructing tables (2)
• Table layout should be such that it makes comparing relevant groups easy. It is easier to read down columns than across and so where possible it is best to put groups to be compared in rows
• Gridlines should be kept to a minimum as they can interrupt the flow of information – use space to separate data, not lines
• Show row totals to the right and column totals at the bottom of the table
• Round data in summary tables
• Ask yourself would a figure be better?

Page 127

Presenting data in tables: a made-up example (1)
• Parallel group trial
• Drug vs placebo
• Assess change in dementia ‘SAG’ rating scale
• Five visits, at 3 month intervals
• 100 patients per group
Taken from www.sbtc.ltd.uk

Results: what do you think?

                  Visit 1   Visit 2   Visit 3   Visit 4   Visit 5
Drug     Mean     50.4      47.8      47.9      46.1      45.0
         SD        3.91      3.88      3.90      4.07      3.94
Placebo  Mean     49.4      41.8      33.2      27.5      22.6
         SD        3.76      4.14      4.46      4.23      4.93

Results: what do you think?
• Too many lines – cluttered
• Hard to identify structure
• Title uninformative
• Not clear what numbers mean
• Decimal places add to clutter
• Hard to identify trends
• Visits can be clarified further
• No sample size

Results: now what do you think?

Visit       Drug (n=100)    Placebo (n=100)
            Mean (SD)       Mean (SD)
Baseline    50 (3.9)        49 (3.8)
Month 3     48 (3.9)        42 (4.1)
Month 6     48 (3.9)        33 (4.5)
Month 9     46 (4.1)        28 (4.2)
Final       45 (3.9)        23 (4.9)

Example 2: Number of unemployed by state, 1971-1974. Taken from ‘A primer in data reduction’, ASC Ehrenberg 1982, Wiley

UNEMPLOYED       Number (1,000)              As % of workforce
                 1971   1972   1973   1974   1971   1972   1973   1974
Alabama            75     62     62     78    5.5    4.5    4.5    5.5
Alaska             12     13     14     15   10.5   10.5   10.8   10.5
Arizona            32     32     34     49    4.7    4.2    4.1    5.6
Arkansas           40     36     34     40    5.4    4.6    4.1    4.8
California        737    652    615    670    8.8    7.6    7.0    7.3
Colorado           37     35     36     43    4.0    3.6    3.4    3.9
Connecticut       116    121     89     88    8.4    8.6    6.3    6.1
Delaware           13     11     12     15    5.7    4.7    4.6    6.1
D.C.               34     44     59     62    2.7    3.3    4.2    4.4
Florida           135    127    132    208    4.9    4.5    4.3    6.3
Georgia            76     83     81    109    3.9    4.1    3.9    5.1
Hawaii             21     25     24     27    6.3    7.3    7.0    7.6
Idaho              19     20     19     22    6.3    6.2    5.6    6.1
Illinois          240    245    203    223    5.1    5.1    4.1    4.5
Indiana           128    103    101    123    5.7    4.5    4.3    5.2

Example 2: Reorganising the table

UNEMPLOYED (’000)   1971   1972   1973   1974   Average
California           740    650    610    670     670
Illinois             240    250    200    220     230
Florida              130    120    130    210     150
Indiana              130    100    100    120     110
Connecticut          120    120     90     90     105
Georgia               76     83     81    110      88
Alabama               75     62     62     78      69
D.C.                  34     44     59     62      50
Colorado              37     35     36     43      38
Arkansas              40     36     34     40      38
Arizona               33     32     34     49      37
Hawaii                21     25     24     27      24
Idaho                 19     20     19     22      20
Alaska                12     13     14     15      14
Delaware              13     11     12     15      13
Average              110    110    100    120     110

Page 128

Figure or Table?
• The choice is not always obvious
• Tables are suitable for information about large numbers of variables at once
• Figures are good for showing multiple observations on individuals or groups
• Figures can be particularly useful when conveying information to an audience
• One point to consider when considering using a figure is the amount of numerical information contained. A figure displaying only two means with their standard errors or confidence intervals is a waste of space as a figure and either more information should be included or the summary values should be put in the text

Checklist
• Tables
  – Title
  – Numerical precision
  – Sample size
  – Gridlines
• Graphs
  – Title
  – Label axes
  – Report sample size
  – Gridlines

One final note: what about slides?
• Different formats lend themselves to different types of display
• Well designed slides and good visuals can be enormously useful, whilst poorly designed slides and inappropriate visuals can spoil an otherwise entertaining presentation

One final note: what about slides? (2)
• Slides should be used for illustrating key points, not for reading out aloud
• It is important to think like someone in the audience and consider what will be seen and heard
• Charts can be particularly useful as they can be read quickly and key points highlighted more quickly than with a table

Graphic design of slides
• Four key elements:
  − Text
  − Pictures and graphics
  − Colour
  − Space
• Design is about manipulating these four. They are all related and it is impossible to change one without it impacting on the others
• Designing layouts is not an exact science; what works in one situation will not necessarily work in another

Graphic design of slides: Text
• What typeface?
  − Sans serif (Arial)
  − Serif (Times New Roman / Garamond)
• Whilst serif fonts are highly legible on paper, they can be difficult for an audience to read
• Sans serif is easier for an audience to read

Page 129

Graphic design of slides: Text
• No more than two fonts should be used on one slide
• Text should not take up more than half the visible area
• As a good rule of thumb there should be no more than six words per line and six lines per slide

Graphic design of slides: Text
• All capitals are more difficult to read than a mixture of both upper and lower case
• In general it is best to use a font size of at least 28 points for titles and 18 points for the main body of text
• Text is best highlighted using spacing, italics or colours, as underlined or bold text is less easy to read

Graphic design of slides: pictures and graphics
• It can be easy to get carried away with fancy effects – ‘gee whiz’
• Use only those that are absolutely integral to the presentation, as anything else will detract from the information being presented
• However, used sparingly, pictures and clip art can liven up a presentation, particularly if the subject matter is rather dry

Graphic design of slides: Colour
• The two most important areas are:
  − Text
  − Background
• When projected onto a screen, light coloured text against a dark background works best
• No more than three colours should be used on one slide

Graphic design of slides: Space
• It is important not to overcrowd slides as they will look busy and difficult to read
• Space can be used to break up text. It is better to use separate slides than have lots of sub-headings and text on a single slide

Page 130

And finally….
• Important to keep the style of the slides, including text, colour and any graphic effects, consistent throughout a presentation
• Always have a conclusions slide. This should have 3 to 5 summary points, to focus on the ‘take-home’ message

Summary
• Always start by plotting your data to get a ‘feel’ for it (see the shape of the distribution of the data).
• For numerical data a measure of location (average) and variability (spread) are of interest when summarising observations.
• For categorical data present the number in each category or express it as a percentage of the total number of patients.

Summary
You should now:
• Be aware of Tufte’s principles for displaying data (Tufte, 1983)
• Understand about different types of data
• Have knowledge of basic graphical techniques
• Have knowledge of basic summary statistics
• Have an awareness of good practice when tabulating and graphing data, and when designing PowerPoint slides

Questions?

Resources
• Presenting numbers, tables and charts. S Bigwood & M Spore. OUP, Oxford, 2002
• The visual display of quantitative information. ER Tufte. Graphics Press, Cheshire, 1983
• How to lie with statistics. D Huff. Penguin Books, London, 1991
• A Primer in data reduction. ASC Ehrenberg. Wiley, Chichester, 2000
• Statistical Notes: presentation of numerical data. Altman DG, Bland JM. British Medical Journal, vol 312: pp 572. 1996

Formula for the mean

    x̄ = ( Σ from i=1 to n of xi ) / n

where
x̄ : the mean (x-bar)
Σ : Greek capital letter sigma, the summation symbol; sum values from i=1 to n
xi : observation i
n : number of observations

Page 131

Formula for the variance
The variance (usually abbreviated to var, s² or σ²) is defined as:

    Var = s² = Σ from i=1 to n of (xi - x̄)² / (n - 1)

The units of variance are the original units squared, e.g. g/dl² for haemoglobin. Therefore we usually use……

Formula for the standard deviation
The standard deviation (usually abbreviated to SD, s or σ) is defined as the square root of the variance:

    s = √[ Σ from i=1 to n of (xi - x̄)² / (n - 1) ]

Page 132 Reference list

1. Campbell MJ and Machin D (1999). Medical Statistics: A Commonsense Approach. 3rd Ed. John Wiley. £16.99
2. Swinscow TDV and Campbell MJ (1996). Statistics at Square One (10th edition). BMJ Publishing Group. £11.95 “Cheap and only basic, but a BMJ best seller.”
3. Bland M (2001). An Introduction to Medical Statistics. 3rd ed. Oxford Medical Publications. £19.99
4. Coggon D (2003). Statistics in Clinical Practice. 2nd Edition. BMJ, London. £11.95
5. Hart A (2001). Making Sense of Statistics in Healthcare. Radcliffe Medical Press. £21.95
6. Petrie A, Sabin C (2000). Medical Statistics at a Glance. Oxford: Blackwell Science.

Useful websites*:

The Little Handbook of statistical practice § http://www.tufts.edu/~gdallal/LHSP.HTM § Basically a set of notes. MBBS and MSc level. § Chapters inc: Basics, Confidence intervals and hypothesis tests, Sample size calculations, Non-parametric statistics, Simple linear regression, comparing two measuring devices, multiple regression, analysis of variance and cross-over studies.

Statistics at Square One § http://bmj.bmjjournals.com/collections/statsbk/index.shtml § Self contained book. MBBS. § Chapters inc: Data display and summary, Mean and standard deviation, population and samples, statements of probability and confidence intervals, differences between means, differences between percentages, t-tests, chi-squared tests, exact probability tests, rank score tests, correlation and regression, survival analysis and study design.

Statistics Guide for Research Grant Applicants http://www.sghms.ac.uk/depts/phs/guide/guide.htm#brief

StatSoft electronic textbook: http://www.statsoftinc.com/textbook/stathome.html n Electronic Statistics Textbook offers training in the understanding and application of statistics. n search or go directly to specific chapters.

Statistics Notes in the BMJ http://www-users.york.ac.uk/%7Emb55/pubs/pbstnote.htm

Page 133

StatPages.Net http://members.aol.com/johnp71/javastat.html n General statistical/software web-site: n The web pages listed here comprise a powerful, conveniently-accessible, multi-platform statistical software package. There are also links to online statistics books, tutorials, downloadable software, and related resources

Betty C Jung http://www.bettycjung.net/ n Quality information on the web for healthcare researchers: n Excel charts (Charting and Graphing Data / Charting Data / Excel Charting Tutorials / Charts for Statistics / Box and whisker plots): http://peltiertech.com/Excel/Charts/statscharts.html#BoxWhisker

UCLA Statistical Computing Resources web-site http://www.ats.ucla.edu/stat/

Statistics Tutorial n DISCUSS n Discovering Important Statistical Concepts Using SpreadSheets n Web-based interactive spreadsheets, designed for teaching elementary statistics n Web-sites for download and information - http://www.coventry.ac.uk/discuss/ n Uses Excel

Evidence Based n Centre for Evidence Based Medicine (EBM): n http://www.cebm.net/ n courses; manuals; scenarios for problem-based learning; critical appraisal worksheets. n Centre for Evidence Based Nursing: n http://www.york.ac.uk/healthsciences/centres/evidence/cebn.htm n Bandolier: n http://www.jr2.ox.ac.uk/bandolier/ n print and Internet journal about health care, using evidence-based medicine techniques to provide advice about particular treatments or diseases for healthcare professionals and consumers. n Bandolier EBM Glossary (http://www.jr2.ox.ac.uk/bandolier/glossary.html) – comprehensive list of definitions - randomisation, incidence, mean, odds ratio, etc.

Other useful web-sites n Details of all relevant articles published by BMJ since Jan 1998. n http://bmj.bmjjournals.com/collections/ n See Books – How to Read a Paper n BMJ Advice to contributors /Statistical Methods / CONSORT Statement n www.bmj.com n WHO Statistical Information System: n http://www3.who.int/whosis/menu.cfm n health-related epidemiological and statistical information n SPSS “White Papers” (e.g. statistical analysis): n http://www.spss.com/downloads/

Page 134 n (click “complete list”)

* It must be remembered that when using the web for information, any information on the web is only as good as its source – it is always worth questioning who has posted the information and whether they have a particular ‘political’ agenda

Glossary of Terms

(Reproduced with permission from ‘Making Sense of Statistics in Healthcare’ Anna Hart, Radcliffe Medical Press Ltd, Oxon)

Allocation concealment The process whereby the people recruiting subjects to a trial are blinded to the treatment group to which those subjects will be assigned. This is important in order to eliminate selection bias

Alternative hypothesis In hypothesis testing, this hypothesis will be ‘accepted’ when the null hypothesis is rejected. Often the alternative hypothesis is the one in which you are interested

ARR – absolute risk reduction – The difference in rates of an undesirable event between the control and experimental groups.

Assumptions The conditions required by a test or statistical method in order for its results to be valid

Balanced A study of two groups is said to be balanced with respect to a particular variable if the distribution of that variable is similar in the two groups

Bar chart A chart showing the frequencies of the values of a categorical variable. The bars are generally separated, and their lengths are proportional to the frequencies

Baseline measure A measure of some characteristic that is recorded for subjects at the start of a study, before any treatment commences

Bias A systematic error that leads to results which are consistently either too large or too small

Binary A binary categorical variable can take one of two values (e.g. true/false or male /female). Sometimes referred to as a dichotomous variable

Boxplot A chart that is often used to compare two or more samples of ordinal or continuous variables. A boxplot shows the median, lower and upper quartiles, the interquartile range, the maximum and minimum values, and possible outliers

Calibration The process of finding the relationship between two scales or instruments that measure the same thing

Case-control study An observational study designed to find relationships between, for example, a risk factor and a disease. A group of cases (with the disease) are compared with a group of controls (without the disease) with regard to their exposure to the risk factor. The data are summarised by an odds ratio

Page 135

Case report Published details of a clinical case history

Case series The results of a series of cases

Censored Censored data often occur in studies of survival data. The data are censored if the event (e.g death or recurrence of disease) has not been observed during the duration of the study

Census A survey of an entire population

Central limit theorem A theorem which tells you about the distribution of the sample mean of large samples. For large samples the sample mean is Normally distributed

CER – control event rate – The rate at which a particular event occurs in the control group. It is a percentage or fraction or decimal

Cohort A group of subjects who share some characteristic in common, which is studied over time

Confidence interval A range of values, calculated from a sample of observations, that is believed, with a particular probability, to contain the true parameter value. A 95% confidence interval, for example, implies that, were the estimation process repeated again and again, then 95% of the calculated intervals would be expected to contain the true parameter value. Note that the stated probability refers to properties of the interval and not to the parameter itself, which is not considered a random variable.

Confound The effects of two variables are said to be confounded if they are inseparable. This undesirable phenomenon is usually the result of poor study design

Continuous A variable is continuous if it can take any value in a particular range (i.e. it can take decimal values) e.g. height, weight, blood pressure

Control group A group of subjects used in a study as a comparison with the group of primary interest

Correlation The correlation coefficient is a measure of the degree of linear association between two continuous variables. A value of +1 indicates perfect positive association, a value of –1 indicates perfect negative association, and a value of 0 indicates no linear association. The value is highly sensitive to a few abnormal data values

Cross-sectional study An observational study in which subjects are investigated at one point in time

Data Numbers or values that are collected for analysis

Data dredging The highly undesirable practice of searching through data in an attempt to find an interesting result. It is sometimes called data fishing

Dependent variable A somewhat confusing term that is used in statistical modelling. When one variable is believed to influence another variable, the latter is called the dependent variable. It is sometimes called a response or outcome variable, and is plotted on the vertical axis of a graph


Dichotomous Taking two possible values (i.e. binary)

Discrete A variable is discrete if it can only take certain values. These are usually whole numbers (e.g. counts, such as the number of visits to the GP)

Distribution Distributions describe the histograms of whole populations. There are several distributions that are commonly used (e.g. the normal distribution)

Double-blind A trial is double-blind if neither the subject nor the person conducting the assessment of the subject knows to which treatment group the subject has been allocated. Uses of the term vary

EER – experimental event rate – The rate at which an event occurs in the experimental group. It can be expressed as a percentage, decimal or fraction

Equivalence study A study in which the objective is to show that two treatments are equivalent in outcome, as opposed to showing that one is superior to the other. The two types of study need to be designed differently

Estimate A value calculated from a sample when you are really interested in the value for the population. It is an informed guess

Experiment A comparative study in which the researchers are able to control the factor of interest. A typical example is a clinical trial in which one treatment is given to one group of subjects and another treatment is given to a second group of subjects. The researchers determine who receives which treatment

Factor A variable with (a few) discrete levels. The term is also used to describe a condition controlled by a researcher in an experiment (e.g. different treatments)

Geometric mean A type of average, usually close to the median. It is related to the product of all the data values. It occurs when positively skewed data have been transformed by taking logs before analysis

Histogram A chart that is used to represent continuous data. It consists of bars which are adjacent, and whose area is proportional to the frequency for that range of values

Independent Two events are said to be independent if knowing about one tells you nothing about the other

Intention to treat analysis In clinical trials subjects may drop out of the study or change treatment groups. An intention-to-treat analysis retains data from all subjects in the group to which they were originally allocated. This is considered to be the correct way to deal with dropouts

Interaction An interaction exists between two variables or factors if the effect of one depends on the value of the other

Interquartile range The difference between the lower and upper quartiles, which includes the central 50% of the data, used to describe the variability in ordinal or skewed data

Interrupted time series A study in which subjects are studied over a period of time – before and then after an event or intervention of interest

Likert-type scale A scale on questionnaires where a subject is asked to what extent they agree with a statement

Linear association A linear association exists between two continuous variables if a reasonable amount of variability in one is explained by a straight-line equation with the other. The scatterplot will show points scattered around a straight line

Longitudinal study A study in which subjects are followed over time. Characteristics are measured at several points in time

Lower quartile The value below which 25% of the data lie, equivalent to the 25th percentile.

Lurking variable Two variables may be highly correlated, but this does not mean that one directly influences the other. Sometimes there is a third (lurking) variable that influences each of them. Lurking variables are used to explain why strong correlations may not be evidence of causality

Mean An average value that is computed by adding together all of the values and dividing by the number of values. Colloquially it is called the average, although in statistics there are different types of average

Measure of dispersion Parameter describing the width or spread of a distribution for quantitative data (e.g. standard deviation or variance)

Measure of location Parameter describing the centre of a distribution for quantitative data (e.g. mean, median)

Median The middle value in a set of data. It is most often used when describing skewed or ordinal data

Meta-analysis An analysis which combines the results from several studies (usually in a systematic review) to provide an overall analysis and confidence interval

Minimisation A non-random procedure (scorned by some researchers) sometimes used in RCTs to achieve balance of several groups with regard to a number of variables

Mode The most frequently occurring value, used to describe nominal or ordinal data

Model An equation relating two or more variables

Multiple testing The rather dangerous practice of performing several tests on the same set of data. This is particularly undesirable if the tests are thought of after the data have been collected

Mutually exclusive Two events are mutually exclusive if they cannot both occur together

Negative prediction rate In diagnostic testing, the probability that you do not have the disease when the test is negative. The value of the negative prediction rate can be affected by the prevalence rate

Nominal A categorical variable is nominal if it can take a set of values that are not ordered (e.g. ethnic origin)

Non-parametric test A test which requires no distributional assumptions about the data. Note the test itself is non-parametric, not the data!

Normal distribution A symmetrical bell-shaped distribution that is often used to model data. For a Normal distribution the mean and the median will coincide. About 95% of the data from a Normal distribution will lie within plus or minus two standard deviations from the mean.

Null hypothesis The hypothesis that states that there is no effect or difference. We assume that this hypothesis is true, and it is only rejected if there is a weight of evidence against it

Observational study A non-experimental study in which subjects are observed. Examples include cohort and case-control studies

Odds ratio A ratio of odds in two groups, often used in case-control studies as an approximation to the relative risk

One-sided test A test in which the alternative hypothesis is that an effect or difference is in a particular direction (e.g. greater than zero). If you intend to use a one-sided test, you should say so at the design stage, and you must have very good reason to do so

Ordinal Data in which the various values have a natural order

Outlier A value in a data set which appears to be a long way from the rest of the data. It may be an error or an unusual or interesting value

Parameter A value such as a population mean or standard deviation, which is seldom known. You usually take a sample from the population and estimate the value of such parameters. It is also sometimes used as the name of the variables or coefficients in a regression model

Parametric test A statistical test which relies on the data having a particular distribution (often the Normal distribution)

Percentile The value below which a particular percentage of the data lie e.g. 25% of observations will lie below the 25th percentile. Note that the median is also the 50th centile

Pie-chart A circle that is divided into sections so that the area of each slice is proportional to the number represented. It is used when all subdivisions of the subject are being studied, and you want to show how the relative sizes of the subdivisions differ. Three-dimensional pie charts can be very misleading

Pilot study A small-scale study that is conducted in order to investigate the usefulness of some method or tool (e.g. a questionnaire) that you intend to use in the full-scale study

Placebo An inert substance, indistinguishable from the active drug, which is given to the control group. This enables both subjects and researchers to remain blinded to the treatment allocation

Population The entire set of subjects or items about which you want information

Population parameter Characteristic of a population that you are trying to estimate


Positive prediction rate In diagnostic testing, the probability that you do have the disease when the test is positive. The value can be affected by the prevalence rate

Power The probability that you will find a statistically significant difference using a statistical test, when that size of difference actually exists

Power calculation Before starting a study you should estimate the size of the sample that you need in order to have high enough power to be able to detect a clinically significant effect. This process is called a power calculation

Predictor variable Sometimes called an explanatory variable or (rather confusingly) independent variable. The variable that is plotted on the horizontal axis, and that is used in modelling to predict the values of the response variable

Probability A measure of how likely an event is. All probabilities range between 0 and 1: a value of 1 denotes an event that is certain to happen and 0 denotes an event that will never happen

Prospective cohort study A study in which a group of subjects is followed forward in time. Usually the level of risk is measured first, and the subjects are monitored for development of a disease

P-value Very commonly misunderstood, this is the probability of observing a test statistic at least as extreme as that actually observed if the null hypothesis is true. A small P-value is interpreted as strong evidence against the null hypothesis. Confidence intervals are more informative than P-values

Qualitative data Observations or information characterised by measurement on a categorical scale

Quantitative data Data in numerical quantities, such as continuous measurements or counts

Random sample A sample chosen from the population by chance – each member has an equal chance of being selected

Randomisation The method of allocating subjects to treatments using the principle of chance

Randomised controlled trial (RCT) A study in which at least two treatment groups are studied, one of which is a control group. Randomisation is used to allocate the subjects to the treatment groups

Range The difference between the smallest and the largest values in the data

Regression line A straight-line equation that is used to model the relationship between a response variable and one or more predictor variables

Relative risk (RR) The ratio of the risk of some event in one group relative to that in another group

Reliability A tool is said to be reliable if it consistently gives the same results

Repeated-measures study A study of subjects where more than one measure is taken on the same subject, usually over a period of time. Measures on the same subject will be associated or correlated, so special methods of analysis are needed

Response variable Sometimes called the outcome variable or the dependent variable. In plots it will be represented on the vertical axis. In modelling it is the variable being predicted by the model

Retrospective study An observational study in which subjects are chosen by disease status and then followed back in time in order to ascertain their exposure to a risk. Typically it is a case- control study

RRR – relative risk reduction – The proportion of the original risk that was eliminated by a treatment, calculated as 1 – RR

Risk Probability of an event happening in a given period of time

Sample A set of people or items chosen for study from a population

Sampling frame The list of the entire population of interest, used to draw a sample

Scatterplot A graph showing the relationship between two continuous variables. Each symbol on the graph is determined by the pair of values of the variables

Sensitivity When using a diagnostic test, the percentage of people with the disease who will test positive

Significance level The probability of rejecting the null hypothesis when it is in fact true. A level of 5% is usually chosen. Sometimes known as the Type I error rate

Single-blind A study in which the subjects are unaware of which treatment they are receiving. However, usage of the term is inconsistent. Some people use the term to refer to studies where the assessor, but not the subjects, is unaware of the treatment allocation

Skewness Data are skewed if the histogram has a long tail on one side

Specificity When using a diagnostic test, the percentage of people without the disease who will test negative

Standard deviation A measure of spread or variability, mainly used for continuous symmetrical data in conjunction with the mean

Standard error A measure of the uncertainty in an estimate from a sample. Strictly speaking it is the standard deviation of the sampling distribution of a statistic (mean, mean difference, proportion, difference in proportions).

Statistic A value calculated from a sample (e.g. the sample mean, sample proportion)

Stratified sampling A method of sampling that is used to compare subsets of a population. Samples are taken from each of the subsets rather than from the population as a whole

Survey An observational study that is used to find out the characteristics of a population. The method of sampling is critically important

Survival data Data that arise from studies where the outcome of interest is the time until a particular event (often death). Censored data are often obtained from such a study

Systematic review A summary of all of the medical literature associated with a particular research question. The search for studies must be systematic and comprehensive

Test statistic A statistic that is calculated from a sample and used in a statistical test. A ‘large’ or extreme value of a test statistic will result in a low P-value and thus rejection of the null hypothesis

Time-series plot A plot that shows the change in a variable or variables over time

Transformation If data are not Normally distributed, they are sometimes transformed on to a different scale by a mathematical manipulation. Common transformations are the natural logarithm, square root and reciprocal.

Treatment group A group in a study that receives an active treatment which is under investigation

Two-sided test A test where the alternative hypothesis is that the effect of interest can be in either direction (e.g. where a drug can be worse or better than placebo)

Type I error Rejecting the null hypothesis when it is true (i.e. claiming to have found an effect that is not really there). It is usually denoted by the Greek letter alpha, α

Type II error Failing to reject the null hypothesis when it is false (i.e. not finding an effect even though it is there). It is usually denoted by the Greek letter beta, β. One minus the Type II error, i.e. 1 – β, is usually referred to as the power.

Upper quartile The value below which 75% of the data lie

Validation The process of checking whether a tool actually measures what it is supposed to measure

Variable A characteristic, subject to variability, that can be measured

Variance The value of the standard deviation squared. The units of variance are the original units of measurement squared. The standard deviation is much easier to understand, since it is measured in the original units of the data

VAS – visual analogue scale – A line of fixed length (usually 10cm) with extreme labels at the ends. Subjects are asked to place a mark such as a cross on the scale to correspond to their opinion or condition


Figure 1: Statistical methods for comparing two independent groups or samples

Compare two independent groups:

- Are the data continuous?
  - Yes: are the data Normally distributed?
    - Yes: independent samples t-test
    - No: Mann-Whitney U test
  - No: are the data ordinal?
    - Yes: Mann-Whitney U test or Chi-squared test for trend
    - No: are the data nominal?
      - Yes: large sample, most expected frequencies > 5?
        - Yes: Chi-squared test
        - No: reduce the number of categories by combining or excluding as appropriate, then Chi-squared test
      - No (i.e. the data are binary): large sample, all expected frequencies > 5?
        - Yes: comparison of two proportions or Chi-squared test
        - No: Chi-squared test with Yates’ continuity correction or Fisher’s exact test
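The logic of this flowchart can also be written out as a short piece of code. The following is an illustrative sketch only (Python; the function and argument names are invented for this handout and do not come from any statistics package):

def choose_test_two_independent_groups(data_type, normal=False, large_expected_counts=False):
    """Mirror Figure 1: choose a test for comparing two independent groups."""
    if data_type == "continuous":
        return "Independent samples t-test" if normal else "Mann-Whitney U test"
    if data_type == "ordinal":
        return "Mann-Whitney U test or Chi-squared test for trend"
    if data_type == "nominal":
        return ("Chi-squared test" if large_expected_counts
                else "Combine or exclude categories, then Chi-squared test")
    if data_type == "binary":
        return ("Comparison of two proportions or Chi-squared test" if large_expected_counts
                else "Chi-squared test with Yates' continuity correction or Fisher's exact test")
    raise ValueError("data_type must be 'continuous', 'ordinal', 'nominal' or 'binary'")

# Example: a skewed continuous outcome measured in two independent groups
print(choose_test_two_independent_groups("continuous", normal=False))  # Mann-Whitney U test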

Figure 2: Statistical methods for differences or paired samples

Compare differences or paired samples:

- Are the data continuous?
  - Yes: are the data Normally distributed?
    - Yes: paired t-test
    - No: Wilcoxon matched pairs test
  - No: are the data ordinal?
    - Yes: sign test or Wilcoxon matched pairs test
    - No: are the data nominal?
      - Yes: Stuart-Maxwell test
      - No (i.e. the data are binary): McNemar’s test

Table 1: Statistical methods for two variables measured on the same sample of subjects

Continuous, Normal vs:
- Continuous, Normal: regression; correlation (Pearson’s r)
- Continuous, non-Normal: regression; rank correlation
- Ordinal: rank correlation (Spearman’s rho or Kendall’s tau)
- Nominal: one-way analysis of variance
- Binary: two independent samples t-test

Continuous, non-Normal vs:
- Continuous, Normal: regression; rank correlation
- Continuous, non-Normal: regression; rank correlation
- Ordinal: rank correlation
- Nominal: Kruskal-Wallis test
- Binary: Mann-Whitney U test

Ordinal vs:
- Continuous, Normal: rank correlation
- Continuous, non-Normal: rank correlation
- Ordinal: rank correlation
- Nominal: Kruskal-Wallis test
- Binary: Mann-Whitney U test; Chi-squared test for trend

Nominal vs:
- Continuous, Normal: one-way analysis of variance
- Continuous, non-Normal: Kruskal-Wallis test
- Ordinal: Kruskal-Wallis test
- Nominal: Chi-squared test
- Binary: Chi-squared test

Binary vs:
- Continuous, Normal: two independent samples t-test
- Continuous, non-Normal: Mann-Whitney U test
- Ordinal: Mann-Whitney U test
- Nominal: Chi-squared test
- Binary: Chi-squared test; Fisher’s exact test

BMJ Papers

Sifting the evidence – what’s wrong with significance tests?

Users’ guide to detecting misleading claims in clinical research papers

Scope tutorials

The visual display of quantitative information

Describing and summarising data

The Normal Distribution

Hypothesis testing and estimation

Randomisation in clinical investigations

Basic tests for continuous Normally distributed data

Mann-Whitney U and Wilcoxon Signed Rank Sum tests

The analysis of categorical data

Fisher’s Exact test

Use of Statistical Tables

Exercises and solutions (Taken in part from ‘Statistical questions in evidence-based medicine’, Bland & Peacock, Oxford Medical Publications, Oxford, 2000)

Displaying and summarising data

1. All patients admitted with suspected myocardial infarction in the Nottingham health district during 1989 and 1990 were studied to determine whether women received the same therapeutic interventions as men. The following table was given (Clarke et al. 1994, BMJ 309, 563-6)

Table 1: Time from onset of symptoms to arrival at hospital

Time (hours)   No. (%) of men   No. (%) of women
< 6            2 528 (52)       1 404 (47)
6-12           535 (11)         329 (11)
12-24          340 (7)          209 (7)
>24            1 459 (30)       1 046 (35)
Total          4 862 (100)      2 988 (100)

1.1 What features of this table are clear?

1.2 What is ambiguous about the time categories?

1.3 What type of data is time? What type of data is it here?

1.4 What type of graph might you use to appropriately display these data?

2. The authors of a study looking at the representativeness of different samples from general practice lists presented the graph overleaf as part of their results. They stated that this figure ‘showed the mean deprivation score for the areas in which cases and controls lived according to participation status. Although the selected controls lived in areas of similar material wealth to their corresponding cases, the controls who participated differed markedly from those who did not. Furthermore, we found significant differences (P< 0.05) between the non-participating groups’ (Smith et al, BMJ, 2004, 328, 932).


Figure: Distribution of deprivation score by participation status and reason for non-participation

2.1 What type of figure is displayed?

2.2 What is wrong with the above statement?

2.3 Given the quote above, is this an appropriate figure?

3. As part of a study of passive smoking and the risk of coronary heart disease and stroke, serum cotinine concentrations were measured for several groups of men, categorised by their smoking status. Serum cotinine is a biomarker for passive smoking and can provide a summary measure of exposure to passive smoking. Figure 1 was given (Whincup et al, 2004, BMJ, 329, 200-205):

3.1 What type of chart is this? Could another chart have been used instead? Is there any other information that you would like to know?

3.2 If you were to display the data in this figure numerically, would it be best to use the mean and standard deviation (SD) or would you use the median and interquartile range? Please explain your answer.

Figure 1: Distribution of serum cotinine concentrations for lifelong non-smokers and former smokers

4. The following results were obtained as part of a study to examine the effect of inhaled corticosteroids on episodes of wheezing associated with viral infections in school age children (Doull et al, 1997, BMJ, 315: 858-862).

                                                        Mean (SD)    Median (Range)
Percentage of days with symptoms:
  Baseline
    Treatment group (n=50)                              26 (27)      21 (0-93)
    Placebo group (n=44)                                27 (30)      19 (0-100)
  Treatment phase
    Treatment group (n=50)                              16 (16)      10 (0-63)
    Placebo group (n=44)                                26 (29)      16 (0-100)
Frequency of episodes of reduced peak flow during treatment (No./year)
    Treatment group (n=50)                              5.1 (1.8)    5.2 (1.8-8.9)
    Placebo group (n=44)                                5.0 (1.9)    4.8 (1.9-9.6)
Average severity of episodes of reduced peak flow during treatment (min flow (l/min))
    Treatment group (n=50)                              181 (52)     180 (54-287)
    Placebo group (n=44)                                169 (47)     166 (89-300)


4.1 What type of data is the percentage of days with symptoms? What type of chart(s) would be appropriate for displaying it? Which provides the best estimate of the location and spread of the distribution (a) the mean and SD or (b) the median and the range? Please explain your answer

4.2 What type of data is the frequency of episodes of reduced peak flow? What type of chart(s) would be appropriate for displaying it? Which provides the best estimate of the location and spread of the distribution (a) the mean and SD or (b) the median and the range? Please explain your answer

4.3 What type of data is the average severity of episodes of reduced peak flow? What type of chart(s) would be appropriate for displaying it? Which provides the best estimate of the location and spread of the distribution (a) the mean and SD or (b) the median and the range? Please explain your answer.

5. The following table has been taken from a recent article in the British Dental Journal (British Dental Journal 203, E11 (2007)). It shows the demographic characteristics of patients in a trial examining post-operative pain experience for three different types of perioperative injection techniques.

5.1 What type of data is age? What type of chart(s) would be appropriate for displaying it? Why do you think that the authors have chosen to display these data using the median, quartiles and range? Do you think that for these patients age is negatively skewed, Normally distributed, or positively skewed? Please explain your answer.

5.2 What type of data is the number of teeth extracted? What type of chart(s) would be appropriate for displaying it?

5.3 What type of data is the time to eye opening? What type of chart(s) would be appropriate for displaying it?

5.4 Looking at the table below, can you think of a better way of displaying these results?

Displaying data – solutions

1.1 The table has a clear title, and has headings for all rows and columns. Frequencies and percentages are labelled and the sum of percentages is given at the bottom of the columns to show that percentages are calculated by columns. The units of time are given as hours.

1.2 ‘Twelve hours’ is contained in both the second and third categories. This could be just a typographical error in which case we do not know if those who waited 12 hours are included in the second or third category. Alternatively it could be that those who waited 12 hours are included in both categories and are double counted

1.3 Time is usually a continuous variable. It is only limited by the scale of measurement. However, as it is displayed here it is used as an ordinal variable. There are ordered categories going from < 6 hours to > 24 hours.

1.4 A clustered barchart would be a good way of displaying these data, as time is ordinal. If we have time as a continuous variable we could use either a stem-and-leaf plot, a histogram, a box-and-whisker plot, or a dotplot

Figure: Time from onset of symptoms to arrival at hospital, by sex

60

Men (n=4862) Women (n=2988) 50

40

% 30

20

10

0 < 6 6-12 12-24 >24

Time from onset of symptoms to arrival at hospirtal

2.1 This is a box and whisker plot. Deprivation score is a continuous measure. The box illustrates the interquartile range and thus contains the middle 50% of the data. The median is shown by the horizontal line across the box and the whiskers extend to the largest and smallest values excluding the outlying values. The outlying values are those values more than 1.5 box lengths from the upper or lower edges and are displayed as dots.

2.2&3 Mike Campbell and I sent off a rapid response to this report and our reply pretty much covers the answer to this question:

We welcome the use of a box-whisker plot in this paper as we believe they are an under-utilised method of displaying data(1). However a figure should either describe data or complement the analysis. The authors are using it to do the latter and in this case a box-whisker is inappropriate. This is immediately apparent because the analysis is of means and yet a conventional box-whisker plot displays medians (the authors state the figure displays means). The figure gives the impression that the groups are actually quite similar, and yet the text conveys the fact that there are significant differences between their means (but fails to give estimates or confidence intervals). We think estimates and 95% confidence intervals would have been a better choice of displaying data here.

(1) Swinscow TDV, Campbell MJ. Statistics at Square One, 10th edition. London: BMJ Books, 2002. pp 6-7.

The authors responded thus: Thank you for your comment regarding the use of the box and whisker plot in our recent short report. We totally agree that the data are better displayed by estimates and 95% confidence intervals to complement the analysis, and they were originally submitted in this format. However, it was a BMJ editorial decision that the distribution of the deprivation scores should be shown using a box and whisker plot.

3.1 This is a barchart. Box and whisker plots of the two groups would also have been a useful way of displaying these data, since from these it would be possible to see the skewed nature of the data and the difference between the two groups in terms of their serum cotinine concentrations. As the y-axis is % frequency, it would be useful to know the numbers that these are based on. If one group had many fewer men in it, you would be slightly alarmed and have much less confidence in the results for this group. As it is, the numbers are 945 never smokers and 1160 former smokers, which are reasonably similar.

3.2 Serum cotinine is not Normally distributed. It is not symmetrically distributed about a central value. Looking at the figure it can be seen that it is positively skewed, with a clustering of values at the lower end of the range and a long tail of higher values. The median and interquartile range would be better than the mean and SD. To use the mean could be misleading as it is influenced by the few high serum cotinine concentration values, which would tend to inflate the value of the mean. The median is much less influenced by this (sadly, the paper reported neither and so you will just have to take our word for this).

4.1 The percentage of days with symptoms is continuous data, although it is bounded by 0 and 100%. It could be displayed using a histogram, a stem and leaf plot or a box-and-whisker plot. The data are clearly skewed. There are several reasons why we can conclude this:

(1) the mean and standard deviation are about the same size. Constructing a reference range using these numbers would give a negative percentage for the lower limit, which is clearly not possible. Having the mean and SD similar is not always a sign of skewed data, provided that negative numbers are possible, but if they are not possible then this is a good indicator that the data are skewed;

(2) the mean is much greater than the median;

(3) The range is large and the median is much closer to the lower limit of the range.

The range goes from 0 to 100 for the placebo group. In fact its lower limit is 0 for all groups, implying that there is at least one individual with no symptoms during the study, whilst 100 implies that at least one individual had symptoms on every day of the study. However, with the range you don’t know whether this was a single person or many people, and you also don’t know how rare this is.
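These checks can be applied mechanically to any published summary table. The following is an illustrative Python sketch only, using the baseline placebo figures from the table above; the ‘mean minus two SDs’ screen is simply the rough rule of thumb from point (1), not a formal test:

# Rough screen for skewness using only published summary statistics
mean, sd, median = 27.0, 30.0, 19.0   # baseline placebo: % of days with symptoms
smallest_possible_value = 0.0         # a percentage cannot be negative

approx_lower_ref_limit = mean - 2 * sd
print(f"Approximate lower reference limit: {approx_lower_ref_limit:.1f}%")

if approx_lower_ref_limit < smallest_possible_value:
    print("Reference range extends below the smallest possible value - data are probably skewed")
if mean > median:
    print("Mean is well above the median - suggests positive skew")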

4.2 The frequency of episodes of reduced peak flow is discrete numerical data (could also say that it is count data). It could be displayed as a barchart. It could also be displayed as a histogram, dotplot or stem and leaf plot. The data looks to be plausibly Normally distributed, as the mean and median are very similar and the SD is much smaller than the mean. In this case either the mean and SD or the median and range would be reasonable.

4.3 The average severity of episodes is continuous. It could be displayed either using a histogram, a stem and leaf plot or a box-and-whisker plot. The data looks to be plausibly normally distributed, as the mean and median are very similar and the SD is much smaller than the mean. In this case either the mean and SD or the median and range would be reasonable.

5.1 Age is a continuous variable. It could be displayed graphically using either a dotplot, stem-and-leaf plot, box-and-whisker plot or histogram. The authors have chosen to use the median, interquartile range and range to summarise age, as age is skewed. You could say that as the median is (marginally) closer to the upper quartile and max than the lower quartile and minimum, then the data is probably negatively skewed.

5.2 Number of teeth extracted is count (discrete) data. You could use a barchart. However, if the range was large (say greater than 15) then you could use a histogram

5.3 Time to eye opening is a continuous variable. It could be displayed graphically using either a dotplot, stem-and-leaf plot, box-and-whisker plot or histogram.

5.4 The contrast of interest is between the three groups over time, i.e. does the response differ between the three groups over time? This is not immediately obvious from this chart. This could be made clearer by re-arranging the data thus:

              NLA (n=18)          IFL (n=17)          ITR (n=19)
              Median (IQR)        Median (IQR)        Median (IQR)
0 min         2.5 (1.8 to 4.5)    4.0 (1.0 to 4.0)    4.0 (0.0 to 4.0)
15 min        2.0 (0.0 to 4.0)    1.0 (0.0 to 1.0)    0.0 (0.0 to 3.0)
30 min        0.5 (0.0 to 1.0)    0.0 (0.0 to 1.0)    0.0 (0.0 to 1.0)

Sampling with confidence

1. The mid upper arm circumference (MUAC), in millimetres was measured as part of a study of the nutritional status of a population of rural Indian children aged 12 to 60 months. Below are the values for a small sample of 16 of these children. The mean for this sample is 149.5mm and the standard deviation (SD) is 12.6mm.

128 162 158 156 148 148 146 136 164 150 148 158 154 172 128 136

1.1 Plot these data on a dot plot (the values are drawn in a vertical line, with the vertical axis representing the actual values).

E.g. [blank dot plot for the exercise: MUAC (mm) on the vertical axis, scaled from 120 to 180]

1.2. Draw lines on this chart representing the mean MUAC and the reference range (see notes from last week)

1.3 Now draw the 95% confidence interval for the mean

1.4 Compare the width of the reference range and the confidence interval. What do you notice?

2. For the following statements decide which are true and which are false:

A 95% confidence interval for the mean: a. Is wider than a 99% confidence interval b. May be regarded as a range of plausible values for the population mean c. Includes 95% of the values in a population d. Is centred on the sample mean


3. Questionnaires were sent to a random sample of 200 GPs in East Anglia asking about their diagnosis and treatment of hypertension. Replies were received from 125 GPs. The questionnaire asked what was the minimum blood pressure they would regard as showing hypertension, and what was the minimum pressure at which they would begin treatment. The study found that the cut-off above which treatment would be given was higher than that used for diagnosis (Dickerson & Brown 1995, BMJ, 310, 547)

3.1 What is meant by a random sample?

3.2 Why was a random sample used here?

3.3 How might the response rate affect one’s interpretation of the findings?

4. During an outbreak of salmonella infection, a particular brand of peanut flavour savoury snack was implicated. The potential link between this snack and the likelihood of infection was investigated using a case-control study design (Killalea et al 1996, BMJ, 313, 1105-1107). A case control study is one in which a group of subjects (cases) with the disease or condition of interest are compared to a group of subjects (controls) without the disease.

4.1 Why might this approach have been used for this study?

4.2 Can you think of any potential problems with this approach?

5. A recent randomised controlled trial (RCT) investigated whether the dose of inhaled corticosteroids could be reduced without adversely affecting asthma control. Patients were randomly allocated to one of two groups: (a) no change in medication (control group), (b) 50% reduction in medication (step-down group). At the start of the study the step-down group were on a higher dose of beclomethasone dipropionate per day than the control group (difference of 62.3μg, 95% confidence interval -93.7μg to 218.3μg). The investigators found that after one year the step-down group received less beclomethasone dipropionate per day than the control group (difference of 348μg, 95% confidence interval 202μg to 494μg) (Hawkins et al, 2003, BMJ, 326, 1115-1120).

5.1 Is there a difference in daily beclomethasone dipropionate doses between the step-down and control group, at the start of the study and after one year? Please comment on the differences between the groups and the confidence intervals.

Sampling Solutions

1.1

[Dot plot of the 16 MUAC values (mm), vertical axis from 120 to 180, with horizontal lines marking the mean, the upper and lower 95% confidence limits, and the upper and lower limits of the 95% reference range.]

1.2 The 95% reference range is between the mean +/- 2 x sd

i.e. the 95% reference range is: 149.5 – (2 x 12.6) to 149.5 + (2 x 12.6) = 124.3 mm to 174.7 mm

1.3 The 95% confidence interval is between the mean +/- 2 x se,

(remember that the se = sd/√n; in this case, the se = 12.6/√16 = 12.6/4 = 3.15 mm)

So the 95% CI is 149.5 – (2 x 3.15) to 149.5 + (2 x 3.15) = 143.2 mm to 155.8 mm

1.4 For this example, the reference range encloses all the data i.e. all the data are between the lower and upper limits. In general you would expect that approximately 95% of the data would be enclosed by the reference range and so for this example you would expect 15 of the 16 observations to lie within the reference range and 1 of the 16 to lie outside it.

In addition the 95% confidence interval for the mean is much narrower than the reference range. For this sample you are 95% confident that the true population mean value lies between the limits 143.2 mm and 155.8 mm, and the best estimate is 149.5 mm.
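If you want to verify these figures, the whole calculation can be reproduced in a few lines. The following is an illustrative Python sketch only, using the 16 MUAC values above and the rough multiplier of 2 from the notes rather than an exact t-value:

import statistics
from math import sqrt

muac = [128, 162, 158, 156, 148, 148, 146, 136,
        164, 150, 148, 158, 154, 172, 128, 136]   # MUAC (mm) for the 16 children

n = len(muac)
mean = statistics.mean(muac)
sd = statistics.stdev(muac)      # sample standard deviation
se = sd / sqrt(n)                # standard error of the mean

print(f"mean = {mean:.1f} mm, SD = {sd:.1f} mm, SE = {se:.2f} mm")
print(f"95% reference range: {mean - 2 * sd:.1f} to {mean + 2 * sd:.1f} mm")
print(f"95% CI for the mean: {mean - 2 * se:.1f} to {mean + 2 * se:.1f} mm")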

2.a This statement is untrue. The 99% confidence interval is wider than the 95% confidence interval. A 95% confidence interval would enclose the true population mean 95% of the time, whereas a 99% confidence interval encloses the true mean 99% of the time.


2.b This is true.

2.c This is untrue. The 95% reference range would usually include 95% of population values. However, the confidence interval is used for estimating the population mean, not the spread of the actual data

2.d This is true. The 95% confidence interval for the mean is calculated as the mean +/- 1.96*standard error, and thus is symmetrical about the mean

3.1 A random sample is a group of subjects taken from a larger population so that each member of the population has an equal chance of being selected. The characteristics of the subjects do not influence the chance of being chosen.

3.2 It is done so that the sample will be representative of the population and we can use the sample to tell us something about the population. Here the replies of the sample will provide information about the views of all GPs in East Anglia, without the expense of asking them.

3.3 The response rate was 63% which is quite low, although not unusual in general practice research. It is possible that GPs who replied to the survey were different from those who did not in ways which were related to the treatment of high blood pressure. Hence the sample may not be representative of all GPs.

4.1 It is impossible to randomly allocate subjects to have food poisoning or not. In addition, one of the advantages of this type of study is that it can be done relatively quickly.

4.2 As the subjects are not chosen randomly there is the potential for bias in their selection. There is also the problem of potential recall bias – the cases may be more likely than the controls to remember an event, particularly if it had unpleasant consequences for them. There may be problems in ascertaining the accuracy of retrospective data, i.e. data collected after the event.

5.1 It is perhaps best understood by referring to the figure below. At the start of the study the difference between the study groups with respect to the amount of beclomethasone dipropionate is 62.3μg. However, it could be as low as -93.7μg or as high as 218.3μg. The thing to note is that the confidence interval includes the value 0, i.e. no difference, and so we would conclude that there is insufficient evidence of a difference between the groups (this last point is covered next week and is perhaps a bit advanced for this week).

The difference after a year is 348μg, though this could be as low as 202μg or as high as 494μg. The thing to note is that the confidence interval does not include 0 and so we would conclude that there does appear to be a difference between the groups with respect to the amount prescribed after one year.

Plot of the mean and 95% confidence interval for the difference in dosage between the step-down and control groups, at baseline and 12 months

[Plot showing the mean difference and its 95% confidence interval at baseline and at 12 months; vertical axis: difference, µg (step-down – control), from –600 to 300.]

Estimation and hypothesis testing

1. One hundred and forty one babies who developed cerebral palsy were compared to a control group of babies made up from the babies who appeared immediately after each cerebral palsy baby in the hospital delivery book. Hospital notes were reviewed by a researcher who was blind to the baby’s outcome. Failure to respond to signs of fetal distress by the medical staff was noted in 25.8% of the cerebral palsy babies and in 7.1% of the delivery book babies. The difference was 18.7 percentage points, with standard error 4.2 and the 95% confidence interval was 10.5 to 26.9 percentage points. (Gaffney et al 1994, BMJ,308, 743-50).

1.1 What is the statistical null hypothesis for this study? What is the alternative hypothesis?

1.2 What is meant by ‘the difference was 18.7 percentage points’?

1.3 What can we conclude from the 95% confidence interval?

2. A randomised controlled trial was used to investigate the cost effectiveness of community leg ulcer clinics. 233 patients were randomly allocated to either intervention (120 patients, treatment at a leg ulcer clinic) or control (113, usual care at home by district nursing service) (Morrell et al 1998, BMJ, 316, 1487-1491). At the end of 12 months the mean time (in weeks) that each patient was free from ulcers during follow up was 20.1 and 14.2 in the clinic and control groups, respectively. On average, patients in the clinic group had 5.9 more ulcer-free weeks (95% confidence interval 1.2 to 10.6 weeks) than the control patients. Mean total NHS costs were £877.60 per year for the clinic group and £863.09 for the control group (P=0.89).

2.1 Is there a statistically significant difference between the two groups with respect to the number of ulcer-free weeks? Please discuss your answer.

2.2 What is the standard error of the difference in mean time?

2.3 Is there a statistically significant difference between the two groups with respect to the cost to the NHS of treating the patients over the 12 month period? Would you expect the confidence interval for this difference to include the value for ‘no difference’? Please discuss your answer.

2.4 What would you conclude from the information above?

2.5 Is there any other information that you would like before making your conclusions?

3. A recent study investigated whether the measles, mumps and rubella (MMR) vaccine was associated with bowel problems and developmental regression in children with autism (Taylor et al, BMJ, 2002, 324, 393-6). The authors reviewed the case notes for 278 children with core autism and 195 with atypical autism from five health districts in north east London born between 1979 and 1998. This time frame was chosen as it included the date when the MMR vaccination was introduced in October 1988. The authors examined whether the proportion with either developmental regression or bowel problems changed during the 20 years from 1979. The P-values associated with the change over time were 0.5 and 0.47 for developmental regression and bowel problems respectively.

In addition the authors examined whether there was any association between bowel problems and developmental regression. Of the 118 children with developmental regression, 26% reported bowel problems, whilst of the 351 without developmental regression 14% reported bowel symptoms. The difference was 12.3% (95% confidence interval 4.2% to 21.5%).

3.1 Write suitable statistical null hypotheses for this study. What are the alternative hypotheses to these?

3.2 Was there a statistically significant change in the proportions with developmental regression during the 20 year study period?

3.3 Was there a statistically significant change in the proportions with bowel problems during the 20 year study period?

3.4 What does the confidence interval for the difference in the percentage with bowel problems for the children with and without developmental regression tell you? Would you expect the P-value for this difference to be greater than or less than 0.05?

4 A UK study of factors affecting the outcome of pregnancy among 1513 women reported that the overall incidence of pre-term births was 7.5%, SE = 0.68%, 95% CI 6.1 to 8.8% (Peacock et al 1995, BMJ, 311, 6531-5).

4.1 What is meant by SE= 0.68%?

4.2 What is meant by 95% CI 6.1 to 8.8%?

4.3 How would the confidence interval change if 90% limits were used?

4.4 How would the confidence interval change if 99% limits were used?

4.5 Another study conducted at about the same time in Denmark and including 51851 women, reported that the overall incidence of pre-term birth was 4.5% (95% CI 4.3 to 4.7). Explain why this 95% CI is narrower than that reported in the UK study. Do you think that there is a real difference in pre-term birth rates between the two populations being studied?

5 A clinical trial compared two drugs to treat rheumatoid arthritis. Drug A was the standard treatment and Drug B was the new treatment. The outcome was the reduction in pain score after 4 weeks of treatment; pain was scored on a 20 point scale from 0 (no pain) to 20 (worst pain imaginable). A total of 40 patients (20 per treatment arm) took part in the trial. At the end of the trial the following statement was made:

‘the two drugs were significantly different from one another with respect to pain reduction (p=0.01). Drug A showed an average reduction of 3 units, while for Drug B the average reduction was 8 units. Hence the mean difference was 5 units. The 95% confidence interval for this difference was from 3 to 7 units’.

Explain this statement as if you were talking to a lay person

Estimation and Hypothesis Testing Solutions

1.1 The null hypothesis is that the percentage of deliveries where there was a failure to respond to signs of fetal distress did not differ between the cerebral palsy babies and the delivery book controls. The alternative hypothesis is that there was a difference between the two groups with respect to failure to respond to signs of fetal distress.

1.2 Failure to respond to signs of fetal distress was noted in 25.8% of the cerebral palsy babies and in 7.1% of the delivery book babies. The difference between these two percentages was 25.8 – 7.1 = 18.7. This is the actual or absolute difference in the two percentages and so is expressed in percentage points, and is sometimes referred to as the absolute risk difference or absolute risk reduction (as we shall learn in a later lecture). This distinguishes it from a relative difference, where we might, for example, say that the rate in the cerebral palsy babies was 3.6 times (i.e. 25.8 / 7.1) that for the delivery book babies.

1.3 The 95% confidence interval shows that the difference between the two groups is estimated to be at least as large as 10.5 percentage points (lower limit of the confidence interval) and may be as great as 26.9 percentage points (upper limit of the confidence interval). Since the interval excludes 0.0, there is good evidence for a real difference in the groups in the population from which the samples come

2.1 On average the patients in the clinic group had 5.9 more ulcer-free weeks than the control group and the 95% CI for this difference ranged from 1.2 to 10.6 weeks. As this confidence interval does not include 0 weeks we can conclude that there was a significant difference between the two groups with respect to the number of ulcer-free weeks over the 12 month study period.

We know that the 95% confidence interval is approximately

mean +/- 2 x se

Thus we can use either the lower limit or the upper limit to obtain the se:

se = (mean – lower limit) / 2 = (5.9 – 1.2) / 2 = 2.35 or

se = (upper limit – mean) / 2 = (10.6 – 5.9) / 2 = 2.35
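This back-calculation is easily checked in code. The following is an illustrative Python sketch only, re-using the figures quoted in the question:

# Recover the standard error from a reported 95% CI, using CI ~ estimate +/- 2 * SE
mean_diff = 5.9            # difference in mean ulcer-free weeks (clinic - control)
lower, upper = 1.2, 10.6   # reported 95% confidence limits

se_from_lower = (mean_diff - lower) / 2
se_from_upper = (upper - mean_diff) / 2
print(f"SE from lower limit: {se_from_lower:.2f}")   # 2.35
print(f"SE from upper limit: {se_from_upper:.2f}")   # 2.35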

2.3 Mean NHS costs were £877.60 per year for the clinic group and £863.09 for the control group and the P-value for the difference was 0.89. As this value is greater than the critical value of 0.05, we can conclude that there is no evidence of a statistically significant difference between the groups with respect to cost of treatment (technically speaking – there is insufficient evidence to reject the null hypothesis).

2.4 From the information above, it would be reasonable to conclude that community based leg ulcer clinics with trained nurses using four layer bandaging are more effective than traditional home based treatment, in terms of the number of ulcer free weeks. This benefit is achieved at a marginal additional cost. If we divide the mean difference in costs between the two groups by the mean difference between the groups in ulcer free weeks, we get something called the incremental cost effectiveness ratio, e.g. £14.51/5.9 weeks, i.e. it costs about £2.46 to achieve an extra ulcer free week.
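As a quick check of the arithmetic, the incremental cost effectiveness ratio can be computed directly. The following is an illustrative Python sketch only, using the cost and outcome figures quoted above:

# Incremental cost-effectiveness ratio: extra cost per extra ulcer-free week
mean_cost_clinic, mean_cost_control = 877.60, 863.09   # mean annual NHS cost (pounds)
extra_ulcer_free_weeks = 5.9                            # difference in mean ulcer-free weeks

icer = (mean_cost_clinic - mean_cost_control) / extra_ulcer_free_weeks
print(f"About {icer:.2f} pounds per additional ulcer-free week")   # ~2.46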

2.5 The other leg ulcer related outcome variables – such as pain experienced, recurrence rates, quality of life and time to heal – and how these compared between the groups. Whilst the study may have had a single primary outcome, there may be other outcomes of interest or importance, and it is good to have an overall idea of the effectiveness of the treatment, both from a clinical perspective and from that of the patients.

3.1 There were several options here

(i) the proportion with developmental regression did not change over the 20 year study. The alternative to this was that the proportion did change over the study period.

(ii) the proportion with bowel problems did not change over the 20 year study. The alternative to this was that the proportion did change over the study period.

(iii) the proportion reporting bowel problems did not differ between the group with developmental regression and the group without developmental regression. The alternative to this was that the proportion was different between the two groups

3.2 There was no significant change in the proportions reporting developmental regression during the 20 year study period, as the P-value for this difference was > 0.05 (technically speaking – there is insufficient evidence to reject the null hypothesis at the 5% level).

3.3 There was no significant change in the proportions reporting bowel problems during the 20 year study period, as the P-value for this difference was > 0.05 (technically speaking – there is insufficient evidence to reject the null hypothesis at the 5% level).

3.4 The 95% confidence interval shows that the difference in the percentages with bowel problems between those with developmental regression and those without is estimated to be at least as large as 4.2 percentage points and may be as great as 21.5 percentage points. Since this interval excludes 0.0, there is good evidence for a real difference in the population from which the samples come. The P-value for this difference would be < 0.05 (in fact it is 0.003).

4.1 This is the standard error. Estimates of population values vary from sample to sample and therefore have a theoretical distribution, the sampling distribution. The standard error of an estimate is a measure of the variability of this distribution. The standard error is the standard deviation of the sampling distribution of the sample estimate. The standard error therefore provides information about the precision of the estimate and is used to calculate confidence intervals around the estimates. Here the value of 0.68 is the standard error of the percentage of pre-term births. The units of the standard error are the same as for the sample estimate and hence it is in percentage points. In the context of the previous sessions, the standard error is used to assess the uncertainty in the mean whilst the standard deviation assesses the spread of the individual patients’ values.


4.2 This is the 95% confidence interval. It is a range of values which we estimate will contain the true population percentage of pre-term births, for 95% of samples. This is in the sense that if a large number of samples were taken from the same population, then 95% of the calculated confidence intervals would contain the population percentage. This implies that 5% of these samples would not contain the true population percentage. Unfortunately, the calculated CI does not come pre-labelled with the information that it is a good or correct CI and includes the true population value. We have to assume that it does! Here we can deduce that the population value is very likely to lie between 6.1% and 8.8%, but our best estimate of it is 7.5%.

4.3 If 90% limits were used the confidence interval would be narrower and fewer (90% rather than 95% of) confidence intervals from possible samples would contain the population incidence. In this case, if we repeated the sampling from the population, say 100 times, then on average 90 of the calculated CIs would include the true value and 10 would not. Again, with only one actual CI, we do not know whether or not this calculated interval includes the true population value. Thus the estimated range of possible values would be narrower but there would be more chance of being wrong, and of the CI not containing the true population value (the 90% limits would be 6.4 to 8.6%).

4.4 If 99% limits were used the confidence interval would be wider and more (99% of) confidence intervals from possible samples would contain the population incidence. Thus there would be less chance of being wrong but the range of possible population values would be greater (the 99% confidence limits would be 5.8 to 9.2%). See figure below

[Plot of the 90%, 95% and 99% confidence intervals side by side; vertical axis: incidence of pre-term births (%), from 4.2 to 4.8.]
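The effect of the confidence level on the width of the interval is easy to see numerically. The following is an illustrative Python sketch only; it applies the usual Normal-approximation multipliers (1.645, 1.96 and 2.576) to the UK study’s estimate and standard error, so the printed limits agree with the quoted ranges only to within rounding:

# Confidence intervals of different widths from the same estimate and standard error
estimate, se = 7.5, 0.68   # UK study: incidence of pre-term births (%) and its SE

for level, z in [(90, 1.645), (95, 1.96), (99, 2.576)]:
    lower, upper = estimate - z * se, estimate + z * se
    print(f"{level}% CI: {lower:.1f}% to {upper:.1f}%")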


4.5 The Danish study included many more subjects than the UK study and so the estimate of the pre-term birth incidence is much more precise. Hence, the 95% confidence interval is narrower. The percentage pre-term in the UK study is 3 percentage points higher than in the Danish study and the two 95% CI do not overlap. Hence there is some evidence for a real difference (see figure below)

Mean and 95% confidence interval plots

[Plot of the incidence of pre-term births (%) with 95% confidence intervals for the Peacock et al (UK) study and the Danish study; vertical axis from 4 to 10.]

5. It is likely that there is a real difference between the two treatments. We would expect that, in general, an average patient would experience 5 units less pain if they were to take drug B rather than drug A. However, this difference could plausibly be anywhere between 3 and 7 units.

Risk

1. A study in 2002 reported that women taking a particular form of HRT were more likely to get breast cancer. The Sun reported this as “HRT danger for women: Breast cancer up 26%”. This was based upon the following: the risk of breast cancer in the women not given HRT was 30 per 10,000. The risk in women given HRT was 38 per 10,000

1.1 What is the relative risk of breast cancer in women taking HRT compared to those not taking HRT in this study?

1.2 What is the absolute risk excess of breast cancer per 10,000 women on HRT?

1.3 What is the number needed to treat to harm for HRT and breast cancer?

2. A warning was released in 1995 on the safety of a subset of the ‘Pill’ known as third generation. The warning said that women on these third generation pills were twice as likely to get DVTs (clots in the leg). The absolute risk for the second generation pill was 15 per 100,000 and the absolute risk for the third generation pill was 30 per 100,000.

2.1 How many women would need to be treated with a third generation pill compared to a second generation pill to cause one extra DVT?

2.2 The absolute risk per year of DVT in pregnancy is 80 cases per 100,000. What is the absolute risk reduction if women go onto the third generation pill rather than get pregnant?

3. “Three well-controlled clinical trials carried out in children with major depressive disorder compared the effect of Paxil and placebo and found that Paxil(Seroxat) did not work any better than placebo”

“Surveys have shown that incidents of self-harm and potentially suicidal behaviour were between 1.5 and 3.2 times higher among young people taking the drug [Seroxat/Paxil] than those who were not”.

“Certain possible suicide-related behaviours, including suicidal thoughts and attempts, [are] more common in children receiving Paxil. The risk of these events in the study was about 3 times greater with Paxil compared to placebo. There were no deaths in these trials.”

3.1 Based on these statements, would you prescribe Paxil?1

1 18,000 prescriptions per year

Page 216

4. A recent cohort study published in the BMJ examined the risks and benefits, with respect to cancer, associated with oral contraceptive use. One of the tables produced showed the absolute numbers developing cancer according to whether the participant had ever used an oral contraceptive (Hannaford et al. BMJ 2007;335:651-; originally published online 11 Sep 2007):

Table: Number of women in the main dataset who had ever/never used oral contraceptives classified according to whether they developed cancer or not.

Used oral contraceptive     Ever       Never
Cancer                      2 485      1 392
No cancer                  26 277     15 796
Total                      28 762     17 188

4.1 What is the relative risk of developing cancer for those individuals who have ever used oral contraceptives compared to those who have never used them?

4.2 What is the odds ratio of developing cancer for those individuals who have ever used oral contraceptives compared to those who have never used them? How does this compare to the relative risk? In this case, would you feel confident using the two interchangeably?

4.3 Based on the table above, would you feel confident that the risk of cancer is significantly different in the ever-used group compared to the never-used group? Are there any additional things that you would need to know or consider?

5. Ninety-nine pregnant women with dystocia (difficult childbirth or labour) were allocated at random to receive immersion in water in a birth pool (intervention group, n=49) or standard augmentation for dystocia (control group, n=50) in a randomised controlled trial to evaluate the impact of labouring in water during the first stage of labour (Cluett et al 2004, BMJ, doi:10.1136/bmj.37963.606412.EE, published 26 January 2004). The main outcome was use of epidural analgesia at any stage of labour. Forty-seven percent of women in the intervention group had epidural analgesia during labour, compared to sixty-six percent in the control group.

5.1 Display these data in the form of a 2x2 contingency table

5.2 What is the relative risk of epidural for the intervention group compared with the control group?

5.3 What is the odds ratio of epidural for the intervention group compared with the control group? Compare this OR estimate with the RR estimate from 5.2; what do you notice?

5.4 Find the absolute risk difference for epidural for labour in water compared to augmentation. What is the number needed to treat for the intervention (labour in water) to stop one extra epidural compared to control (standard augmentation)?

Page 217 Risk Solutions

1.1
                      On HRT          Not on HRT
                      (per 10,000)    (per 10,000)
Breast cancer             38              30
No breast cancer       9,962           9,970

‘Risk’ of breast cancer on HRT = 38/10,000 = 0.0038
‘Risk’ of breast cancer not on HRT = 30/10,000 = 0.0030

Relative Risk = 0.0038/0.0030 = 1.27

Thus patients on HRT are 1.27 times as likely to develop breast cancer as those not on HRT.

1.2 Absolute risk excess = 0.0038 – 0.0030 = 0.0008

Which is 8 additional cases of breast cancer per 10,000 women on HRT

1.3 Number needed to treat to harm is 1/0.0008 = 1250

Need to treat 1250 women with HRT to cause one additional case of breast cancer
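These calculations can also be checked with a few lines of code; the sketch below is purely illustrative and simply repeats the arithmetic of 1.1-1.3.

    risk_hrt = 38 / 10_000        # risk of breast cancer on HRT
    risk_no_hrt = 30 / 10_000     # risk of breast cancer not on HRT

    relative_risk = risk_hrt / risk_no_hrt             # = 1.27 (to 2 dp)
    absolute_risk_excess = risk_hrt - risk_no_hrt      # = 0.0008, i.e. 8 extra cases per 10,000
    nnt_harm = 1 / absolute_risk_excess                # = 1250 women treated per extra case

    print(relative_risk, absolute_risk_excess, nnt_harm)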

2.1
            On 3G pills    On 1st or 2nd G pills
DVT              30                 15
No DVT       99,970             99,985
Total       100,000            100,000

‘Risk’ of DVT on 3G pill = 30/100,000 = 0.00030
‘Risk’ of DVT on 1st or 2nd G pill = 15/100,000 = 0.00015

Relative Risk = 0.00030/0.00015 = 2

Thus patients on the 3G pill are twice as likely to have a DVT as patients on 1st or 2nd G pills.

Absolute risk excess = 0.00030 – 0.00015 = 0.00015

Which is 15 additional cases of DVT per 100,000 women on 3G rather than other pills

Number needed to treat to harm is 1/0.00015 = 6667

Need to treat 6667 women for one year to cause one additional case of DVT

Page 218

2.2
            On 3G pills    Pregnant
DVT              30            80
No DVT       99,970        99,920
Total       100,000       100,000

‘Risk’ of DVT on 3G pill = 30/100,000 = 0.00030
‘Risk’ of DVT whilst pregnant = 80/100,000 = 0.00080

Relative Risk = 0.00030/0.00080 = 0.375

Absolute risk reduction = 0.00080 – 0.00030 = 0.0005

Which is 50 fewer cases of DVT per 100,000 women on the 3G pill than if they were to become pregnant.

Contraceptive Pill Scare (October 1995)

Conclusions

• The Committee on Safety of Medicines (CSM) issued a warning of an association between the 3rd generation contraceptive pill (compared with other oral contraceptives) and deep vein thrombosis (DVT).

• A relative risk of 2 was widely reported, i.e. the risk in the exposed group (those on the 3G pill) was twice that in the unexposed group (those on other pills).

• Women advised to discuss risks with GP at time of next prescription.

• Story leaked to the press before doctors could be properly informed.

The CSM warning was based on the following results:
• 30 cases of DVT per 100,000 users per year on the 3rd generation pill
• 15 cases of DVT per 100,000 users per year on the 2nd generation pill

Thus: relative risk of 3rd gen. compared with 2nd gen. = 2.
However: absolute risk increase = 15 per 100,000.
Thus: number needed to treat to cause one extra DVT on the 3rd gen. pill ≈ 6,700 women-years.

Note also:
• 5 cases per 100,000 per year among non-users
• Risks due to pregnancy and abortion far exceed those due to the 3rd generation contraceptive pill, e.g. 80 cases per 100,000 per year for pregnant women
• Mortality from DVT ~ 1-2%

• Advice to consult GP misunderstood.

• Due to a misunderstanding of the risk, many women stopped taking the pill immediately or did not finish their course.

Page 219 • Consequently there was an increase in unwanted pregnancies and in the number of abortions.

Figure: Legal abortions performed per quarter, 1991-1996

[Line plot of the number of legal abortions per quarter from January 1991 to January 1996, with the timing of the pill scare (October 1995) marked.]

3.1 This is a tricky one, as there is not really enough information in these statements to make a decision. For the first statement you would also want to know the size of the trials, as they may have been under-powered, i.e. too small to detect a difference even if one existed. It could be that they did find a clinically significant difference between the groups, but the number of participants was too small for them to be able to conclude that this was statistically significant.

For the second and third statements it would be helpful to know the absolute risks associated with these relative risks (i.e. the prevalence): although the relative risks are around two or more, if the absolute risk were, say, 1 per 10,000 compared with 2 per 10,000, then for a single individual the extra risk is small. You would also want to know how the surveys were conducted, what the response rates were, who the controls were and how they were recruited, etc.
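To make the relative-versus-absolute distinction concrete, the sketch below uses the purely hypothetical figures mentioned above (1 and 2 events per 10,000); the numbers are for illustration only and are not taken from the Seroxat/Paxil surveys.

    risk_comparator = 1 / 10_000     # hypothetical baseline risk
    risk_drug = 2 / 10_000           # hypothetical risk on the drug

    relative_risk = risk_drug / risk_comparator           # = 2.0, which sounds alarming
    absolute_risk_excess = risk_drug - risk_comparator    # = 0.0001, i.e. 1 extra event per 10,000
    nnt_harm = 1 / absolute_risk_excess                   # = 10,000 people treated per extra event

    print(relative_risk, absolute_risk_excess, nnt_harm)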

4.1
Used oral contraceptive     Ever       Never
Cancer                      2 485      1 392
No cancer                  26 277     15 796
Total                      28 762     17 188

‘Risk’ of cancer for ‘pill’ group = 2485/28762 = 0.086
‘Risk’ of cancer for ‘no pill’ group = 1392/17188 = 0.081

Relative Risk = 0.086/0.081 = 1.06

Page 220

Thus patients who have ever taken the pill are 1.06 times as likely to develop cancer as those who have never taken the pill.

4.2 Odds of cancer for ‘pill’ group = 2485/26277 = 0.095
Odds of cancer for ‘no pill’ group = 1392/15796 = 0.088

Odds ratio = 0.095/0.088 = 1.09

Thus the odds of developing cancer for those patients who have ever taken the pill are 1.09 times those for patients who have never taken the pill. This value is very similar to the relative risk, because the odds ratio approaches the relative risk when the event of interest is rare; here the risk of cancer is only about 8%. In this case you could probably regard the odds ratio as an approximation to the relative risk.
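A short sketch of this comparison, using the counts from the table in 4.1 (illustrative only; the unrounded results differ slightly from the hand calculations above, which round the risks and odds before dividing):

    # 2x2 table from 4.1: ever vs never used oral contraceptives
    cancer_ever, no_cancer_ever = 2_485, 26_277
    cancer_never, no_cancer_never = 1_392, 15_796

    risk_ever = cancer_ever / (cancer_ever + no_cancer_ever)       # about 0.086
    risk_never = cancer_never / (cancer_never + no_cancer_never)   # about 0.081
    relative_risk = risk_ever / risk_never

    odds_ever = cancer_ever / no_cancer_ever
    odds_never = cancer_never / no_cancer_never
    odds_ratio = odds_ever / odds_never

    # Because cancer occurs in only about 8% of each group (a fairly rare outcome),
    # the odds ratio and the relative risk come out close to one another.
    print(relative_risk, odds_ratio)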

4.3 This is hard to say based upon the relative risk alone. It would be useful to know the confidence interval for the relative risk, whether the groups were similar with respect to other risk factors, and whether the two groups were followed up for the same length of time.
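Confidence intervals for a relative risk are not covered in this handout, but for reference a common large-sample approach works on the log scale, with SE(ln RR) = √(1/a - 1/n1 + 1/c - 1/n2), where a out of n1 and c out of n2 are the numbers of events in the two groups. The sketch below applies that formula to the table in 4.1; treat it as an illustration of the method rather than part of the original solution.

    import math

    a, n1 = 2_485, 28_762   # cancers and total in the 'ever used' group
    c, n2 = 1_392, 17_188   # cancers and total in the 'never used' group

    rr = (a / n1) / (c / n2)
    se_log_rr = math.sqrt(1/a - 1/n1 + 1/c - 1/n2)

    lower = math.exp(math.log(rr) - 1.96 * se_log_rr)
    upper = math.exp(math.log(rr) + 1.96 * se_log_rr)
    print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")   # roughly 1.07 (1.00 to 1.14)

    # An interval that only just excludes (or includes) 1 reinforces the point that,
    # without further information, the evidence for a real difference is weak.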

5.1
               Study group
               Intervention   Control
Epidural            23           33
No epidural         26           17
Total               49           50

5.2 ‘Risk’ of epidural for intervention group = 23/49 = 0.47
‘Risk’ of epidural for control group = 33/50 = 0.66

Relative Risk = 0.47/0.66 = 0.71

5.3 Odds of epidural for intervention group = 23/26 = 0.885
Odds of epidural for control group = 33/17 = 1.941

Odds ratio = 0.885/1.941 = 0.46

The odds ratio is less than the relative risk. As the event of interest (epidural) is common in both groups, the odds ratio and relative risk are not similar to each other and cannot be used interchangeably.

5.4 The absolute risk difference is 0.47 – 0.66 = -0.19

The number needed to treat (for benefit) is 1/0.19 = 5.26 (ignoring the sign). Thus approximately 5 to 6 women would need to labour in water to prevent one additional woman having an epidural.
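The whole calculation in 5.2-5.4 can be condensed into a short, purely illustrative sketch:

    # 2x2 table from 5.1: water birth (intervention) vs standard augmentation (control)
    epi_int, no_epi_int = 23, 26      # intervention group, n = 49
    epi_con, no_epi_con = 33, 17      # control group, n = 50

    risk_int = epi_int / (epi_int + no_epi_int)    # about 0.47
    risk_con = epi_con / (epi_con + no_epi_con)    # 0.66
    relative_risk = risk_int / risk_con            # about 0.71

    odds_ratio = (epi_int / no_epi_int) / (epi_con / no_epi_con)   # about 0.46

    risk_difference = risk_int - risk_con          # about -0.19
    nnt = 1 / abs(risk_difference)                 # about 5.3: treat 5 to 6 women per epidural avoided

    print(relative_risk, odds_ratio, risk_difference, nnt)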

Page 221

Page 222 Correlation and Regression

1. If there is a strong correlation between two variables does this mean that they are causally related?

2. Table 1 below shows the heights and weights of ten elderly men.

Table 1: Height and weight of 10 elderly men

     Height (cm)   Weight (kg)   Fitted value
 1       173           65             74
 2       165           57             68
 3       173           77             74
 4       183           89             84
 5       178           93             79
 6       188           73             88
 7       180           83             81
 8       183           86             83
 9       163           70             66
10       178           83             79

2.1 Draw a scatter plot of these data.

2.2 Does this suggest that there is an association or correlation between height and weight in elderly men?

2.3 The regression equation for these data is Weight = -76.7 + 0.88*Height. Comment on the slope (b) and intercept (a) statistics in this equation.

Also included in the above table are the fitted values from the model. These are the predicted weight values for particular heights e.g. the predicted weight for someone 173 cm tall is 74kg.

2.4 Calculate the residuals from the model and display them as (a) a stem-and-leaf plot and (b) a dotplot.

2.5 Do these suggest that the residuals are Normally distributed? Are there any other things that would be useful in assessing model fit?

3. In a study of blood pressure during pregnancy and fetal growth, 209 healthy women having their first pregnancy had 24 hour blood pressure readings taken in mid-pregnancy. The size of the baby was recorded at birth. The abstract included the following statement: ‘It was found that a 5mmHg increase in mean 24 hour diastolic blood pressure at 28 weeks gestation was associated with a 68 g (95% CI 3 to 132) decrease in birth weight… Maternal mean 24 hour diastolic blood pressure at 28 weeks gestation was also inversely associated with the infant’s ponderal index at birth (used to measure fetal growth status)… (P=0.06)’ (Churchill and Beevers, 1996. Obstetrics and Gynecology, 88, 455-61).

3.1 What method would be used to calculate the 68g per 5mmHg?

Page 223 3.2 What are the assumptions underlying this method? How might you test these assumptions?

3.3 What is meant by ‘increase’ and ‘decrease’ here? Do they mean that when a woman’s blood pressure went down her baby’s weight went up?

4. As part of a study of the usefulness of information leaflets for pregnant women, the relationship between maternal age and birthweight of 3186 mothers and their babies was examined. (O’Cathain et al, BMJ 2002;324:643-647). Maternal age and birthweight were found to be significantly correlated (r=0.05, p=0.006).

4.1 What is the statistical null hypothesis for the above test? What is the alternative hypothesis?

4.2 What is meant by ‘correlated’ and ‘r=0.05’? Draw a rough line on a graph illustrating what r=0.05 might look like.

4.3 What is meant by ‘p=0.006’? What can we conclude about the relationship between maternal age and birthweight?

Page 224 Correlation and Regression Solutions

1. No. There are various criteria, originally laid out by Bradford Hill, that need to be considered before a relationship can be deemed to be causal. These include temporality (i.e. the disease must follow the exposure in time), plausibility, and consistency (the association is shown in various studies, in differing locations, etc.).

2. Table 1 below shows the heights and weights of ten elderly men.

Table 1: Height and weight of 10 elderly men, with fitted values and residuals

     Height (cm)   Weight (kg)   Fitted value   Residual
 1       173           65             74            -9
 2       165           57             68           -11
 3       173           77             74             3
 4       183           89             84             5
 5       178           93             79            14
 6       188           73             88           -15
 7       180           83             81             2
 8       183           86             83             3
 9       163           70             66             4
10       178           83             79             4

2.1 Scatter plot of weight against height for the data above.

[Scatterplot of weight (kg) on the y-axis against height (cm) on the x-axis for the ten men.]

2.2 Yes, there appears to be a positive correlation between height and weight for this sample of elderly men. As height increases, so does weight.

2.3 The a coefficient is the intercept: the point at which the regression line crosses the y axis, i.e. the predicted value of y when x is zero. Thus, when height is 0, the predicted weight is -76.7 kg. Clearly, height cannot be zero and weight cannot be -76.7 kg. This illustrates why it is not a good idea to extrapolate your conclusions beyond the range of the data; you may well get very odd results which are not plausible.

Page 225

The b coefficient is the coefficient for the slope. It tells you how much the y-variable (in this case weight) changes for a single unit change in the x-variable (in this case height). Here b is 0.88, and thus for every cm increase in height there is an increase in weight of 0.88 kg.
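The least-squares fit, fitted values and residuals discussed in 2.3 and 2.4 can be reproduced with the illustrative sketch below. Because the coefficients quoted in the question are rounded, the values it prints may differ somewhat from the rounded equation and from the rounded fitted values in the table.

    # Heights (cm) and weights (kg) of the 10 elderly men from Table 1
    heights = [173, 165, 173, 183, 178, 188, 180, 183, 163, 178]
    weights = [65, 57, 77, 89, 93, 73, 83, 86, 70, 83]

    n = len(heights)
    mean_h = sum(heights) / n
    mean_w = sum(weights) / n

    # Least-squares slope b = Sxy / Sxx and intercept a = mean(y) - b * mean(x)
    sxy = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights))
    sxx = sum((h - mean_h) ** 2 for h in heights)
    b = sxy / sxx
    a = mean_w - b * mean_h

    fitted = [a + b * h for h in heights]
    residuals = [w - f for w, f in zip(weights, fitted)]

    print(a, b)
    print([round(r, 1) for r in residuals])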

2.4 Stem-and-Leaf Plot:

Frequency    Stem &  Leaf

    2.00       -1 .  15
    1.00       -0 .  9
    6.00        0 .  233455
    1.00        1 .  3

Stem width: 10.00
Each leaf:  1 case(s)

Dotplot:

[Dotplot of the residual values.]

2.5 It is hard to say, as there are so few values (only 10). In addition, one could do other residual plots, e.g. a scatterplot of the residuals against the fitted values. If this scatter is random (there is no tendency for the residuals to increase or decrease with the fitted value), it indicates that the assumption of constant variance is met.

3.1 Simple linear regression. This method estimates the nature of the relationship between two continuous variables, here mean 24-hour diastolic blood pressure and infant birthweight. The quantity given comes directly from the slope of the fitted line. The method works by calculating the line of best fit through the data using the principle of least squares.

Page 226

3.3 In this study blood pressure at 28 weeks is related to birthweight, a quantity which is measured once at birth. This analysis is not investigating how changes in blood pressure within individuals might affect the growth of the baby. The ‘increase’ in birthweight associated with a ‘decrease’ in blood pressure therefore refers to a mean effect rather than an effect for an individual: the difference in mean birthweight between two groups of women whose mean blood pressures differ by 5 mmHg would be 68 g, with the direction of the difference being that women with lower mean blood pressure have greater mean birthweight.

4.1 The null hypothesis is that there is no relationship or correlation between maternal age and birthweight, i.e. the population correlation coefficient = 0. The alternative hypothesis is that there is a relationship or correlation between maternal age and birthweight, i.e. the population correlation coefficient ≠ 0.

4.2 Two continuous variables are correlated if (a) when one has high values the other has high values, and when one has low values the other has low values, or (b) when one has high values the other has low values, and when one has low values the other has high values.

r is the correlation coefficient, which measures the strength of the linear relationship between two continuous variables. Its value lies between -1 and +1, with 0 indicating no linear relationship.

r = 0.05 is a positive correlation, showing that birthweight tends to be greater for babies with an older mother but the relationship is very weak and would be very hard to see on a scatter diagram.

Figure: Scatterplot of birthweight vs maternal age (n=3186)

4.3 The correlation coefficient is r = 0.05 (p = 0.006).

Page 227

What does P = 0.006 mean?
• Your results are unlikely when the null hypothesis is true.

Is this result statistically significant?
• The result is statistically significant at the 5% level because the P-value is less than the significance level (α) set at 5% or 0.05.

You decide:
• That there is sufficient evidence to reject the null hypothesis, and therefore you accept the alternative hypothesis that there is a correlation between maternal age and birthweight.
• So, in the population which these subjects represent, maternal age is related to birthweight, but the relationship is very weak.

As demonstrated in the lecture, the larger the sample size the more significant the p-value for a given correlation coefficient. Looking at the scatterplot above, and given the value of the correlation coefficient of 0.05, although this coefficient is statistically significant there is only very weak evidence (if any) that older mothers tend to give birth to heavier babies. We certainly could not conclude from these data that the relationship is causal.
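The effect of sample size can be illustrated by converting a correlation coefficient into its test statistic, t = r√((n-2)/(1-r²)), and an approximate two-sided p-value. In the sketch below only n = 3186 and r = 0.05 come from the paper; the smaller sample size used for comparison is hypothetical.

    import math

    def approx_p_value(r, n):
        """Approximate two-sided p-value for testing r = 0 (Normal approximation to the t distribution)."""
        t = r * math.sqrt((n - 2) / (1 - r ** 2))
        return math.erfc(t / math.sqrt(2))

    print(approx_p_value(0.05, 3186))   # about 0.005: 'significant' despite a tiny correlation
    print(approx_p_value(0.05, 100))    # about 0.6: the same r with a HYPOTHETICAL smaller n is far from significant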

Page 228