Introduction to Biostatistics

Introduction to Biostatistics Larry Winner Department of Statistics University of Florida July 8, 2004 2 Contents 1 Introduction 7 1.1 Populations and Samples . 7 1.2 Types of Variables . 8 1.2.1 Quantitative vs Qualitative Variables . 8 1.2.2 Dependent vs Independent Variables . 10 1.3 Parameters and Statistics . 10 1.4 Graphical Techniques . 12 1.5 Basic Probability . 17 1.5.1 Diagnostic Tests . 21 1.6 Basic Study Designs . 23 1.6.1 Observational Studies . 23 1.6.2 Experimental Studies . 25 1.6.3 Other Study Designs . 27 1.7 Reliability and Validity . 29 1.8 Exercises . 31 2 Random Variables and Probability Distributions 35 2.1 The Normal Distribution . 35 2.1.1 Statistical Models . 39 2.2 Sampling Distributions and the Central Limit Theorem . 39 2.2.1 Distribution of Y .................................. 40 2.3 Exercises . 40 3 Statistical Inference – Hypothesis Testing 45 3.1 Introduction to Hypothesis Testing . 45 3.1.1 Large–Sample Tests Concerning µ1 − µ2 (Parallel Groups) . 47 3.2 Elements of Hypothesis Tests . 48 3.2.1 Significance Level of a Test (Size of a Test) . 48 3.2.2 Power of a Test . 49 3.3 Sample Size Calculations to Obtain a Test With Fixed Power . 51 3.4 Small–Sample Tests . 52 3.4.1 Parallel Groups Designs . 52 3.4.2 Crossover Designs . 56 3 4 CONTENTS 3.5 Exercises . 59 4 Statistical Inference – Interval Estimation 65 4.1 Large–Sample Confidence Intervals . 66 4.2 Small–Sample Confidence Intervals . 67 4.2.1 Parallel Groups Designs . 67 4.2.2 Crossover Designs . 68 4.3 Exercises . 69 5 Categorical Data Analysis 71 5.1 2 × 2 Tables . 72 5.1.1 Relative Risk . 72 5.1.2 Odds Ratio . 74 5.1.3 Extension to r × 2 Tables . 76 5.1.4 Difference Between 2 Proportions (Absolute Risk) . 76 5.1.5 Small–Sample Inference — Fisher’s Exact Test . 78 5.1.6 McNemar’s Test for Crossover Designs . 79 5.1.7 Mantel–Haenszel Estimate for Stratified Samples . 81 5.2 Nominal Explanatory and Response Variables . 83 5.3 Ordinal Explanatory and Response Variables . 85 5.4 Nominal Explanatory and Ordinal Response Variable . 87 5.5 Assessing Agreement Among Raters . 89 5.6 Exercises . 91 6 Experimental Design and the Analysis of Variance 97 6.1 Completely Randomized Design (CRD) For Parallel Groups Studies . 97 6.1.1 Test Based on Normally Distributed Data . 98 6.1.2 Test Based on Non–Normal Data . 103 6.2 Randomized Block Design (RBD) For Crossover Studies . 104 6.2.1 Test Based on Normally Distributed Data . 104 6.2.2 Friedman’s Test for the Randomized Block Design . 108 6.3 Other Frequently Encountered Experimental Designs . 109 6.3.1 Factorial Designs . 109 6.3.2 Crossover Designs With Sequence and Period Effects . 114 6.3.3 Repeated Measures Designs . 116 6.4 Exercises . 119 7 Linear Regression and Correlation 129 7.1 Least Squares Estimation of β0 and β1 . 130 7.1.1 Inferences Concerning β1 .............................134 7.2 Correlation Coefficient . 136 7.3 The Analysis of Variance Approach to Regression . 137 7.4 Multiple Regression . 139 CONTENTS 5 7.4.1 Testing for Association Between the Response and the Set of Explanatory Variables . 140 7.4.2 Testing for Association Between the Response and an Individual Explanatory Variable . 140 7.5 Exercises . 144 8 Logistic and Nonlinear Regression 151 8.1 Logistic Regression . 151 8.2 Nonlinear Regression . 156 8.3 Exercises . 158 9 Survival Analysis 163 9.1 Estimating a Survival Function — Kaplan–Meier Estimates . 163 9.2 Log–Rank Test to Compare 2 Population Survival Functions . 166 9.3 Relative Risk Regression (Proportional Hazards Model) . 168 9.4 Exercises . 171 10 Special Topics in Pharmaceutics 179 10.1 Assessment of Pharmaceutical Bioequivalence . 179 10.2 Dose–Response Studies . 181 10.3 Exercises . 183 A Statistical Tables 187 B Bibliography 193 6 CONTENTS Chapter 1 Introduction These notes are intended to provide the student with a conceptual overview of statistical methods with emphasis on applications commonly used in pharmaceutical and epidemiological research. We will briefly cover the topics of probability and descriptive statistics, followed by detailed descriptions of widely used inferential procedures. The goal is to provide the student with the information needed to be able to interpret the types of studies that are reported in academic journals, as well as the ability to perform such analyses. Examples are taken from journals in the pharmaceutical and health sciences fields. 1.1 Populations and Samples A population is the set of all measurements of interest to a researcher. Typically, the population is not observed, but we wish to make statements or inferences concerning it. Populations can be thought of as existing or conceptual. Existing populations are well–defined sets of data containing elements that could be identified explicitly. Examples include: PO1 CD4 counts of every American diagnosed with AIDS as of January 1, 1996. PO2 Amount of active drug in all 20–mg Prozac capsules manufactured in June 1996. PO3 Presence or absence of prior myocardial infarction in all American males between 45 and 64 years of age. Conceptual populations are non–existing, yet visualized, or imaginable, sets of measurements. This could be thought of characteristics of all people with a disease, now or in the near future, for instance. It could also be thought of as the outcomes if some treatment were given to a large group of subjects. In this last setting, we do not give the treatment to all subjects, but we are interested in the outcomes if it had been given to all of them. Examples include: PO4 Bioavailabilities of a drug’s oral dose (relative to i. v. dose) in all healthy subjects under identical conditions. PO5 Presence or absence of myocardial infarction in all current and future high blood pressure patients who receive short–acting calcium channel blockers. 7 8 CHAPTER 1. INTRODUCTION PO6 Positive or negative result of all pregnant women who would ever use a particular brand of home pregnancy test. Samples are observed sets of measurements that are subsets of a corresponding population. Samples are used to describe and make inferences concerning the populations from which they arise. Statistical methods are based on these samples having been taken at random from the population. However, in practice, this is rarely the case. We will always assume that the sample is representative of the population of interest. Examples include: SA1 CD4 counts of 100 AIDS patients on January 1, 1996. SA2 Amount of active drug in 2000 20–mg Prozac capsules manufactured during June 1996. SA3 Prior myocardial infarction status (yes or no) among 150 males aged 45 to 64 years. SA4 Bioavailabilities of an oral dose (relative to i.v. dose) in 24 healthy volunteers. SA5 Presence or absence of myocardial infarction in a fixed period of time for 310 hypertension patients receiving calcium channel blockers. SA6 Test results (positive or negative) among 50 pregnant women taking a home pregnancy test. 1.2 Types of Variables 1.2.1 Quantitative vs Qualitative Variables The measurements to be made are referred to as variables. This refers to the fact that we acknowledge that the outcomes (often referred to as endpoints in the medical world) will vary among elements of the population. Variables can be classified as quantitative (numeric) or qualitative (categorical). We will use the terms numeric and categorical throughout this text, since quantitative and qualitative are so similar. The types of analyses used will depend on what type of variable is being studied. Examples include: VA1 CD4 count represents numbers (or counts) of CD4 lymphocytes per liter of peripheral blood, and thus is numeric. VA2 The amount of active drug in a 20–mg Prozac capsule is the actual number of mg of drug in the capsule, which is numeric. Note, due to random variation in the production process, this number will vary and never be exactly 20.0–mg. VA3 Prior myocardial infarction status can be classified in several ways. If it is classified as either yes or no, it is categorical. If it is classified as number of prior MI’s, it is numeric. Further, numeric variables can be broken into two types: continuous and discrete. Continuous variables are values that can fall anywhere corresponding to points on a line segment. Some examples are weight and diastolic blood pressure. Discrete variables are variables that can take on only a finite (or countably infinite) number of outcomes. Number of previous myocardial infarctions 1.2. TYPES OF VARIABLES 9 and parity are examples of discrete variables. It should be noted that many continuous variables are reported as if they were discrete, and many discrete variables are analyzed as if they were continuous. Similarly, categorical variables also are commonly described in one of two ways: nominal and ordinal. Nominal variables have distinct levels that have no inherent ordering. Hair color and sex are examples of variables that would be described as nominal. On the other hand, ordinal variables have levels that do follow a distinct ordering. Examples in the medical field typically relate to degrees of change in patients after some treatment (such as: vast improvement, moderate improvement, no change, moderate degradation, vast degradation/death). Example 1.1 In studies measuring pain or pain relief, visual analogue scales are often used. These scales involve a continuous line segment, with endpoints labeled as no pain (or no pain relief) and severe (or complete.

Introduction to Biostatistics

BIOGRAPHICAL SKETCH Garrett M. Fitzmaurice Associate Professor Of

TUTORIAL in BIOSTATISTICS: the Self-Controlled Case Series Method

Receiver Operating Characteristic (ROC) Curve: Practical Review for Radiologists

Biometrics & Biostatistics

The American Statistician

Epidemiology and Biostatistics (EPBI) 1

Medicine Safety “Pharmacovigilance” Fact Sheet

Biostatistics (EHCA 5367)

Biostatistics

Biostatistics (BIOSTAT) 1

Econometrica 0012-9682 Biostatistics 1465-4644 J R

Curriculum Vitae Jianwen Cai