
tutorial

The Normal distribution

Jenny V. Freeman and Steven A. Julious

Medical Statistics Group, School of Health and Related Research, University of Sheffield, Community Sciences Centre, Northern General Hospital, Herries Road, Sheffield, UK

Introduction

The first two tutorials in this series have focused on displaying data and on simple methods for describing and summarising them. There has been little discussion so far of the theory that underlies these methods. In this note, we will start to explore some of the basic concepts underlying much statistical methodology. We will describe the basic theory underlying the Normal distribution and the link between empirical distributions (the observed distribution of data in a sample) and theoretical probability distributions (the theoretical distribution of data in a population). In addition, we will introduce the idea of a confidence interval.

Theoretical probability distributions

Since it is rarely possible to collect information on an entire population, the aim of many statistical analyses is to use information from a sample to draw conclusions (or 'make inferences') about the population of interest. These inferences are facilitated by making assumptions about the underlying distribution of the measurement of interest in the population as a whole, and by applying an appropriate theoretical model to describe how the measurement behaves in the population. (Note that prior to any analysis, it is usual to make assumptions about the underlying distribution of the measurement being studied. These assumptions can then be investigated through various plots and figures of the observed data – e.g. a histogram for continuous data. These investigations are referred to as diagnostics and will be discussed throughout subsequent notes.) In the context of this note, the population is a theoretical concept used for describing an entire group, and one way of describing the distribution of a measurement in a population is by use of a suitable theoretical probability distribution. Probability distributions can be used to calculate the probability of different values occurring, and they exist for both continuous and categorical measurements.

In addition to the Normal distribution (described later in this note), there are many other theoretical distributions, including the chi-squared, binomial and Poisson distributions (these will be discussed in later tutorials). Each of these theoretical distributions is described by a particular mathematical expression (formally referred to as a model), and for each model there exist summary measures, known as parameters, which completely describe that particular distribution. In practice, parameters are usually estimated by quantities calculated from the sample, and these are known as statistics; that is, a statistic is a quantity calculated from a sample in order to estimate an unknown parameter in a population. For example, the Normal distribution is completely characterised by the population mean (µ) and population standard deviation (σ), and these are estimated by the sample mean (x̄) and sample standard deviation (s) respectively.
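To make the distinction between parameters and statistics concrete, the short sketch below (in Python, purely illustrative; the population values of 120 and 15 are invented for this example) simulates drawing a sample from a Normal population with known parameters and shows how the sample mean and sample standard deviation estimate them.

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # Population parameters (known here only because we are simulating the population)
    mu, sigma = 120.0, 15.0

    # A single sample of 100 individuals drawn from that population
    sample = rng.normal(loc=mu, scale=sigma, size=100)

    # Statistics: quantities calculated from the sample that estimate the parameters
    x_bar = sample.mean()       # estimates the population mean, mu
    s = sample.std(ddof=1)      # estimates the population standard deviation, sigma

    print(f"population mean {mu}, sample mean {x_bar:.1f}")
    print(f"population sd   {sigma}, sample sd   {s:.1f}")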

The Normal distribution

The Normal, or Gaussian, distribution (named in honour of the German mathematician C. F. Gauss, 1777–1855) is the most important theoretical distribution in statistics. At this point, it is important to stress that, in this context, the word 'normal' is a statistical term and is not used in the dictionary or clinical sense of conforming to what would be expected. Thus, in order to distinguish between the two, statistical and dictionary 'normal', it is conventional to use a capital letter when referring to the Normal distribution.

The basic properties of the Normal distribution are outlined in table 1. The distribution curve of data that are Normally distributed has a characteristic shape; it is bell-shaped, and symmetrical about a single peak (figure 1). For any given value of the mean, populations with a small standard deviation have a distribution clustered close to the mean (µ), while those with a large standard deviation have a distribution that is widely spread along the measurement axis, and the peak is more flattened.

As mentioned earlier, the Normal distribution is described completely by two parameters, the mean (µ) and the standard deviation (σ). This means that for any Normally distributed variable, once the mean and variance (σ²) are known (or estimated), it is possible to calculate the probability distribution for that population.

An important feature of a Normal distribution is that 95% of the data fall within 1.96 standard deviations of the mean – the unshaded area in the middle of the curve in figure 1. A summary measure often quoted for a sample is the pair of values given by the mean ± 1.96 × standard deviation (x̄ ± 1.96s). These two values are termed the Normal range, and represent the range within which 95% of the data are expected to lie. Note that 68.3% of the data lie within 1 standard deviation of the mean, while virtually all of the data (99.7%) will lie within 3 standard deviations (95.5% will lie within 2). The Normal distribution is important, as it underpins much of the statistical theory outlined both in this and later tutorials, such as the calculation of confidence intervals and linear modelling techniques.

Table 1: Properties of the Normal distribution

1. It is bell-shaped and has a single peak (unimodal)
2. It is symmetrical about the mean
3. It is uniquely defined by two parameters, the mean (µ) and the standard deviation (σ)
4. The mean, median and mode all coincide
5. The probability that a Normally distributed random variable, x, with mean µ and standard deviation σ lies between the limits (µ – 1.96σ) and (µ + 1.96σ) is 0.95, i.e. 95% of the data for a Normally distributed random variable will lie between the limits (µ – 1.96σ) and (µ + 1.96σ)*
6. The probability that a Normally distributed random variable, x, with mean µ and standard deviation σ lies between the limits (µ – 2.58σ) and (µ + 2.58σ) is 0.99
7. Any position on the horizontal axis of figure 1 can be expressed as a number of standard deviations away from the mean value

*This fact is used in calculating the 95% (Normal) range for Normally distributed data.
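The percentages quoted above and in table 1 can be checked directly from the standard Normal cumulative distribution function; the short sketch below (illustrative only, using scipy, with a hypothetical mean of 120 and standard deviation of 15 for the Normal range) does exactly that.

    from scipy.stats import norm

    # P(mu - k*sigma < x < mu + k*sigma) for a Normally distributed variable x
    for k in (1.0, 1.96, 2.0, 2.58, 3.0):
        coverage = norm.cdf(k) - norm.cdf(-k)
        print(f"within {k:4.2f} standard deviations: {100 * coverage:.1f}% of the data")

    # Normal range (x_bar +/- 1.96 s) for a sample with hypothetical mean 120 and sd 15
    x_bar, s = 120.0, 15.0
    print(f"Normal range: {x_bar - 1.96 * s:.1f} to {x_bar + 1.96 * s:.1f}")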

Figure 1: The Normal distribution

The Central Limit Theorem (or the law of large numbers)

The Central Limit Theorem states that, given any series of independent, identically distributed random variables, their means will tend to a Normal distribution as the number of variables increases. Put another way, the distribution of sample means drawn from a population will be Normally distributed whatever the distribution of the actual data in the population, as long as the samples are large enough.

In order to illustrate this, consider the random numbers 0–9. The distribution of these numbers in a random numbers table would be uniform; that is, each number has an equal probability of being selected, and the shape of the theoretical distribution is represented by a rectangle. According to the Central Limit Theorem, if you were to select repeated random samples of the same size from this distribution and then calculate the means of these different samples, the distribution of these sample means would be approximately Normal, and this approximation would improve as the size of each sample increased. Figure 2a represents the distribution of the sample means for 500 samples of size 5. Even with such a small sample size, the approximation to the Normal is remarkable; repeating the exercise with samples of size 50 improves the fit to the Normal distribution (figure 2b). The other noteworthy feature of these two figures is that, as the size of the samples increases (from 5 to 50), the spread of the means decreases.


Figure 2: Distribution of means from 500 samples. a: samples of size 5, mean = 4.64, sd = 1.29; b: samples of size 50, mean = 4.50, sd = 0.41
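A simulation along the lines of figure 2 can be reproduced in a few lines of Python (a sketch only; the random seed is arbitrary, so the means and standard deviations obtained will differ slightly from those quoted in the caption).

    import numpy as np

    rng = np.random.default_rng(seed=2)
    digits = np.arange(10)   # the population: digits 0-9, each equally likely (a rectangular distribution)

    for n in (5, 50):
        # Draw 500 random samples of size n and calculate the mean of each sample
        samples = rng.choice(digits, size=(500, n))
        means = samples.mean(axis=1)

        # The 500 sample means are approximately Normally distributed around the
        # population mean (4.5), with a spread that shrinks as n increases
        print(f"n = {n:2d}: mean of sample means = {means.mean():.2f}, "
              f"sd of sample means = {means.std(ddof=1):.2f}")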

Each mean estimated from a sample is an unbiased estimate of the true population mean, and by repeating the sampling process many times we can obtain a sample of plausible values for the true population mean. Using the Central Limit Theorem, we can infer that 95% of sample means will lie within 1.96 standard deviations of the true population mean, where the standard deviation in question is that of the distribution of sample means (the standard error, described below). As we do not usually know the true population mean, the more important inference is that we can be 95% confident that the population mean will fall within 1.96 of these standard deviations of the sample mean. In reality, as we usually take only a single sample, we can use the Central Limit Theorem to construct an interval within which we are reasonably confident the true population mean will lie. This range of plausible values is known as the confidence interval, and the formula for the 95% confidence interval for a mean is given in table 2. Technically speaking, if a study were repeated many times and a 95% confidence interval calculated each time, 95% of these intervals would be expected to contain the true population mean. Crudely speaking, the confidence interval gives a range of plausible values for the true population mean. We will discuss confidence intervals further in subsequent notes, in the context of hypothesis tests and P-values.

In order to calculate the confidence interval, we need to be able to estimate the standard deviation of the sample mean. This is defined as the sample standard deviation, s, divided by the square root of the number of individuals in the sample, s/√n, and is usually referred to as the standard error. In order to avoid confusion, it is worth remembering that the standard deviation (of all the individuals in the sample) is used to make inferences about the spread of the measurement among individuals in the population, while the standard error is used to make inferences about the spread of the means: the standard deviation is for describing (the spread of the data), while the standard error is for estimating (how precisely the mean has been pinpointed).
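The distinction between the standard deviation and the standard error can be seen directly in a short sketch (again with invented example data; the key line is the division by the square root of n).

    import numpy as np

    rng = np.random.default_rng(seed=3)

    n = 100
    sample = rng.normal(loc=120.0, scale=15.0, size=n)   # a single simulated sample

    s = sample.std(ddof=1)     # standard deviation: describes the spread of individuals
    se = s / np.sqrt(n)        # standard error: how precisely the mean has been estimated

    print(f"sample sd (spread of individuals):      {s:.2f}")
    print(f"standard error (precision of the mean): {se:.2f}")

    # Quadrupling the sample size roughly halves the standard error,
    # but leaves the standard deviation essentially unchanged
    big_sample = rng.normal(loc=120.0, scale=15.0, size=4 * n)
    big_s = big_sample.std(ddof=1)
    print(f"with 4n individuals: sd = {big_s:.2f}, se = {big_s / np.sqrt(4 * n):.2f}")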

Table 2: Formula for the 95% confidence interval for a mean

x̄ − (1.96 × s/√n) to x̄ + (1.96 × s/√n)

where x̄ = sample mean, s = sample standard deviation and n = number of individuals in the sample.

Summary

In this tutorial, we have outlined the basic properties of the Normal distribution, discussed the Central Limit Theorem and outlined its importance to statistical theory. The Normal distribution is fundamental to many of the statistical tests outlined in subsequent tutorials, while the principles of the Central Limit Theorem enable us to calculate confidence intervals and make inferences about the population from which the sample is taken.
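To close, a brief worked example of the formula in table 2 (with purely hypothetical numbers: a sample mean of 120, a sample standard deviation of 15 and a sample size of 100):

    import math

    x_bar, s, n = 120.0, 15.0, 100   # hypothetical sample summary statistics

    se = s / math.sqrt(n)            # standard error = 15 / 10 = 1.5
    lower = x_bar - 1.96 * se        # 120 - 2.94 = 117.06
    upper = x_bar + 1.96 * se        # 120 + 2.94 = 122.94

    print(f"95% confidence interval for the mean: {lower:.2f} to {upper:.2f}")

With these values, we can be 95% confident that the true population mean lies between roughly 117.1 and 122.9.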