11. Data Transformations
Archives of Disease in Childhood 1993; 69: 260-264
Arch Dis Child: first published as 10.1136/adc.69.2.260 on 1 August 1993.

STATISTICS FROM THE INSIDE

M J R Healy

Additive and multiplicative effects

In the statistical analyses for comparing two groups of continuous observations which I have so far considered, certain assumptions have been made about the data being analysed. One of these is Normality of distribution; in both the paired and the unpaired situation, the mathematical theory underlying the significance probabilities attached to different values of t is based on the assumption that the observations are drawn from Normal distributions. In the unpaired situation, we make the further assumption that the distributions in the two groups which are being compared have equal standard deviations - this assumption allows us to simplify the analysis and to gain a certain amount of power by utilising a single pooled estimate of variance. It is necessary to stress that the importance of these assumptions is often exaggerated. Although they form part of the mathematical framework, departures from them have very little effect upon the outcome of the analyses unless they are particularly marked.

A third assumption is a good deal less obvious and a good deal more important. This is the assumption that the effect of the factor which defines the two groups is additive, meaning by this that it produces (apart from error) a constant difference between the readings in the two groups. Consider for example a paired situation, as when we compare readings before and after a treatment on the same subjects. For the purposes of analysis, we form the differences between the before and after readings for each subject, and the variability between these differences is regarded as error. Put another way, we assume that we can derive the after readings on the subjects (error apart) by adding a constant quantity to the corresponding before readings. Exactly the same is true of the unpaired situation, as when we compare independent samples of treated and control patients. Here the assumption is that we can derive the distribution of patient readings from that of control readings by shifting the latter bodily along the axis, and this again amounts to adding a constant amount to each of the control variate values (fig 1).

This is not the only way in which two groups of readings can be related in practice. Suppose I asked you to guess the size of the effect of some treatment for (say) increasing forced expiratory volume in one second in asthmatic children. Your reply might well be 'Oh, perhaps +20%'. This implies a multiplicative effect of the treatment. A child with an initial level of 2.0 l would be expected to increase this by 0.4 l, one with an initial level of 3.0 l would be expected to increase this by 0.6 l. The difference between the 'before' and 'after' readings varies systematically with the level of response. If data following this pattern were to be analysed using the assumption of an additive treatment effect, this kind of systematic discrepancy between the (after-before) differences would be treated as if it was random and interpreted as being due to error.

Logarithmic transformation of data

A multiplicative treatment effect of this kind can be converted to an additive effect by transforming all the original readings to logarithms before doing the analysis. Your memories of logarithms may be of a rather outdated aid to arithmetic, but their usefulness in statistical analysis goes far beyond this.
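The conversion from a multiplicative to an additive effect can be illustrated with a minimal sketch. It assumes an error-free +20% effect; the initial levels of 2.0 and 3.0 litres come from the text, while the 4.0 litre value and all variable names are added purely for illustration.

```python
import math

# Initial FEV1 levels in litres. 2.0 and 3.0 are the values used in the text;
# 4.0 is a hypothetical extra value added for illustration.
before = [2.0, 3.0, 4.0]
FACTOR = 1.20  # a +20% treatment effect, with no error term (an assumption)

after = [b * FACTOR for b in before]

# On the original scale the difference grows with the initial level.
raw_diffs = [a - b for a, b in zip(after, before)]

# On the log scale the difference is the same for every child: log10(1.20).
log_diffs = [math.log10(a) - math.log10(b) for a, b in zip(after, before)]

for b, rd, ld in zip(before, raw_diffs, log_diffs):
    print(f"before={b:.1f}  raw difference={rd:.2f}  log10 difference={ld:.4f}")
```

The raw differences rise with the starting level, whereas every log difference equals the logarithm of the multiplying factor, which is what an additive analysis requires.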
For statistical purposes, the only thing you need to remember about logarithms is that the logarithm of the product of two numbers is the sum of the logarithms of the numbers, while the logarithm of the ratio of two numbers is the difference between the logarithms. In symbols

log(a × b) = log(a) + log(b)
log(a / b) = log(a) - log(b)

Figure 1 Two distributions differing by a constant amount.

Correspondence to: Professor Healy, 23 Coleridge Court, Milton Road, Harpenden, Herts AL5 5LD. No reprints available.

There are various kinds of logarithm, depending upon what number is chosen to have a logarithm of 1.0. So-called 'common' logarithms are based on log(10) = 1.0 (and consequently log(100) = 2.0, log(1000) = 3.0, log(0.1) = -1.0 and so on, using the rules above). Mathematicians often use logarithms based on log(e) = 1.0, where e is a magic number equal to 2.718 ... and rather presumptuously call these 'natural' logarithms. The numbers that describe the successive titres in a series of twofold dilutions are actually 'binary' logarithms based on log(2) = 1.0. The numbers 10, e, and 2 are called the bases of the logarithms. Note that log(1) = 0.0, no matter what the base is (you can prove this if you like, using the second of the above rules). It is usually unimportant for statistical purposes what base is used, since the different kinds of logarithm differ by no more than a scale factor. The 'natural' logarithm of a number, for example, is simply the 'common' logarithm multiplied by 2.3026. However, if you have to quote logarithmic values in the text or tables of a paper, it is essential to specify the base of the logarithms so as to avoid ambiguity - you can write log10, loge, or log2 as required. I personally use logarithmic values so often that I keep in my head a two figure table of common logarithms -

No    1     2     3     4     5     6     7     8     9     10
log   0.00  0.30  0.48  0.60  0.70  0.78  0.85  0.90  0.95  1.00

You may like to satisfy yourself that, for example, log(24) = 1.38, log(0.15) = -0.82.

The treatment effect of +20% that we spoke of amounts to assuming that (apart from error) the treatment multiplies all the 'before' readings by the constant factor 1.20. When the observations are transformed to logarithms, the treatment effect correspondingly increases all the 'before' log observations by the constant amount 0.079, which is the logarithm (to the base 10) of 1.20.

A simple example, using paired data and logs to the base 10, is shown in the table. Here you can see quite a clear tendency for the difference within each pair of original readings to increase with the average level. The ratios show no such systematic tendency, suggesting that the data are closer to showing a constant ratio rather than a constant difference, in other words, that the effect of treatment is closer to being multiplicative than additive.

If you calculate the paired t value from the differences, you will get t = 0.80/0.093 = 8.60. If instead you analyse the differences of the logs of the original readings, that is the logs of the ratios, you will get t = 0.104/0.0067 = 15.52, a considerably larger value. Much more to the point, the mean and standard error for the differences are +0.80 and 0.093, so that the 95% confidence interval for the true mean difference (using the 5% significance value of t with 6 degrees of freedom) is 0.80 ± 2.447 × 0.093 = +0.57 to +1.03. For the logarithms of the ratios the mean and standard error are 0.104 and 0.0067, giving 95% limits for the true mean difference of 0.104 ± 2.447 × 0.0067 = 0.088 to 0.120. This interval relates to the differences between the logarithms of the original data; taking the antilogs of these figures, the 95% confidence interval for the true ratio in the original data is from 1.22 to 1.32, a much more incisive result.

Transformation of data and its effects

In some circumstances an additive effect is rather implausible, so that it should be considered whether transforming the data to logarithms would be advantageous. One such situation arises when the data values are necessarily positive (as with, for example, the concentrations of some chemical in the blood) but are close to a definite zero point, in the sense that zero is within one or two standard deviations of the mean so that the coefficient of variation (the ratio of the standard deviation over the mean) is 50% or more. Suppose that this describes the distribution of the control values in a trial; then with an additive treatment effect, the treated values would have to start abruptly at some non-zero value, an unlikely state of affairs. A multiplicative effect as shown in fig 2 is a more realistic model for data of this kind. A great many measurements, both physiological and biochemical, exhibit this behaviour.

But it is worth noticing that this situation, marked by a large ratio of standard deviation to mean, is also unlikely to meet the other two assumptions I have mentioned, of Normality and equal variability. With a mean close to zero in the above sense, a Normal distribution would necessarily imply an appreciable probability for the impossible negative values.
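The paired comparison on the original and the logged scales, using the seven pairs of readings in the article's table, can be reproduced with a short sketch. The function name `paired_summary` is hypothetical, and the critical value 2.447 is taken from the text rather than computed. Because the sketch works from unrounded logarithms and standard errors, its figures may differ slightly in the last digit from the rounded values quoted in the article.

```python
import math
import statistics

# Paired readings from the article's table.
before = [2.0, 2.2, 2.5, 2.8, 3.2, 3.7, 4.3]
after = [2.4, 2.9, 3.2, 3.6, 4.3, 4.5, 5.4]

# 5% significance value of t with 6 degrees of freedom, as quoted in the text.
T_CRIT = 2.447


def paired_summary(diffs):
    """Return (mean, standard error, t, 95% limits) for paired differences."""
    n = len(diffs)
    mean = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)  # sample SD divided by sqrt(n)
    limits = (mean - T_CRIT * se, mean + T_CRIT * se)
    return mean, se, mean / se, limits


# Analysis of the differences on the original scale.
raw = paired_summary([a - b for a, b in zip(after, before)])

# Analysis of the differences of the logs, that is, the logs of the ratios.
logged = paired_summary(
    [math.log10(a) - math.log10(b) for a, b in zip(after, before)]
)

# Antilogs of the log-scale limits give limits for the true ratio.
ratio_limits = tuple(10 ** x for x in logged[3])

print("original scale:", raw)
print("log10 scale:   ", logged)
print("95% limits for the ratio:", ratio_limits)
```

Taking antilogs only at the final step keeps the whole analysis on the scale where the effect is approximately additive and returns to the original units just for reporting.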
Table: Logarithmic transformation

                 Before   After   Difference      Ratio
Original data    2.0      2.4     +0.4            1.20
                 2.2      2.9     +0.7            1.32
                 2.5      3.2     +0.7            1.28
                 2.8      3.6     +0.8            1.29
                 3.2      4.3     +1.1            1.34
                 3.7      4.5     +0.8            1.22
                 4.3      5.4     +1.1            1.26
Mean (SE)                         +0.80 (0.093)

Logged data      0.301    0.380   +0.079
                 0.342    0.462   +0.120
                 0.398    0.505   +0.107
                 0.447    0.556   +0.109
                 0.505    0.633   +0.128
                 0.568    0.653   +0.085
                 0.633    0.732   +0.099
Mean (SE)                         +0.104 (0.0067)

Figure 2 Two distributions differing by a multiplicative factor.

Figure 2 also shows that, with a fixed terminus at zero for both distributions, the