Median Regression

Volume 21, Number 1, February/March 2015

Sergey Tarima, PhD, Division of Biostatistics, MCW

Mean versus Median

Ordinary least-squares regression models predict conditional means, and they have become very popular in medical research. These models were extended to a wide class of generalized linear models, which includes but is not limited to logistic, Poisson, exponential and normal regression models. What unites all of these regressions is that they model the conditional mean. The problem, however, is that the mean may not be the best, and is certainly not the only, way to describe the behavior of a random variable. For example, Figure 1 shows a histogram of 1000 draws from the distribution of a normal random variable X with mean equal to one and standard deviation equal to two. If we calculate estimates of the mean and median from this sample of 1000 observations, we get 1.08 and 1.05, respectively. Both estimate the same population value, because for a normal distribution the mean and median coincide: mean = 1 and median = 1.
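The comparison above can be reproduced with a short simulation. The sketch below is illustrative only: the seed and the variable names are arbitrary assumptions, and the population mean of 1 is taken from the population values stated above, so the printed estimates will differ from the 1.08 and 1.05 reported in the text.

```r
# Illustrative sketch; the seed is an arbitrary assumption, not the author's.
set.seed(2015)

# 1000 draws from a normal distribution with mean 1 and standard deviation 2
x <- rnorm(1000, mean = 1, sd = 2)

# For a symmetric distribution both estimates target the same value (here 1)
est <- c(mean = mean(x), median = median(x))
print(round(est, 2))
```

Rerunning with a different seed changes both estimates slightly, but they stay close to each other because the normal distribution is symmetric.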

© 2015 by Division of Biostatistics, Medical College of Wisconsin

Figure 1: Histogram of 1000 draws from a normally distributed random variable with mean equal to one and standard deviation equal to two.

The situation becomes more complex when the observations come from some other distribution. For example, Figure 2 reports a histogram based on 1000 observations from a lognormal distribution with µ = 1 and σ = 1. Note that the parameters of the lognormal distribution, µ and σ, are the mean and standard deviation of log(X), not of X itself. The population mean and median of X are exp(µ + σ²/2) and exp(µ), respectively. In our case the mean is exp(3/2) and the median is exp(1).

Figure 2: Histogram of 1000 draws from a lognormally distributed random variable with log-scale parameters µ = 1 and σ = 1.

Using the data from Figure 2, the mean and median are estimated as 4.55 and 2.64. This example highlights the fact that the median is substantially more robust than the mean: the median is less affected by the presence of extremely low or extremely high observations.
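A minimal R sketch of the lognormal comparison, assuming log-scale parameters µ = 1 and σ = 1 (the values consistent with the stated population mean exp(3/2) and median exp(1)). The seed and variable names are illustrative assumptions, so the sample estimates will not reproduce 4.55 and 2.64 exactly.

```r
# Illustrative sketch; seed and names are assumptions, not the author's code.
set.seed(2015)

mu    <- 1   # mean of log(X)
sigma <- 1   # standard deviation of log(X)

# Population values implied by the lognormal formulas
pop_mean   <- exp(mu + sigma^2 / 2)  # exp(3/2), about 4.48
pop_median <- exp(mu)                # exp(1),   about 2.72

# Sample estimates from 1000 lognormal draws
z <- rlnorm(1000, meanlog = mu, sdlog = sigma)
print(c(pop_mean = pop_mean, sample_mean = mean(z),
        pop_median = pop_median, sample_median = median(z)))
```

Because the lognormal distribution has a long right tail, the sample mean is pulled upward by a few large draws while the sample median is not.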

Mean versus median regression

We demonstrate the two regression models using statistical software called “R”, which is freely downloadable from http://cran.r-project.org/.

Normal model

Consider an example where the outcome is Y = 0 + 1X plus noise: X is a standard half-normal random variable (the absolute value of a normal random variable with zero mean and unit variance), and Y is a normal random variable with mean equal to X and unit variance. Figure 3 shows a scatterplot of 100 observations generated from this model. We can see a positive linear association between X and Y. When we fit a linear regression model (see the following R code and its output),

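R code
    x <- abs(rnorm(100))               # standard half-normal predictor
    y <- rnorm(100, mean = x, sd = 1)  # normal outcome with mean x and unit variance
    summary(lm(y ~ x))                 # least-squares (mean) regression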

Output
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  0.04667    0.17062   0.274    0.785
    x            0.95973    0.17109   5.610 1.88e-07 ***

we see that the linear association with the mean (and with the median, because in this case the conditional mean and median coincide) exists and is statistically significant. The parameter describing this linear association is estimated as 0.96 with a standard error of 0.17.
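For intuition about what the fitted mean regression reports, the intercept and slope from lm coincide with the closed-form least-squares expressions. The sketch below assumes a half-normal X generated as abs(rnorm(100)); the seed and variable names are arbitrary assumptions, so the numbers differ from the output above.

```r
# Illustrative sketch; seed and names are assumptions, not the author's code.
set.seed(2015)
x <- abs(rnorm(100))               # standard half-normal predictor
y <- rnorm(100, mean = x, sd = 1)  # outcome with conditional mean x

fit <- lm(y ~ x)

# Closed-form least-squares estimates
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# Both rows should agree up to floating-point error
print(rbind(lm = coef(fit), closed_form = c(intercept, slope)))

# Standard errors as reported in the "Std. Error" column of summary()
print(coef(summary(fit))[, "Std. Error"])
```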

Figure 3: 100 observations from Y = 0 + 1*X, where X follows a half-normal distribution.

We can also fit a median regression to data generated from the same model:

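R code
    library(quantreg)                  # provides rq()
    x <- abs(rnorm(100))               # standard half-normal predictor
    y <- rnorm(100, mean = x, sd = 1)  # normal outcome with mean x and unit variance
    summary(rq(y ~ x), se = "iid")     # median regression (default tau = 0.5)

Output
    Coefficients:
                   Value Std. Error  t value Pr(>|t|)
    (Intercept) -0.03999    0.18739 -0.21342  0.83144
    x            0.97197    0.18791  5.17253  0.00000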

The findings are comparable: the slope estimate is now 0.97 and its standard error is 0.19, only slightly higher than the standard error of 0.17 we observed for the linear regression fit.
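Median regression chooses the intercept and slope that minimize the sum of absolute residuals, rather than the sum of squared residuals used by least squares. The base-R sketch below illustrates that criterion with a general-purpose optimizer. It is not how rq computes its fit (rq relies on dedicated algorithms), and the helper name lad_loss, the seed, and the starting values are assumptions made for illustration.

```r
# Illustrative sketch; lad_loss, the seed and the start values are assumptions.
set.seed(2015)
x <- abs(rnorm(100))               # standard half-normal predictor
y <- rnorm(100, mean = x, sd = 1)  # outcome with conditional mean x

# Sum of absolute residuals for a candidate (intercept, slope) pair
lad_loss <- function(beta) sum(abs(y - beta[1] - beta[2] * x))

# Start from the least-squares fit and minimize the absolute-error criterion
ols <- coef(lm(y ~ x))
lad <- optim(par = ols, fn = lad_loss)  # Nelder-Mead is optim's default method

print(rbind(least_squares = ols, least_absolute = lad$par))
```

If the quantreg package is installed, coef(rq(y ~ x)) on the same x and y should be close to the least_absolute row, although a general-purpose optimizer only approximates the exact minimizer.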

Contaminated normal model

Now we make the previous scenario more problematic by contaminating the sample:

    y[1:10] <- pmax(y[1:10], -y[1:10]) * 30

This code converts the first 10 Y-observations into their absolute values and multiplies them by 30. The results can be seen in Figure 4.
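The contamination step can be checked directly: pmax(y, -y) is an elementwise absolute value, so each of the first 10 observations becomes 30 times its absolute value. The sketch below regenerates data under the same model assumptions as above (half-normal X via abs(rnorm(100)), arbitrary seed, hypothetical name y_cont) and compares how far the sample mean and median of Y move.

```r
# Illustrative sketch; seed and the name y_cont are assumptions.
set.seed(2015)
x <- abs(rnorm(100))               # standard half-normal predictor
y <- rnorm(100, mean = x, sd = 1)  # outcome with conditional mean x

# Contaminate a copy so the clean outcome stays available for comparison
y_cont <- y
y_cont[1:10] <- pmax(y[1:10], -y[1:10]) * 30

# pmax(y, -y) is the same as abs(y)
same <- isTRUE(all.equal(pmax(y[1:10], -y[1:10]), abs(y[1:10])))
print(same)

# Shift in the sample mean and the sample median caused by the contamination
print(c(mean_shift   = mean(y_cont) - mean(y),
        median_shift = median(y_cont) - median(y)))
```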
