Median Regression

Sergey Tarima, PhD, Division of Biostatistics, MCW
Volume 21, Number 1, February/March 2015

Mean versus Median

Ordinary least-squares regression models predict the conditional mean, and they have become very popular in medical research. These models were extended to the wide class of generalized linear models, which includes, but is not limited to, logistic, Poisson, exponential, and normal regression models. What unites all of these regressions is that they model the conditional mean. The problem, however, is that the mean may not be the best, and is certainly not the only, way to describe the behavior of a random variable.

For example, Figure 1 shows a histogram of 1000 draws from a normal random variable X with mean equal to one and standard deviation equal to two. If we formally calculate estimates of the mean and median from this sample of 1000 observations, we get 1.08 and 1.05, respectively. These are estimates of the corresponding population parameters, mean = 1 and median = 1.

Figure 1: histogram of 1000 draws from a normally distributed random variable with mean equal to one and standard deviation equal to two.

The situation becomes more complex when the data come from some other distribution. For example, Figure 2 reports a histogram based on 1000 observations from a lognormal distribution with µ = 1 and σ = 1. We should note that the parameters of the lognormal distribution, µ and σ, have a different meaning and do not refer to the population mean and standard deviation: the population mean and median are exp(µ + σ²/2) and exp(µ), respectively. In our case the mean is exp(3/2) and the median is exp(1).

Figure 2: histogram of 1000 draws from a log-normally distributed random variable with parameters µ = 1 and σ = 1.

Using the data from Figure 2, the mean and median are estimated as 4.55 and 2.64. This example highlights the fact that the median is a substantially more robust parameter than the mean; the median is much less affected by the presence of very low or very high observations.

Mean versus median regression

We demonstrate the two regression models using the statistical software "R", which is freely downloadable from http://cran.r-project.org/.

Normal model

Consider an example where the outcome is Y = 0 + 1*X, where X is a standard half-normal random variable (the absolute value of a normal with zero mean and unit variance) and Y is a normal random variable with mean X and unit variance. Figure 3 shows a scatterplot of 100 observations generated from this model; a positive linear association between X and Y is clearly visible. We fit a linear regression model with the following R code.

R code:

x <- abs(rnorm(100))               # X: standard half-normal
y <- rnorm(100, mean = x, sd = 1)  # Y: normal with mean X and unit variance
summary(lm(y ~ x))

Output:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.04667    0.17062   0.274    0.785
x            0.95973    0.17109   5.610 1.88e-07 ***

We see that the linear association with the mean (and with the median, because in this case the mean and median coincide) exists and is significant. The slope parameter describing this linear association is estimated as 0.96 with a standard error of 0.17.

Figure 3: 100 observations from Y = 0 + 1*X, where X follows a half-normal distribution.
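Figure 3 itself is not reproduced in this text version. A scatterplot of this kind, with the fitted least-squares line overlaid, could be drawn from the simulated data above using base R graphics; the following is a minimal sketch, with illustrative axis labels rather than the original figure's formatting.

R code:

plot(x, y, xlab = "X", ylab = "Y")   # scatterplot of the 100 simulated (X, Y) pairs
abline(lm(y ~ x), lty = 2)           # overlay the fitted least-squares regression line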
We can also fit a median regression to the same dataset. The rq() function from the quantreg package fits quantile regression; its default quantile (tau = 0.5) is the median.

R code:

library(quantreg)
summary(rq(y ~ x), se = "iid")

Output:

Coefficients:
            Value     Std. Error  t value   Pr(>|t|)
(Intercept) -0.03999   0.18739    -0.21342   0.83144
x            0.97197   0.18791     5.17253   0.00000

The findings are comparable: the slope parameter is now 0.97 and its standard error is 0.19, only slightly higher than the standard error we observed for the linear regression fit.

Contaminated normal model

Now we make the previous scenario more problematic by adding sample contamination with

y[1:10] <- pmax(y[1:10], -y[1:10]) * 30

This code converts the first 10 Y-observations into their absolute values and multiplies them by 30. The results can be seen in Figure 4.
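To see how the two regressions react to this contamination, both models can be refit to the contaminated sample. The following is a minimal sketch, assuming x, y, and the quantreg setup from the code above are still in the workspace; the exact estimates will depend on the simulated sample.

R code:

y[1:10] <- pmax(y[1:10], -y[1:10]) * 30   # contaminate the first 10 observations, as above
summary(lm(y ~ x))                        # mean (least-squares) regression on contaminated data
summary(rq(y ~ x), se = "iid")            # median regression on contaminated data

Outliers of this size typically pull the least-squares estimates away from the true intercept and slope, while the median-regression fit is expected to change much less, illustrating the robustness of the median discussed above.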
