
2003 Joint Statistical Meetings - Section on Statistical Education

AN INVESTIGATION OF THE MEDIAN-MEDIAN METHOD OF LINEAR REGRESSION

Elizabeth J. Walters, Christopher H. Morrell, and Richard E. Auer
Mathematical Sciences Department, Loyola College in Maryland
4501 North Charles Street, Baltimore, MD 21210-2699
[email protected]

Keywords: Linear regression; Median-median line; Resistant line

1. Introduction

Statistics education has been impacted greatly by two trends. One trend is the continual advancement in computer technology that has systematically increased the power and refinement of statistical data analysis. In the 1960's, mainframe computers revolutionized statistical methodology. During the 1970's, personal computers placed computing capability right on the desks of individual researchers. This evolution has continued to progress, and now grammar school students learn data skills literally in the palm of their hands using graphing calculators.

The second trend is the extension of statistics courses earlier in students' education. Younger students are being exposed to the question of what truths lurk beneath the surface of data. These students are now being trained to use statistics as the primary tool for researching any topic within any field.

Evidence of both trends is seen when elementary and secondary school students use graphing calculators to perform data analysis on exams and homework assignments. Much of this early statistical training is being accomplished by teaching the basics of what has been termed exploratory data analysis.

As the first pioneer in exploratory data analysis, Tukey (1971, p. v) effectively defined it as looking at data to see what it seems to say. It concentrates on simple arithmetic and easy-to-draw pictures. It regards whatever appearances we have recognized as partial descriptions, and tries to look beneath them for new insights. In his recommendations for what type of activity belongs in introductory statistics coursework, Hahn (1988, p. 27) said: "When we get to data analysis, we should stress graphical methods or exploratory data analysis, over formal statistical procedures." Cobb and Moore (1997, p. 815) echoed the sentiments of Hahn, saying: "Students like exploratory analysis and find that they can do it, a substantial bonus when teaching a subject feared by many. Engaging them early on in the interpretation of results, before the harder ideas come to their attention, can help establish good habits for when you get to inference."

Often, one major goal of young statisticians as they learn exploratory data analysis is studying the relationship between variables. When the data is bivariate in two numerical variables, this usually involves fitting a straight line through points on a scatterplot. The graphing calculator is programmed with two methods of line-fitting for bivariate data. One method, fitting the least squares line, is a procedure well studied and routinely applied. But finding what is called the median-median line, on the other hand, is not so well known or understood. This paper considers the performance of these two line-fitting techniques as an aid to budding statisticians and their teachers as they encounter exploratory analysis. We intend to illuminate the behavior of the relatively unknown median-median line with several data sets and through a small simulation study. Readers may also compare how the median-median line reacts in various data settings to the performance of the least squares line.

2. Historical Evolution of the Median-Median Line

The median-median line traces back to a line-fitting approach proposed by Wald (1940). He suggested a very simple method where the points on a scatterplot are separated into a left half and a right half based on the median of the x-scores in a sample of bivariate data. The means of the x-scores and the y-scores are calculated using the data from only the left half of the scatterplot and then calculated using only the data on the right half. Concentrating on the two points (x̄_L, ȳ_L) and (x̄_R, ȳ_R), Wald proposed finding a line connecting these points that is then adjusted up or down to better fit the full array of points on the scatterplot. (The subscripts L and R denote which half of the data is used, and no subscript implies the means are taken over the entire sample.) His line of fit has slope b = (ȳ_R − ȳ_L)/(x̄_R − x̄_L) and y-intercept a = ȳ − b·x̄.

A similar procedure suggested by Nair and Shrivastava (1942) breaks up the points on a scatterplot into three regions, with each region containing about the same number of points. The x and y means of the points in the left and right regions are then used to find the slope of the line of fit, much as Wald suggested.

Brown and Mood (1951) used the two-region approach but found the slope of the line of fit using medians in place of means. The primary advantage of using this measure of center comes from the median's inherent ability to resist the strong effect of outliers. Most students of statistics know that the mean is greatly impacted by outlying cases since they are included with equal weight with the rest of the data in the sum of the scores. But the median takes on the same value whether the largest score in a sample is just somewhat larger than the rest of the data or is much larger than the second biggest score. In the context of fitting points on a scatterplot, this implies that a single point far from the general sloping trend of the rest of the points would not apply such a large "tug" on the location of the line of fit if the median is used to find the line.
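The three-group construction underlying the calculator's median-median line can be sketched in code. This is a minimal sketch only: the slope comes from the outer groups' median points and the intercept is averaged over all three groups, but the exact tie-handling rules that graphing calculators use when splitting the data are not reproduced here, so the simple equal-thirds split is an assumption.

```python
import statistics

def median_median_line(xs, ys):
    """Fit a median-median line to bivariate data (minimal sketch).

    Slope: from the median points of the outer thirds.
    Intercept: average of the intercepts implied by all three
    groups' median points.  Real implementations adjust group
    sizes for tied x-values; this sketch splits sorted data into
    near-equal thirds, giving any remainder to the middle group.
    """
    pts = sorted(zip(xs, ys))            # order by x
    n = len(pts)
    k = n // 3
    sizes = [k, n - 2 * k, k]            # outer groups equal in size
    groups, i = [], 0
    for s in sizes:
        groups.append(pts[i:i + s])
        i += s
    # summary point of each group: (median x, median y)
    mx = [statistics.median(p[0] for p in g) for g in groups]
    my = [statistics.median(p[1] for p in g) for g in groups]
    slope = (my[2] - my[0]) / (mx[2] - mx[0])
    intercept = sum(my[j] - slope * mx[j] for j in range(3)) / 3
    return slope, intercept
```

On perfectly linear data the method recovers the line exactly; its value shows on data with stray points, where the group medians ignore a single wild observation.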


Readers may recall that the least squares line is found by minimizing the sum of the squared distances that each point lies from the line. Since these distances are squared, Hartwig and Dearing (1979, p. 34) noted that "cases lying farther and farther from the regression line increase the sum of the squared residuals at an increasing rate....[and the line] will have to come reasonably close to them to satisfy the least squares criterion and, therefore, the least squares regression line will lack resistance to the excessive influence of a few atypical cases." This means that Brown and Mood's method is not only simple to apply, but also has the advantage of not allowing outlying cases to have undue impact on determining the line of fit.

Like Brown and Mood, Tukey (1971) utilized the medians in finding his line of fit, but he did so borrowing the three-region approach of Nair and Shrivastava. His line of fit, called the resistant line, is considered a basic methodology of exploratory data analysis. And it is Tukey's resistant line that serves as the basis for the graphing calculator's median-median line.

For young statisticians who wish to learn more about the broad methodology of exploratory data analysis, Velleman and Hoaglin (1981) offer a gentle review of the entire subject. To learn more about the median-median line at a more sophisticated mathematical level, readers are encouraged to consider Emerson and Hoaglin (1983) and Johnstone and Velleman (1985).

We now offer the reader a relatively simple study of the comparative performance of the median-median line and the least squares line.

3. Examples

I. An investigation of the relationship between the number of manatees killed by boats in Florida and the number of powerboat registrations in that state, from 1977 to 2000 (Moore, 1997, p 347; Triola, 2004, p 495), shows a strong positive correlation with no obvious outliers or influential points (r = 0.945, p < 0.001). The median-median and least-squares lines have similar equations, as is evident from the scatterplot below (Figure 1).

[Figure 1 scatterplot: manatees killed (y-axis) versus boat registrations in 10,000s (x-axis)]

Figure 1. Scatterplot of manatees killed by powerboats versus number of boat registrations. ───── Median-Median; ── ── ── Least Squares.

II. A scatterplot of scores on a test of mental aptitude (Gesell Adaptive Test) versus age at first word, for 21 children (Moore & McCabe, 1999, p 149), shows a moderate negative correlation (r = -0.640, p = 0.002). Three points in this scatterplot are notable. Two children (*) began speaking at a much later age than the rest of the children. Another child (°) scored much higher than we would expect, given his age at first word. The scatterplot in Figure 2 compares the median-median line to two least-squares lines: one for the entire set of data, and the other without the data for the latest-speaking child. Omitting the data for the unusually high-scoring child increases the strength of the sample correlation but does not noticeably change the equation of the least-squares line; thus, that line is not included in the scatterplot.

[Figure 2 scatterplot: Gesell score (y-axis) versus age in months (x-axis)]

Figure 2. Scatterplot of Gesell Adaptive Score versus Age at Which Child Began Speaking. ───── Median-Median; ── ── ── Least Squares; ── ─ ── ─ Least Squares w/o influential observation.
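The lack of resistance that Hartwig and Dearing describe is easy to demonstrate numerically. The sketch below uses the closed-form least squares slope on a small hypothetical data set (not the manatee or Gesell data): one wild y-value drags the fitted slope well away from the trend of the other points.

```python
def least_squares_line(xs, ys):
    """Closed-form least squares fit: minimizes the sum of squared
    residuals, so slope = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, ybar - slope * xbar

# Hypothetical toy data: y = x exactly, then one wild y-value at the end.
xs = list(range(10))
clean = xs[:]                        # y = x, slope exactly 1
dirty = xs[:-1] + [xs[-1] + 50]      # last point pushed far upward

s_clean, _ = least_squares_line(xs, clean)
s_dirty, _ = least_squares_line(xs, dirty)
print(s_clean, s_dirty)   # the single outlier drags the slope well above 1
```

Because the outlier sits at the largest x-value, it has high leverage as well as a large residual, which is exactly the combination the least squares criterion cannot resist.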


III. A paper titled "The Relation Between Freely Chosen Meals and Body Habits" (American Journal of Clinical Nutrition (1983): 32-40) reports on the relationship between diet and body build. In particular, the researchers were interested in determining how the energy intake in one's diet is related to that individual's body build. The scatterplot below plots the dietary energy density against the Quetelet index (also known as body mass index) for nine individuals (Devore & Peck, 2001, p 145-146). Including the outlying point (*) representing an individual with an unusually low energy density for his body build, these data have a moderate positive correlation (r = 0.658, p = .054). Note that the least-squares line calculated without the outlier differs greatly from the original least-squares line but is almost identical to the median-median line.

[Figure 3 scatterplot: dietary energy density (y-axis) versus Quetelet index (x-axis)]

Figure 3. Scatterplot of dietary energy density versus Quetelet Index. ───── Median-Median; ── ── ── Least Squares; ── ─ ── ─ Least Squares w/o outlier.

4. Simulation Study

To compare the statistical properties of least squares and median-median estimates of the slope of a linear regression model, a simulation study considers a variety of conditions. The linear regression model is Y = β0 + β1X + ε. After the expected Y-value is computed, normal error is added. Outliers are created by multiplying the error at a particular design point by 5.

The conditions under which the simulation is performed are:
(i) two sets of design points of the explanatory variables (1, 2, 3, …, 24, and 2, 2, 4, 4, …, 24, 24),
(ii) two levels of the error (σ = 1 and 5), and
(iii) a number of outlier possibilities.

Without loss of generality, in most cases we assume that the outlier occurs in the upper group of X-values (rather than in the lower group of X-values). We consider:

a) No outliers.
b) Y13 = β0 + β1X13 + 5|ε13|: one high outlier in the middle of the middle group.
c) Y17 = β0 + β1X17 + 5|ε17|, Y18 = β0 + β1X18 − 5|ε18|: outliers at the first two X-values in the upper group, one high and one low.
d) Y17 = β0 + β1X17 + 5|ε17|: one high outlier at the first X-value in the upper group.
e) Y23 = β0 + β1X23 − 5|ε23|, Y24 = β0 + β1X24 + 5|ε24|: outliers at the two largest X-values, one high and one low.
f) Y17 = β0 + β1X17 + 5|ε17|, Y18 = β0 + β1X18 + 5|ε18|: high outliers at the first two X-values in the upper group.
g) Y17 = β0 + β1X17 − 5|ε17|, Y24 = β0 + β1X24 + 5|ε24|: a low outlier at the first X-value in the upper group and a high outlier at the largest X-value.
h) Y24 = β0 + β1X24 + 5|ε24|: a single high outlier at the largest X-value.
i) Y23 = β0 + β1X23 + 5|ε23|, Y24 = β0 + β1X24 + 5|ε24|: high outliers at the two largest X-values.

For each type of outlier generation, 1000 samples are simulated. For each sample, the least squares and median-median estimates of the intercept and slope are computed. Tables 1 and 2 provide the means, standard deviations, and mean square errors of the least squares and median-median estimates of the slope for the 1000 replications for each of the nine methods of generating the data. In the interest of space, we only present the results for the first design. The results for the second design are very similar. The tables also show the leverage value (Neter et al., 1996, p 375-377) for the design points at which the outlier is generated. Leverage values help to identify outlying X-values that, in conjunction with extreme Y-observations, may lead to data points that will have a large influence on the slope and intercept of the line. It is recommended to compare the leverage value with 2p/n, where p is the number of parameters in the linear regression model. In our case p = 2 (for the intercept and slope). Thus, 2p/n = 0.167. In our example, no design point has a leverage that exceeds this value, though the most extreme X-value comes close to 2p/n.

Cook's distance may be used to measure the influence of an individual case on all the fitted values. Cook's distance can also be viewed as the distance between the regression coefficients calculated with and without the ith case. If this distance is large, the ith case has a large influence on the fitted values and consequently the parameter estimates. An influence measure of the ith case on the parameter estimates may be calculated by DFBETAS (Neter et al., 1996, p. 382). This directly measures how much a parameter estimate changes with and without the ith case.
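The simulation design and the leverage screen just described can be sketched as follows. Only outlier scheme (h) is shown, and the true intercept β0 = 0 is an assumption (the paper's density figures fix only the true slope at β = 1); the least squares slope uses the standard closed form.

```python
import random
import statistics

def leverage(xs):
    """Hat-matrix diagonals for simple linear regression:
    h_i = 1/n + (x_i - xbar)^2 / sum((x_j - xbar)^2)."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return [1 / n + (x - xbar) ** 2 / sxx for x in xs]

def ls_slope(xs, ys):
    """Closed-form least squares slope estimate."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = sum((x - xbar) ** 2 for x in xs)
    return num / den

# First simulation design: X's 1, 2, ..., 24.
xs = list(range(1, 25))
hs = leverage(xs)
print(round(max(hs), 4))   # largest leverage, at X = 24, vs 2p/n = 0.167

# One replication of outlier scheme (h): Y24 = b0 + b1*X24 + 5|e24|.
def simulate_once(sigma=1.0, b0=0.0, b1=1.0):
    ys = [b0 + b1 * x + random.gauss(0, sigma) for x in xs]
    ys[-1] = b0 + b1 * xs[-1] + 5 * abs(random.gauss(0, sigma))
    return ls_slope(xs, ys)

random.seed(1)
slopes = [simulate_once() for _ in range(1000)]
print(round(statistics.mean(slopes), 2))  # slope biased upward by the outlier
```

The largest leverage, 0.1567 at X = 24, falls just under the 2p/n = 0.167 guideline, matching the values tabulated below.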


Figures 1 to 3 provide density estimates for three of these examples: no outliers, a moderate outlier example (high and low outliers at the end), and the most extreme outlier example (two high outliers at the end).

The simulation study comparing the performance of the two methods of regression shows less variation in the least-squares estimates when no outliers are present. In the moderate case (one high and one low outlier at the end), the least-squares method still tends to give unbiased estimates of the slope, but the median-median estimates are more consistent when population variance is larger (σ² = 25). In the extreme case (two high outliers at the end), however, the median-median method performs much better, with less variation and more accuracy than the least-squares method, especially in the case of greater population variance (σ² = 25 vs. σ² = 1).

5. Conclusion

The median-median regression line is intended as a simple, resistant alternative to the least-squares line. In particular, it provides a method of line-fitting that is computationally accessible to young students of statistics, while at the same time giving estimates that are less sensitive to outliers than those of least-squares regression.

The three examples provided in this paper show that when an outlier is present, the median-median line may be less affected by the outlier than the least-squares line when the outlier/influential point remains in the data set. The simulation study further supports the argument for using the median-median line as a resistant method of regression, especially in the most extreme case (two high outliers at the end) and in cases with greater population variance.

On the other hand, least-squares regression (under normal error assumptions) has a well-developed set of tools for inference and for detecting the presence of unusual observations. In contrast, inference about the median-median estimates would require bootstrapping to obtain approximate standard errors, confidence intervals/regions, and p-values for tests. Given this, in addition to the superior performance of least-squares estimation when either no outliers or moderate outliers are present, the median-median method of regression would not appear to be a valuable tool at either the professional or collegiate level. Middle and high school teachers, however, may find median-median regression to be an effective tool to introduce to students when addressing the issues of outliers and alternative estimators.

6. References

Brown, G. W., and Mood, A. M. (1951) On Median Tests for Linear Hypotheses, in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA: University of California Press, 159-166.

Cobb, G. W., and Moore, D. S. (1997) Mathematics, Statistics, and Teaching, The American Mathematical Monthly, 104(9), 801-823.

Devore, J., and Peck, R. (2001) Statistics: The Exploration and Analysis of Data, Fourth Edition, Pacific Grove, CA: Brooks/Cole.

Emerson, J. D., and Hoaglin, D. C. (1983) Resistant Lines for y-versus-x, in Understanding Robust and Exploratory Data Analysis, eds. D. C. Hoaglin, F. Mosteller, and J. W. Tukey, New York: John Wiley.

Hahn, G. J. (1988) What Should the Introductory Statistics Course Contain?, College Mathematics Journal, 19(1), 26-29.

Hartwig, F., and Dearing, B. E. (1979) Exploratory Data Analysis, Beverly Hills, CA: Sage Publications.

Jiang, C. L., and Hunt, J. N. (1983) The Relation Between Freely Chosen Meals and Body Habits, American Journal of Clinical Nutrition, 38, 32-40.

Johnstone, I. M., and Velleman, P. F. (1985) The Resistant Line and Related Regression Methods, Journal of the American Statistical Association, 80, 1041-1054.

Moore, D. S. (1997) Statistics: Concepts and Controversies, Fourth Edition, New York: W. H. Freeman.

Moore, D. S., and McCabe, G. P. (1999) Introduction to the Practice of Statistics, Third Edition, New York: W. H. Freeman.

Nair, K. R., and Shrivastava, M. P. (1942) On a Simple Method of Curve Fitting, Sankhya, 6, 121-132.

Neter, J., Kutner, M. H., Nachtsheim, C. J., and Wasserman, W. (1996) Applied Linear Statistical Models, Fourth Edition, Boston: McGraw-Hill.

Tukey, J. W. (1971) Exploratory Data Analysis, Reading, MA: Addison-Wesley.

Velleman, P. F., and Hoaglin, D. C. (1981) Applications, Basics and Computing of Exploratory Data Analysis, Boston, MA: Duxbury Press.

Wald, A. (1940) The Fitting of Straight Lines if Both Variables Are Subject to Error, Annals of Mathematical Statistics, 11, 282-300.
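A percentile bootstrap of the kind alluded to in the conclusion might look like the sketch below. The mm_slope helper is a simplified stand-in for the full median-median fit (equal outer thirds, no tie handling), and resampling (x, y) pairs with replacement is one of several reasonable resampling schemes.

```python
import random
import statistics

def mm_slope(xs, ys):
    """Median-median slope from the median points of the outer thirds
    (a minimal sketch; calculator implementations handle x-ties
    more carefully)."""
    pts = sorted(zip(xs, ys))
    k = len(pts) // 3
    lo_g, hi_g = pts[:k], pts[-k:]
    med_x = lambda g: statistics.median(p[0] for p in g)
    med_y = lambda g: statistics.median(p[1] for p in g)
    return (med_y(hi_g) - med_y(lo_g)) / (med_x(hi_g) - med_x(lo_g))

def bootstrap_slope_ci(xs, ys, reps=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the median-median slope:
    resample (x, y) pairs with replacement, refit, take quantiles."""
    rng = random.Random(seed)
    n = len(xs)
    boot = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        boot.append(mm_slope([xs[i] for i in idx], [ys[i] for i in idx]))
    boot.sort()
    return boot[int((alpha / 2) * reps)], boot[int((1 - alpha / 2) * reps) - 1]

# Hypothetical demo on a perfectly linear toy sample (y = 2x + 1).
xs = list(range(1, 25))
ys = [2 * x + 1 for x in xs]
lo, hi = bootstrap_slope_ci(xs, ys, reps=500)
```

On noiseless data every resampled fit returns the true slope, so the interval collapses to a point; with real data the quantiles give the approximate confidence limits the conclusion mentions.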


Table 1. Monte Carlo Results: Means, standard deviations, and mean square errors (× 10⁻³) of the least squares (LS) and median-median (MM) slope estimates for 1000 replications. Model: Y = β0 + β1X + ε; X's 1:24; ε ~ N(0, 1). Lower group: X1, …, X8; middle group: X9, …, X16; upper group: X17, …, X24.

| Type of outlier | Leverage | Estimate (LS) | Estimate (MM) | Std. Dev. (LS) | Std. Dev. (MM) | MSE ×10⁻³ (LS) | MSE ×10⁻³ (MM) |
|---|---|---|---|---|---|---|---|
| No outliers | | 1.0017 | 1.0037 | 0.0295 | 0.0509 | 0.873 | 2.604 |
| One high in middle of middle group | 0.0419 | 1.0035 | 1.0037 | 0.0295 | 0.0509 | 0.882 | 2.604 |
| First two X-values in upper group, one high and one low | 0.0593, 0.0680 | 0.9980 | 1.0291 | 0.0349 | 0.0568 | 1.222 | 4.073 |
| One high at first X-value in upper group | 0.0593 | 1.0170 | 1.0295 | 0.0315 | 0.0564 | 1.285 | 4.051 |
| Two largest X-values, one high and one low | 0.1375, 0.1567 | 1.0054 | 0.9703 | 0.0472 | 0.0557 | 2.257 | 3.985 |
| High at first two X-values in upper group | 0.0593, 0.0680 | 1.0357 | 1.0571 | 0.0338 | 0.0571 | 2.417 | 6.521 |
| Low at first X-value in upper group, high at largest X-value | 0.0593, 0.1567 | 1.0254 | 1.0037 | 0.0424 | 0.0511 | 2.442 | 2.625 |
| High at largest X-value | 0.1567 | 1.0405 | 1.0038 | 0.0405 | 0.0510 | 3.280 | 2.615 |
| High at two largest X-values | 0.1375, 0.1567 | 1.0749 | 1.0047 | 0.0482 | 0.0513 | 7.933 | 2.654 |
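As a consistency check on the tabled values, the MSE of a slope estimator should equal its squared bias plus its variance (up to rounding and the n − 1 convention in the standard deviation), since the true slope here is β1 = 1. The first row of Table 1 illustrates:

```python
def mse_from_summary(mean, sd, true_value=1.0):
    """MSE decomposition: squared bias plus variance."""
    return (mean - true_value) ** 2 + sd ** 2

# No-outliers row of Table 1, least squares: mean 1.0017, SD 0.0295.
mse = mse_from_summary(1.0017, 0.0295)
print(round(mse * 1e3, 3))   # in units of 10^-3, matches the tabled 0.873
```

The same decomposition explains why the least squares MSE grows so quickly in the high-leverage outlier rows: the bias term, not just the variance, inflates.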

Table 2. Monte Carlo Results: Means, standard deviations, and mean square errors (× 10⁻³) of the least squares (LS) and median-median (MM) slope estimates for 1000 replications. Model: Y = β0 + β1X + ε; X's 1:24; ε ~ N(0, 5²). Lower group: X1, …, X8; middle group: X9, …, X16; upper group: X17, …, X24.

| Type of outlier | Leverage | Estimate (LS) | Estimate (MM) | Std. Dev. (LS) | Std. Dev. (MM) | MSE ×10⁻³ (LS) | MSE ×10⁻³ (MM) |
|---|---|---|---|---|---|---|---|
| No outliers | | 1.0087 | 1.0141 | 0.1473 | 0.1850 | 21.773 | 34.424 |
| One high in middle of middle group | 0.0419 | | | | | | |
| First two X-values in upper group, one high and one low | 0.0593, 0.0680 | 0.9902 | 1.0614 | 0.1746 | 0.1948 | 30.581 | 41.717 |
| One high at first X-value in upper group | 0.0593 | 1.0848 | 1.0895 | 0.1576 | 0.1898 | 32.029 | 44.034 |
| Two largest X-values, one high and one low | 0.1375, 0.1567 | 1.0272 | 0.96190 | 0.2362 | 0.1985 | 56.530 | 40.854 |
| High at first two X-values in upper group | 0.0593, 0.0680 | 1.1786 | 1.1702 | 0.1691 | 0.1952 | 60.493 | 67.071 |
| Low at first X-value in upper group, high at largest X-value | 0.0593, 0.1567 | 1.1271 | 1.0155 | 0.2118 | 0.1969 | 61.014 | 39.010 |
| High at largest X-value | 0.1567 | 1.2023 | 1.0423 | 0.2026 | 0.1913 | 81.972 | 38.385 |
| High at two largest X-values | 0.1375, 0.1567 | 1.3743 | 1.0888 | 0.2408 | 0.2009 | 198.085 | 48.246 |


Figure 1. Density estimates of the slopes for the 1000 replications for least squares and median-median estimates and for σ = 1 and σ = 5: no outliers.

[Figure: "Density Estimates of Distributions of Slopes — No Outliers." Density curves for Median-Median (σ = 1), Least Squares (σ = 1), Median-Median (σ = 5), and Least Squares (σ = 5), with the true slope β = 1 marked. X-axis: slope estimate; y-axis: density.]
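Density curves like those in Figures 1 to 3 can be produced from the simulated slope estimates with a kernel density estimate. The sketch below uses a Gaussian kernel; the paper does not say how its densities were computed, so the method and the Silverman rule-of-thumb bandwidth are both assumptions.

```python
import math
import statistics

def gaussian_kde(sample, grid):
    """Gaussian kernel density estimate evaluated on a grid,
    with Silverman's rule-of-thumb bandwidth (an assumed choice)."""
    n = len(sample)
    h = 1.06 * statistics.stdev(sample) * n ** (-1 / 5)
    def phi(u):                                   # standard normal density
        return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return [sum(phi((g - x) / h) for x in sample) / (n * h) for g in grid]

# Tiny hypothetical sample of slope estimates clustered near 1.
vals = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05]
grid = [i * 0.01 for i in range(201)]             # 0.00 to 2.00
dens = gaussian_kde(vals, grid)
```

In practice one would feed the 1000 simulated slope estimates for each method and σ level through such an estimator and overlay the resulting curves, as the figures do.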

Figure 2. Density estimates of the slopes for the 1000 replications for least squares and median-median estimates and for σ = 1 and σ = 5: one high and one low outlier at the end.

[Figure: "Density Estimates of Distributions of Slopes — 1 High and 1 Low Outlier at the End." Density curves for Median-Median (σ = 1), Least Squares (σ = 1), Median-Median (σ = 5), and Least Squares (σ = 5), with the true slope β = 1 marked. X-axis: slope estimate; y-axis: density.]

Figure 3. Density estimates of the slopes for the 1000 replications for least squares and median-median estimates and for σ = 1 and σ = 5: two high outliers at the end.

[Figure: "Density Estimates of Distributions of Slopes — Two High Outliers at the End." Density curves for Median-Median (σ = 1), Least Squares (σ = 1), Median-Median (σ = 5), and Least Squares (σ = 5), with the true slope β = 1 marked. X-axis: slope estimate; y-axis: density.]
