An Investigation of the Median-Median Method of Linear Regression

2003 Joint Statistical Meetings - Section on Statistical Education

Elizabeth J. Walters, Christopher H. Morrell, and Richard E. Auer
Mathematical Sciences Department, Loyola College in Maryland
4501 North Charles Street, Baltimore, MD 21210-2699

Keywords: Linear regression; Median-median line; Least squares line

1. Introduction

Statistics education has been impacted greatly by two trends. One trend is the continual advancement in computer technology that has systematically increased the power and refinement of statistical data analysis. In the 1960s, mainframe computers revolutionized statistical methodology. During the 1970s, personal computers placed computing capability right on the desks of individual researchers. This evolution has continued to progress, and now grammar school students learn data skills literally in the palm of their hands using graphing calculators.

The second trend is the extension of statistics courses earlier in students’ education. Younger students are being exposed to the question of what truths lurk beneath the surface of data. These students are now being trained to use statistics as the primary tool for researching any topic within any field.

Evidence of both trends is seen when elementary and secondary school students use graphing calculators to perform data analysis on exams and homework assignments. Much of the early statistical training is being accomplished by teaching the basics of what has been termed exploratory data analysis.

As the first pioneer in exploratory data analysis, Tukey (1971, p. v.) effectively defined it as looking at data to see what it seems to say. It concentrates on simple arithmetic and easy-to-draw pictures. It regards whatever appearances we have recognized as partial descriptions, and tries to look beneath them for new insights. In his recommendations for what type of activity belongs in introductory statistics coursework, Hahn (1988, p. 27) said: “When we get to data analysis, we should stress graphical methods or exploratory data analysis, over formal statistical procedures.” Cobb and Moore (1997, p. 815) echoed the sentiments of Hahn saying: “Students like exploratory analysis and find that they can do it, a substantial bonus when teaching a subject feared by many. Engaging them early on in the interpretation of results, before the harder ideas come to their attention, can help establish good habits when you get to inference.”

Often, one major goal of young statisticians as they learn exploratory data analysis is studying the relationship between variables. When the data are bivariate in two numerical variables, this usually involves fitting a straight line through points on a scatterplot. The graphing calculator is programmed with two methods of line-fitting for bivariate data. One method, fitting the least squares line, is a procedure well-studied and routinely applied. The other method, finding what is called the median-median line, is not so well known or understood.

This paper considers the performance of these two line-fitting techniques as an aid to budding statisticians and their teachers as they encounter exploratory data analysis. We intend to illuminate the behavior of the relatively unknown median-median line with several data sets and through a small simulation study. Readers may also compare how the median-median line reacts in various data settings to the performance of the least squares line.

2. Historical Evolution of the Median-Median Line

The Median-Median Line traces back to a line-fitting approach proposed by Wald (1940). He suggested a very simple method where the points on a scatterplot are separated into a left half and a right half based on the median of the x-scores in a sample of bivariate data. The means of the x-scores and the y-scores are calculated using the data from only the left half of the scatterplot and then calculated using only the data on the right half. Concentrating on the two points, $(\bar{x}_R, \bar{y}_R)$ and $(\bar{x}_L, \bar{y}_L)$, Wald proposed finding a line connecting these points that is then adjusted up or down to better fit the full array of points on the scatterplot. (The subscripts R and L denote which half of the data is used and no subscript implies the means are taken over the entire sample.) His line of fit has slope $b = (\bar{y}_R - \bar{y}_L) / (\bar{x}_R - \bar{x}_L)$ and y-intercept $a = \bar{y} - b\bar{x}$.

A similar procedure suggested by Nair and Shrivastava (1942) breaks up the points on a scatterplot into three regions with each region containing about the same number of points. The x and y means of the points in the left and right regions are then used to find the slope of the line of fit much as Wald suggested.

Brown and Mood (1951) used the two-region approach but found the slope of the line of fit using medians in place of means. The primary advantage of using this measure of center comes from the median’s inherent ability to resist the strong effect of outliers. Most students of statistics know that the mean is greatly impacted by outlier cases since they are included with equal weight with the rest of the data in the sum of the scores. But the median takes on the same value whether the largest score in a data set is just somewhat larger than the rest of the data or is much larger than the second biggest score. In the context of fitting points on a scatterplot, this implies that a single point far from the general sloping trend of the rest of the points would not apply such a large “tug” on the location of the line of fit if the median is used to find the line.
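The two-region fits described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' code: the function names `split_halves` and `two_group_line` are hypothetical, the middle point is left out when the sample size is odd, and the median variant simply swaps medians for means in Wald's slope and intercept formulas, which may differ in detail from Brown and Mood's original procedure.

```python
from statistics import mean, median

def split_halves(points):
    """Split (x, y) points into a left and a right half by sorted x."""
    pts = sorted(points)
    half = len(pts) // 2
    # Assumption: with an odd sample size the middle point is left out.
    return pts[:half], pts[len(pts) - half:]

def two_group_line(points, center=mean):
    """Return (intercept, slope) of a two-region line of fit.

    center=mean gives Wald's formulas; center=median is the illustrative
    median-based variant described in the lead-in.
    """
    left, right = split_halves(points)
    xl, yl = center([x for x, _ in left]), center([y for _, y in left])
    xr, yr = center([x for x, _ in right]), center([y for _, y in right])
    b = (yr - yl) / (xr - xl)                    # slope through the two summary points
    a = (center([y for _, y in points])
         - b * center([x for x, _ in points]))   # intercept from the overall centers
    return a, b

# Made-up data on the line y = 2x + 1, then with one point pulled off the trend.
clean = [(x, 2 * x + 1) for x in range(1, 9)]
spoiled = clean[:-1] + [(8, 57)]

print(two_group_line(clean))                    # (1.0, 2.0)
print(two_group_line(spoiled))                  # (-5.25, 4.5)
print(two_group_line(spoiled, center=median))   # (1.0, 2.0)
```

On this made-up data the single outlying point more than doubles the mean-based slope, while the median-based version returns the original line unchanged.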
Readers may recall that the least squares line is found by minimizing the sum of the squared distances that each point lies from the line. Since these distances are squared, Hartwig and Dearing (1979, p. 34) noted that “cases lying farther and farther from the regression line increase the sum of the squared residuals at an increasing rate....[and the line] will have to come reasonably close to them to satisfy the least squares criterion and, therefore, the least squares regression line will lack resistance to the excessive influence of a few atypical cases.” This means that Brown and Mood’s method is not only simple to apply, but also has the advantage of not allowing outlying cases to have undue impact on determining the line of fit.

Like Brown and Mood, Tukey (1971) utilized the medians in finding his line of fit, but he did so borrowing the three-region approach of Nair and Shrivastava. His line of fit, called the resistant line, is considered a basic methodology of exploratory data analysis. And it is Tukey’s resistant line that serves as the basis for the graphing calculator’s median-median line.

For young statisticians who wish to learn more about the broad methodology of exploratory data analysis, Velleman and Hoaglin (1981) offer a gentle review of the entire subject. To learn more about the median-median line at a more sophisticated mathematical level, readers are encouraged to consider Emerson and Hoaglin (1983) and Johnstone and Velleman (1985).

We now offer the reader a relatively simple study of the comparative performance of the median-median line and the least squares line.

3. Examples

I. An investigation of the relationship between the number of manatees killed by boats in Florida and the number of powerboat registrations in that state, from 1977 to 2000 (Moore, 1997, p. 347; Triola, 2004, p. 495), shows a strong positive correlation with no obvious outliers or influential points (r = 0.945, p < 0.001). The median-median and least-squares lines have similar equations, as is evident from the scatterplot below (Figure 1).

Figure 1. Scatterplot of manatees killed by powerboats versus number of boat registrations (in 10,000s). Solid line: Median-Median; dashed line: Least Squares.

II. A scatterplot of scores on a test of mental aptitude (Gesell Adaptive Test) versus age at first word, for 21 children (Moore & McCabe, 1999, p. 149), shows a moderate negative correlation (r = -0.640, p = 0.002). Three points in this scatterplot are notable. Two children (*) began speaking at a much later age than the rest of the children. Another child (°) scored much higher than we would expect, given his age at first word. The scatterplot in Figure 2 compares the median-median line to two least-squares lines: one for the entire set of data, and the other without the data for the latest-speaking child. Omitting the data for the unusually high-scoring child increases the strength of the sample correlation but does not noticeably change the equation of the least-squares line; thus, that line is not included in the scatterplot.

Figure 2. Scatterplot of Gesell Adaptive Score versus Age (months) at Which Child Began Speaking. Solid line: Median-Median; dashed line: Least Squares; dash-dot line: Least Squares w/o influential observation.

III. A paper titled “The Relation Between Freely Chosen Meals and Body Habits” (Amer. J. Clinical Nutrition (1983): 32-40) reports on the relationship between diet and body build. In particular, the researchers were interested in determining how the energy intake in one’s diet is related to

Outlier   Generation                                      Description
a)        No outliers                                     No outliers
b)        $Y_{13} = 0 + 1X_{13} + 5|\varepsilon_{13}|$    One high outlier in the middle of the middle group.
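Outlier scenario (b) above, a single high outlier in the middle of the middle group, can be explored with a short Python sketch. It is a rough illustration under several assumptions that do not come from the paper: a sample of 25 points with x = 1, ..., 25, standard normal errors, and the three-group median-median rule as it is commonly taught for graphing calculators (slope through the outer median points, intercept averaged over the three median points). The function names are hypothetical.

```python
import random
from statistics import mean, median

def median_median_line(xs, ys):
    """Three-group median-median line; returns (intercept, slope)."""
    pts = sorted(zip(xs, ys))
    n = len(pts)
    k, r = divmod(n, 3)
    # Usual convention (an assumption here): a remainder of 1 goes to the
    # middle group, a remainder of 2 is shared by the two outer groups.
    outer = k + (1 if r == 2 else 0)
    groups = [pts[:outer], pts[outer:n - outer], pts[n - outer:]]
    mx = [median(x for x, _ in g) for g in groups]
    my = [median(y for _, y in g) for g in groups]
    b = (my[2] - my[0]) / (mx[2] - mx[0])   # slope from the outer median points
    a = (sum(my) - b * sum(mx)) / 3         # average intercept over three groups
    return a, b

def least_squares_line(xs, ys):
    """Ordinary least squares line; returns (intercept, slope)."""
    xbar, ybar = mean(xs), mean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sxy / sxx
    return ybar - b * xbar, b

rng = random.Random(0)                       # fixed seed for repeatability
n = 25                                       # hypothetical sample size
xs = list(range(1, n + 1))                   # hypothetical design points
eps = [rng.gauss(0, 1) for _ in range(n)]    # assumed standard normal errors
ys = [0 + 1 * x + e for x, e in zip(xs, eps)]

# Scenario (b): observation 13 becomes Y = 0 + 1*X + 5*|eps|.
ys_out = list(ys)
ys_out[12] = 0 + 1 * xs[12] + 5 * abs(eps[12])

print("no outlier :", median_median_line(xs, ys), least_squares_line(xs, ys))
print("scenario b :", median_median_line(xs, ys_out), least_squares_line(xs, ys_out))
```

Because the altered point sits in the middle group, the median-median slope is unaffected by construction; only the intercept can move, and only as far as the middle group's median of y moves.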
