
International Journal of Computer Vision, 6:1, 59-70 (1991)
© 1991 Kluwer Academic Publishers. Manufactured in The Netherlands.

Robust Regression Methods for Computer Vision: A Review

PETER MEER,* DORON MINTZ, AND AZRIEL ROSENFELD
Center for Automation Research, University of Maryland, College Park, MD 20742

DONG YOON KIM
Agency for Defense Development, P.O. Box 35, Taejeon, Korea

*Present address: Department of Electrical and Computer Engineering, Rutgers University, Piscataway, NJ 08855-0909, U.S.A.

Abstract. Regression analysis (fitting a model to noisy data) is a basic technique in computer vision. Robust regression methods that remain reliable in the presence of various types of noise are therefore of considerable importance. We review several robust estimation techniques and describe in detail the least-median-of-squares (LMedS) method. The method yields the correct result even when half of the data is severely corrupted. Its efficiency in the presence of Gaussian noise can be improved by complementing it with a weighted least-squares-based procedure. The high time-complexity of the LMedS algorithm can be reduced by a Monte Carlo type speed-up technique. We discuss the relationship of LMedS with the RANSAC paradigm and its limitations in the presence of noise corrupting all the data, and we compare its performance with the class of robust M-estimators. References to published applications of robust techniques in computer vision are also given.

1 Introduction

Regression analysis (fitting a model to noisy data) is an important statistical tool frequently employed in computer vision for a large variety of tasks. Tradition and ease of computation have made the least squares method the most popular form of regression analysis. The least squares method achieves optimum results when the underlying error distribution is Gaussian. However, the method becomes unreliable if the noise has nonzero-mean components and/or if outliers (samples with values far from the local trend) are present in the data. The outliers may be the result of clutter, large measurement errors, or impulse noise corrupting the data. At a transition between two homogeneous regions of the image, samples belonging to one region may become outliers for fits to the other region.

Three concepts are usually employed to evaluate a regression method: relative efficiency, breakdown point, and time complexity. The relative efficiency of a regression method is defined as the ratio between the lowest achievable variance for the estimated parameters (the Cramér-Rao bound) and the actual variance provided by the given method. The efficiency also depends on the underlying noise distribution. For example, in the presence of Gaussian noise the mean has an asymptotic (large sample) efficiency of 1 (achieving the lower bound) while the median estimator's efficiency is only 2/π = 0.637 (Mosteller & Tukey 1977). The breakdown point of a regression method is the smallest amount of contamination that may force the value of the estimate outside an arbitrary range. For example, the (asymptotic) breakdown point of the mean is 0 since a single large outlier can corrupt the result. The median remains reliable if less than half of the data are contaminated, yielding asymptotically the maximum breakdown point, 0.5.

The time complexity of the least squares method is O(np²) where n is the number of data points and p is the number of parameters to be estimated. Feasibility of the computation requires a time complexity of at most O(n²).

A new, improved regression method should provide:

a. reliability in the presence of various types of noise, i.e., good asymptotic and small sample efficiency;
b. protection against a high percentage of outliers, i.e., a high breakdown point;
c. a time complexity not much greater than that of the least squares method.
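The breakdown point figures quoted above can be made concrete with a minimal sketch (ours, in Python with NumPy; the sample sizes and outlier values are illustrative):

```python
import numpy as np

# A single gross outlier (contamination fraction 1/100) moves the mean
# arbitrarily far, while the median stays near the true location:
# breakdown points of 0 and 0.5, respectively.
rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=1.0, size=99)

for outlier in (1e2, 1e4, 1e6):
    corrupted = np.append(data, outlier)
    print(f"outlier={outlier:.0e}  mean={corrupted.mean():10.2f}  "
          f"median={np.median(corrupted):5.2f}")
```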

Many statistical techniques have been proposed which satisfy some of the above conditions. These techniques are known as robust regression methods. In section 2 a review of robust regression methods is given. In section 3 the least-median-of-squares (LMedS) method is discussed in detail. A comparison of the LMedS method with the RANSAC paradigm in section 4 gives us the opportunity to analyze its behavior for data usually met in computer vision problems. Through a simple application of LMedS, window-operator based smoothing, we compare its performance in section 5 with the class of robust M-estimators. The article is concluded in section 6.

2 Robust Regression Methods

The early attempts to introduce robust regression methods involved straight-line fitting. In one class of methods the data is first partitioned into two or three nearly equal sized parts (i < L; L ≤ i ≤ R; R < i), where i is the index of the data and L = R in the former case. The slope β_1 and the intercept β_0 of the line are found by solving the system of nonlinear equations

    med_{i<L} (y_i − β_0 − β_1 x_i) = med_{i>R} (y_i − β_0 − β_1 x_i)      (1)
    med_{all i} (y_i − β_0 − β_1 x_i) = 0

where med represents the median operator applied to the set defined below it. The breakdown point of the method is 0.5/k, where k is the number of partitions (2 or 3), since the median is used for each part separately. Brown and Mood (1951) investigated the method for k = 2, and Tukey introduced the resistant line procedure for k = 3 (see Johnstone & Velleman 1985).

Another class of methods uses the slopes between each pair of data points without splitting up the data set. Theil (1950) estimated the slope as the median of all n(n − 1)/2 slopes defined by the n data points. The breakdown point of these methods is 0.293 since at least half the slopes should be correct in order to obtain the correct estimate. That is, if ε is the fraction of outliers in the data we must have (1 − ε)² > 0.5. The intercept can be estimated from the input data by employing the traditional regression formula.
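A minimal sketch of Theil's pairwise-slope estimator (ours; the median-of-offsets intercept used here is one common variant, the paper only notes that the traditional regression formula can be used):

```python
import numpy as np
from itertools import combinations

def theil_line(x, y):
    # Slope: median of all n(n-1)/2 pairwise slopes (Theil 1950).
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2) if x[i] != x[j]]
    b1 = np.median(slopes)
    # Intercept: median of the offsets y_i - b1*x_i (a common variant).
    b0 = np.median(y - b1 * x)
    return b0, b1

x = np.arange(20.0)
y = 2.0 + 0.5 * x
y[:5] = 40.0                    # 25% outliers, below the 0.293 bound
print(theil_line(x, y))         # recovers (2.0, 0.5)
```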

The theory of multidimensional robust estimators was developed in the seventies. The basic robust estimators are classified as M-estimators, R-estimators, and L-estimators (Huber 1981).

The M-estimators are the most popular robust regression methods. (See the book by Hampel et al. (1986) for a leading reference.) These estimators minimize the sum of a symmetric, positive-definite function ρ(r_i) of the residuals r_i, with a unique minimum at r_i = 0. (A residual is defined as the difference between the data point and the fitted value.) For the least squares method ρ(r_i) = r_i². Several ρ functions have been proposed which reduce the influence of large residual values on the estimated fit. Huber (1981) employed the squared error for small residuals and the absolute error for large residuals. Andrews (1974) used a squared sine function for small and a constant for large residuals. Beaton and Tukey's (1974) biweight is another example of these ρ functions. The M-estimates of the parameters are obtained by converting the minimization

    min Σ_i ρ(r_i)      (2)

into a weighted least-squares problem which is then handled by available subroutines. The weights depend on the assumed ρ function and the data. The reliability of the initial guess is of importance and the convergence of the solution is not proved for most of the ρ functions. Holland and Welsch (1977) developed algorithms for solving the numerical problems associated with M-estimators. We will return to the subject in section 5.

R-estimators are based on ordering the set of residuals. Jaeckel (1972) proposed obtaining the parameter estimates by solving the minimization problem

    min Σ_i a_n(R_i) r_i      (3)

where r_i is the residual; R_i is the location of the residual in the ordered list, that is, its rank; and a_n is a score function. The score function must be monotonic and satisfy Σ_i a_n(R_i) = 0. The most frequently used score function is that of Wilcoxon: a_n(R_i) = R_i − (n + 1)/2. Since |a_n(R_i)| ≤ (n − 1)/2, the largest residuals caused by outliers cannot have too large a weight. Scale invariance (independence from the variance of the noise) is an important advantage of R-estimators over M-estimators. Cheng and Hettmansperger (1983) presented an iteratively reweighted least squares algorithm for solving the minimization problem associated with R-estimators.

The L-estimators employ linear combinations of order statistics. The median and α-trimmed-mean based methods belong to this class. It is important to notice, however, that the mean (α = 0) is a least squares estimate, while the median can also be regarded as the M-estimate obtained for ρ = |r_i|. Various simulation studies have shown that L-estimators give less satisfactory results than the other two classes (Heiler 1981).

In spite of their robustness for various distributions, the M-, R-, and L-estimators have breakdown points that are less than 1/(p + 1), where p is the number of parameters in the regression (Hampel et al. 1986, p. 329; Li 1985). For example, in planar surface fitting we have p = 3, and the breakdown point is less than 0.25, making the fit sensitive to outliers.
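The conversion of (2) into reweighted least squares can be sketched for the simplest case, a location parameter with Huber's ρ (our illustration; the tuning constant and MAD scale are choices made for the example, and the exact weight expressions appear in section 5):

```python
import numpy as np

def huber_location(x, c=1.345, iters=20):
    # M-estimate of a location parameter t: minimize sum(rho(x_i - t)) with
    # Huber's rho (squared error for small, absolute error for large
    # residuals), via iteratively reweighted least squares.
    scale = 1.4826 * np.median(np.abs(x - np.median(x)))   # robust sigma
    t = np.median(x)                                       # initial guess
    for _ in range(iters):
        r = x - t
        w = np.minimum(1.0, c * scale / np.maximum(np.abs(r), 1e-12))
        t = np.sum(w * x) / np.sum(w)                      # weighted LS step
    return t

data = np.concatenate([np.random.default_rng(1).normal(5.0, 1.0, 90),
                       np.full(10, 50.0)])                 # 10% outliers
print(huber_location(data))     # close to 5, unlike data.mean()
```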
Recently several robust estimators having breakdown points close to 0.5 were proposed. Siegel (1982) introduced the repeated-median (RM) method of solving multidimensional regression problems. Suppose p parameters are to be estimated from n data samples. A parameter is estimated in the following way: First, for each possible p-tuple of samples the value of the parameter is computed, yielding a list of C(n, p) (the binomial coefficient) terms. Then the medians for each of the p indexes characterizing a p-tuple are obtained recursively. When the list has collapsed into one term, the result is the RM estimate of the parameter. Once a parameter has been estimated, the amount of computation can be reduced for the remaining p − 1 parameters.

For example, let p = 3 and suppose that we start by estimating the parameter β_2 of the planar fit

    z = β_0 + β_1 x + β_2 y      (4)

First the values

    β_2(i, j, k) = [(z_i − z_j)(x_j − x_k) − (z_j − z_k)(x_i − x_j)] / [(y_i − y_j)(x_j − x_k) − (y_j − y_k)(x_i − x_j)]      (5)

are computed for all the triplets defined by i ≠ j ≠ k. The estimate is then

    β̂_2 = med_i med_{j(≠i)} med_{k(≠i,j)} β_2(i, j, k)      (6)

The parameter β_1 is estimated next, by applying the same algorithm for p = 2 to the data z_i − β̂_2 y_i. Similarly the value of β̂_0 is obtained by taking the median of the samples z_i − β̂_2 y_i − β̂_1 x_i. The breakdown point of the repeated median method is 0.5 since all the partial median computations are performed over the entire data set. Computation of the median is O(n log n) for practical purposes, and thus the time complexity of the RM method is high, of O(n^p log^p n) order. The Gaussian efficiency of the repeated median method was found experimentally to be only around 0.6 (Siegel 1982). The repeated median is not affine equivariant, that is, a linear transformation of the explanatory variables does affect the estimate (Rousseeuw & Leroy 1987, p. 152).

The least median of squares (LMedS) robust regression method proposed by Rousseeuw (1984) also achieves the 0.5 breakdown point. (We prefer to use the LMedS notation instead of LMS, which is used in the statistical literature, to avoid confusion with the usage in the image processing field where LMS stands for least mean squares.) The relative efficiency of the LMedS method can be improved by combining it with least-squares based techniques. The time complexity can be reduced by a Monte Carlo type speed-up technique. We will return to this estimator in section 3.
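A sketch of the repeated median for the straight-line case (our illustration of the recursion (4)-(6) specialized to p = 2; it assumes all x_i are distinct):

```python
import numpy as np

def repeated_median_line(x, y):
    # Siegel's repeated median for p = 2: the slope is the median over i
    # of the median over j != i of the pairwise slopes; the intercept is
    # then the median of y_i - b1*x_i.
    n = len(x)
    inner = np.empty(n)
    for i in range(n):
        j = np.arange(n) != i
        inner[i] = np.median((y[j] - y[i]) / (x[j] - x[i]))
    b1 = np.median(inner)
    b0 = np.median(y - b1 * x)
    return b0, b1

x = np.arange(21.0)
y = 1.0 + 2.0 * x
y[:7] = 0.0                          # a third of the data corrupted
print(repeated_median_line(x, y))    # recovers (1.0, 2.0)
```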
2.1 Robust Methods in Computer Vision

Without trying to give an exhaustive survey, in this section we enumerate some of the computer vision applications using the robust techniques mentioned above. For the most recent results see the Proceedings of the International Workshop on Robust Computer Vision, Seattle, WA, October 1990.

Median- and trimmed-mean based local operators (L-estimators) have been employed in computer vision for a long time. See for example (Bovik et al. 1987) and (Coyle et al. 1989) for literature reviews on gray-level image smoothing and noise removal. In the median of intercepts line fitting method of Kamgar-Parsi et al. (1989), similar to the (Theil 1950) pairwise median, the intercept and slope are computed for every pair of points and the medians of the resulting lists are the estimates of the two parameters.

Recently M-estimators have also become popular in computer vision. Kashyap and Eom (1988) treated an image as a causal autoregressive model driven by a noise process assumed to be Gaussian with a small percent of the samples (at most 8%) contaminated by impulse noise, that is, outliers. The same technique was used by Koivo and Kim (1989) for classification of surface defects on wood boards. By employing M-estimators the parameters of the autoregressive process were iteratively refined simultaneously with cleaning the outliers in the noisy image. Besl et al. (1988) proposed a hierarchical scheme in which local fits of increasing degrees were obtained by M-estimators. The different fits were compared through a robust-fit quality measure to determine the optimal parameters. Haralick and Joo (1988) applied M-estimators to solve the correspondence problem between two sets of 2D perspective projections of model points in 3D. The correct-pose solution was then obtained with up to 30% of the pairs mismatched. A similar algorithm was used by Lee et al. (1989) for estimating 3D motion parameters.

The least median of squares estimator has also been used to solve computer vision problems. The high breakdown point makes it an optimum nonlinear interpolator (Kim et al. 1989 and Tirumalai & Schunck 1988), and it can be employed for image structure analysis in the piecewise polynomial (facet) domain (Meer et al. 1990). Kumar and Hanson (1989) used least median of squares to solve the pose estimation problem. A variant of LMedS, the minimum-volume ellipsoid estimator, was used to develop a robust feature-space clustering technique which was then applied to decomposition problems, Hough space analysis, and range image segmentation (Jolion et al. 1990a, b).

3 The Least-Median-of-Squares Method

Rousseeuw (1984) proposed the least-median-of-squares (LMedS) method, in which the parameters are estimated by solving the nonlinear minimization problem

    min med_i r_i²      (7)

That is, the estimates must yield the smallest value for the median of the squared residuals computed for the entire data set. An excellent application-oriented reference on the least-median-of-squares technique is the book of Rousseeuw and Leroy (1987), to which the reader is referred for more details on the algorithm described below. The application of LMedS estimators to line fitting (p = 2) was studied in depth by Steele and Steiger (1986) and by Edelsbrunner and Souvaine (1988). They investigated the number of local minima of (7), and gave optimum time-complexity, O(n²), algorithms.

The LMedS minimization problem (7) cannot be reduced to a least-squares based solution, unlike the M-estimators (2). The least-median-of-squares minimization must be solved by a search in the space of possible estimates generated from the data. Since this space is too large, only a randomly chosen subset of points can be analyzed. We will return to this problem later in the section.

Assume that our data consist of n points from which we want to estimate the p regression coefficients β_j, j = 0, 1, ..., (p − 1), of the model

    z_i = Σ_{j=0}^{p−1} β_j x_j(i),  i = 1, 2, ..., n      (8)

The explanatory variables x_j(i) are monomials of different degrees in the lattice coordinates. The model (8) thus represents polynomial surface fitting to the data.

Let a distinct p-tuple of data points be denoted by the indexes i_1, ..., i_p. For this p-tuple the values of p − 1 parameters, all except the intercept β_0(i_1, ..., i_p), can be computed by solving the p linear equations in β_j(i_1, ..., i_p), j = 1, ..., (p − 1),

    z_i = Σ_{j=0}^{p−1} β_j(i_1, ..., i_p) x_j(i),  i = i_1, ..., i_p      (9)

The intercept value β_0(i_1, ..., i_p) is then found by solving the minimization problem

    min med_i r_i²  given β_j(i_1, ..., i_p), j = 1, ..., (p − 1)      (10)

The reduction of a multidimensional regression problem to one dimension is known as the projection pursuit technique, and has numerous applications in statistics (Efron 1988).
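The procedure of (8)-(10) can be sketched for line fitting (p = 2). Each pair of points fixes the slope, the data are projected along the intercept direction, and the intercept is taken as the midpoint of the shortest window covering half the ordered projected values, which anticipates the mode-seeking step detailed next. The exhaustive version below is our simplified illustration:

```python
import numpy as np
from itertools import combinations

def lmeds_line(x, y):
    # For every pair the slope is fixed exactly (9); the data are projected
    # along the intercept direction and the midpoint of the shortest window
    # containing half the ordered projected values gives the intercept (10).
    # The pair yielding the narrowest window is the LMedS fit (7).
    n = len(x)
    h = n // 2                               # half-data window
    best_width, best_fit = np.inf, None
    for i, j in combinations(range(n), 2):
        if x[i] == x[j]:
            continue
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        s = np.sort(y - b1 * x)              # projected sequence, cf. (11)
        widths = s[h:] - s[:-h]
        k = int(np.argmin(widths))
        if widths[k] < best_width:
            best_width = widths[k]
            best_fit = ((s[k] + s[k + h]) / 2.0, b1)
    return best_fit                          # (intercept, slope)

x = np.arange(30.0)
y = 3.0 + 1.0 * x
y[::3] = 100.0                               # a third of the data corrupted
print(lmeds_line(x, y))                      # recovers (3.0, 1.0)
```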
Rousseeuw and Leroy (1987) have discussed the connection between this technique and the LMedS estimators. The goal of the LMedS method is to identify the "outliers" in the data, that is, points severely deviating from the model. If these points can be discriminated in the projection along the direction of the intercept, the above procedure is optimum. We discuss the problem in more detail in section 4.

To solve (10) we must use a mode-estimation technique. The mode of a continuous probability distribution is the location of its maximum. In the case of a discrete, ordered sequence, the mode corresponds to the center of the subinterval having the highest density. Mode-seeking algorithms are well known; see for example Press et al. (1988, p. 462). It can be shown that if the width of the search window is half the data size (i.e., ⌊n/2⌋) the mode minimizes the median of the squared residuals (Rousseeuw & Leroy 1987, p. 169). Thus, if we define the projected sequence

    s_i = z_i − Σ_{j=1}^{p−1} β_j(i_1, ..., i_p) x_j(i)      (11)

compute its mode, and take it as β_0(i_1, ..., i_p), we obtain the solution of (10).

The mode-seeking procedure also returns the width of the half-data-size search window corresponding to the mode location. For n data points, C(n, p) such window sizes are obtained and the smallest one yields the final LMedS estimate for the p regression coefficients of the model, that is, the solution of (7).

The breakdown point of the least median of squares method is 0.5 because all the median computations are over the whole data set. The time-complexity of the method, however, is very high. There are O(n^p) p-tuples and for each of them the sorting takes O(n log n) time. Thus the amount of computation required for the basic LMedS algorithm is O(n^{p+1} log n), prohibitively large. Notice that this complexity is valid only if p ≥ 2, since for p = 1 only sorting is required.

The time complexity is reduced to practical values when a Monte Carlo type speed-up technique is employed in which a Q ≪ 1 probability of error is tolerated. Let ε be the fraction of data contaminated by outliers. Then the probability that all m different p-tuples chosen at random will contain at least one or more outliers is

    P = [1 − (1 − ε)^p]^m      (12)

Note that 1 − P is the probability that at least one p-tuple from the chosen m has only uncorrupted samples and thus the correct parameter values can be recovered. The smallest acceptable value for m is the solution of the equation P = Q, rounded upward to the closest integer, and is independent of n, the size of the data. The amount of computation becomes O(mn log n). This time-complexity reduction is very significant. For example, if p = 3, Q = 0.01, and ε = 0.3, then m = 11 for any n. Thus, when at most 30 percent of the data is contaminated by outliers, by choosing 11 triplets for the computation of the LMedS planar surface fit, the probability of having the whole set of triplets corrupted is 0.01. Rousseeuw and Leroy (1987) recommended taking a larger set of p-tuples than the number obtained from the probabilistic considerations. During the random sampling, identical p-tuples can be avoided without an increase in complexity (Mintz & Amir 1989). Additional speed-up relative to the original LMedS algorithm can be obtained by decomposition in both the spatial and parameter domains (Mintz et al. 1990).

When Gaussian noise is present in addition to outliers, the relative efficiency of the LMedS method is low. Rousseeuw (1984) has shown that the LMedS method converges for large sample sizes as n^{−1/3}, much more slowly than the usual n^{−1/2} for maximum likelihood estimators. To compensate for this deficiency he proposed combining the LMedS method with a weighted least squares procedure which has high Gaussian efficiency. Either one-step weighted least squares or an M-estimator with Hampel's redescending function can be employed, but the latter appears to be less effective (Rousseeuw & Leroy 1987). Simultaneous presence of significant Gaussian noise and numerous outliers decreases the reliability of the LMedS estimates, as is discussed in section 4.
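Solving (12) with P = Q for m is a one-liner (our sketch) that reproduces the m = 11 example:

```python
import math

def subsets_needed(p, eps, Q):
    # Smallest m with [1 - (1 - eps)^p]^m <= Q: equation (12) with P = Q,
    # solved for m and rounded upward.
    return math.ceil(math.log(Q) / math.log(1.0 - (1.0 - eps) ** p))

print(subsets_needed(p=3, eps=0.3, Q=0.01))   # 11, the value quoted above
```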
tuples and for each of them the sorting takes O(n log The robust estimate n) time. Thus the amount of computation required for the basic LMedS algorithm is O(n p÷1 log n), pro- 3= 1"4826 I1 + n 5 plmed~/~ (13) hibitively large. Notice that this complexity is valid only if p _> 2, since for p = 1 only sorting is required. The time complexity is reduced to practical values can be immediately obtained since the median of the when a Monte Carlo type speed-up technique is residual is the value returned by the LMedS procedure employed in which a Q ,~ 1 probability of error is for the final parameter estimates. The factor 1.4826 is tolerated. Let e be the fraction of data contaminated for consistent estimation in the presence of Gaussian by outliers. Then the probability that all m different p- noise, and the term 5/(n - p) is recommended by tuples chosen at random will contain at least one or Rousseeuw and Leroy (1987) as a finite sample more outliers is correction. Based on the robust LMedS model and the standard P = [1 - (1 - e)p]m (12) deviation estimate binary weights can be allocated to the data points: Note that 1 - P is the probability that at least one p- tuple from the chosen m has only uncorrupted samples Iri[ < 2.5 and thus the correct parameter values can be recovered. The smallest acceptable value for m is the solution of w i = f 1 (14) the equation P = Q, rounded upward to the closest in- Iri~I > 2.5 teger, and is independent of n, the size of the data. The 0 amount of computation becomes O(mn log n). This time-complexity reduction is very significant. For ex- The data points having wi = 1 are inliers, that is, they ample, ifp = 3, Q = 0.01, and e = 0.3, then m = belong to the assumed model. Points having w i = 0 11 for any n. Thus, when at most 30 percent of the data are outliers and should not be further taken into con- is contaminated by outliers, by choosing 11 triplets for sideration. The weighted least square estimates are then the computation of the LMedS planar surface fit, the the solutions of the minimization problem probability of having the whole set of triplets corrupted min ~ wir~ (15) is 0.01. Rousseeuw and Leroy (1987) recommended tak- i ing a larger set of p-tuples than the number obtained from the probabilistic considerations. During the ran- and are obtained by using any available programming dom sampling, identical p-tuples can be avoided without package. 64 Meer,Mintz, Rosenfeld and Kim

The presence of noise corrupting all the samples, however, may create severe artifacts. Since this is the case in numerous computer vision applications, in the next section we discuss the problem in detail.

4 Analysis of the LMedS Technique: Comparison with RANSAC

The LMedS estimates are computed by minimizing the cost function (7) through a search in the space of possible solutions. Subsets of the data are chosen by random sampling and for each subset a model is estimated. The number of points in the subset is equal to the number of unknown model parameters, thus yielding closed-form solutions. If the number of points used is increased, in which case the model must be estimated by least squares, no improvement in the final LMedS estimates can be expected (Meer et al. 1990).

The search technique employed to find the LMedS estimates (projection pursuit) is common for several high breakdown point robust estimators (Rousseeuw & Leroy 1987). A similar processing principle was proposed independently by Fischler and Bolles (1981) for solving computer vision problems in the RANSAC paradigm. They too compute a model by solving a system of equations defined for a randomly chosen subset of points. All the data is then classified relative to this model. The points within some error tolerance are called the consensus set of the model. If the cardinality of the consensus set exceeds a threshold, the model is accepted and its parameters recomputed based on the whole consensus set. If the model is not accepted a new set of points is chosen and the resulting model is tested for validity. The error tolerance and the consensus set acceptance threshold must be set a priori.
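For comparison with the LMedS sketches of section 3, a minimal RANSAC loop for line fitting (our illustration of the procedure just described):

```python
import numpy as np

def ransac_line(x, y, n_trials, tol, accept, seed=0):
    # tol (error tolerance) and accept (consensus size threshold) must be
    # chosen a priori, unlike in LMedS where the scale is estimated.
    rng = np.random.default_rng(seed)
    n = len(x)
    for _ in range(n_trials):
        i, j = rng.choice(n, size=2, replace=False)
        if x[i] == x[j]:
            continue
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        consensus = np.abs(y - (b0 + b1 * x)) <= tol
        if consensus.sum() >= accept:
            # model accepted: recompute on the whole consensus set
            A = np.column_stack([np.ones(consensus.sum()), x[consensus]])
            return tuple(np.linalg.lstsq(A, y[consensus], rcond=None)[0])
    return None                      # no model validated
```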
In LMedS, the computed model is allocated an error measure, the median of the squared residuals. Several models are tried and the one yielding the minimum error is retained. The points are classified into inliers and outliers only relative to this final fit. The LMedS model can then be refined by a least squares procedure on the inlier data points. By its definition the LMedS estimator will always return at least 50 percent of the data points as inliers. The error tolerance is set automatically by the estimated robust standard deviation.

The LMedS estimator has a 0.5 breakdown point. Without a priori information about the structure of the data, in RANSAC the size of an acceptable consensus set also cannot be less than half the data size. For example, suppose that a constant must be estimated from data having only two values. Then, if the cardinality threshold on the consensus set is less than half the data size, the RANSAC procedure may validate either of the two constants.

To recover a model representing the relative and not absolute majority in the data, additional procedures should be implemented. Bolles and Fischler (1981) proposed the use of nonparametric tests for detecting the "white" structure of the residuals. A decomposition technique applied in both the spatial and parameter domains (Mintz et al. 1990) also succeeds in returning reliable estimates when less than half the data carry the model.

The maximum number of subsets required to validate a model for a tolerated probability of error is established by the same relation (12) in both LMedS and RANSAC. In LMedS, the algorithm will process that many p-tuples; in RANSAC, consensus may be found earlier. However, if a tolerated residual error threshold is used in LMedS, a stopping criterion similar to that of RANSAC is obtained. Using statistical terminology, RANSAC maximizes the number of inliers while seeking the best model, and LMedS minimizes a robust error measure of that model. The two criteria are not mathematically equivalent, but since in the general case at least half of the data should become inliers, they yield very similar results.

We conclude that LMedS and RANSAC, in spite of being independently developed in different research areas, are based on similar concepts. The only difference of significance is that the LMedS technique generates the error measure during the estimation procedure, while RANSAC must be supplied with it. This is an important feature when the noise is not homogeneous, that is, the model is selectively corrupted.

The robustness of both LMedS and RANSAC in the presence of severe deviations from the sought model is contingent upon the existence of at least one subset of the data carrying the correct model. When noise corrupts all the data (e.g., Gaussian noise) the quality of the initial model estimates degrades and could lead to incorrect decisions. The problem is emphasized when the model contains more parameters to represent the data than is necessary. That is, the order of the model is incorrect.

Consider again the two-valued data case, specifically a step edge of amplitude h, the difference between the two values. Assume the data is evenly split, that is, ε is close to 0.5. All the data is corrupted by a significant, zero-mean noise process having values with significant probability in (−a, a), with a close to h/2. Let the RANSAC error threshold be a, and recall that this value must be chosen a priori. A first-order model is employed (p = 3) and thus planar fits are sought. The noise makes the models computed from triplets of data points unreliable. A model representing a tilted plane (similar to what the nonrobust least squares procedure would recover from all the data) already yields an absolute majority of points within the error tolerance, and is accepted by RANSAC. Reducing the error threshold decreases the expected number of points within the bounds, and thus a satisfactory consensus set size may not be obtained for the correct model. With high probability RANSAC will return an incorrect decision.

The LMedS procedure will also err. Most of the squared residuals relative to a tilted plane are between (0, a²) and thus the median of the squared residuals is much less than a².
Relative to the correct, horizontal plane slightly more than half of the squared residuals are between (0, a²), with the median being close to a². Thus the LMedS algorithm too will prefer the tilted plane to the correct solution. For experimental data on this artifact of the LMedS estimators, and how it can be eliminated by a two-stage procedure, see Meer et al. (1990) and Mintz et al. (1990).

The robustness of the RANSAC paradigm, and the high breakdown point of the LMedS estimators, are thus meaningful only in situations when at least half of the data lie close to the desired model. For LMedS, in this case projection in the β_0 direction results in a well-defined mode and there is no need for a complete projection pursuit procedure in which the optimum projection direction is also sought.

When LMedS local operators are applied to model discontinuities the number of tolerated outliers decreases. Consider again the noiseless, ideal step edge to which a 5×5 robust local operator is applied. Assume that 10 pixels in the window belong to the edge (high amplitude) and 15 to the background (low amplitude) and that the center of the window falls on a background pixel. The operator returns the value of the majority of pixels, that is, the low amplitude of the background.

The image is then corrupted with a fraction ε = 0.2 of asymmetric noise driving the corrupted samples into saturation at the upper bound. Without loss of generality we can assume that 3 of the pixels belonging to the background were corrupted in the processing window. There are now 13 pixels with high amplitudes and the operator returns, incorrectly, a high value similar to the amplitude of the edge. Thus, even when the fraction of corrupted points is much smaller than the theoretical breakdown point of a robust estimator, the operator may systematically fail near transitions between homogeneous regions in images. At transitions, samples of one region are outliers (noise) when fitting a model to the other region, and a small fraction of additional noise may reverse the class having the majority. The size of the local operator also limits ε, the maximum amount of tolerated contamination, but this artifact has less importance for images where the window operators have relatively large supports.

It should not be concluded from this section that high breakdown point procedures are not useful for computer vision. We have emphasized their sensitivity to piecewise models, that is, to data containing discontinuities, in order to increase awareness of possible pitfalls. In section 2.1 several successful applications of the LMedS estimators were mentioned. Recently we have proposed a new technique for noisy-image analysis in the piecewise polynomial domain combining simple nonrobust and robust processing modules. The method has a high breakdown point but avoids the problems discussed above. In the next section we use a simple application of the LMedS estimators, window-operator-based smoothing, to compare its performance with M-estimators.

5 Image Smoothing: LMedS vs. M-estimators

Signals unchanged by the application of an operator are called root signals of that operator. The existence of root signals assures the convergence of iterative nonlinear filtering procedures (Fitch et al. 1985). Their importance is also recognized in computer vision (e.g., Haralick & Watson 1981; Owens et al. 1989). In one dimension, a noiseless piecewise polynomial signal is a root signal of an LMedS-based window operator of degree equal to the highest polynomial degree. This property is in fact true for any 0.5 breakdown point estimator. Indeed, since the input is an ideal piecewise polynomial signal, a contiguous group of ⌈n/2⌉ pixels always carries information about the same polynomial. The 0.5 breakdown point assures that the estimated regression coefficients are those of this polynomial. The value returned by the operator is the value of the polynomial at the window center and is identical with the input. The noiseless one-dimensional piecewise polynomial signal remains undistorted. This root-signal property is not necessarily valid in two dimensions.
When the window is centered on a corner, it is not always the case that 51% of the pixels belong to the surface containing the window center. Before proceeding to compare root-signal properties of the M-estimators and the LMedS estimator we give a short description of the former.

M-estimators have already been introduced in section 2. The minimization problem (2) is reduced to reweighted least squares by differentiating (2) with respect to the sought regression coefficients. After some simple manipulations, the following expression for the weights is obtained:

    w_i = (1/r_i) dρ(r_i)/dr_i      (16)

where r_i is the residual of the ith data point. Since the ρ functions have a minimum at r_i = 0, expression (16) can always yield w_i = 1 at the origin. All the weights are positive and between 0 and 1. Different ρ functions result in different weights. We consider here two such functions.

The first robust M-estimator was proposed by Huber in 1964 (see Huber 1981). It is known as the minimax M-estimator and has least squares behavior for small residuals, and the more robust least-absolute-values behavior for large residuals. The change in behavior is controlled by a tuning constant c_H and the locally estimated noise standard deviation

    σ̂ = 1.4826 med_i |r_i − med_j r_j|      (17)

where med denotes the median taken over the entire window, and the factor 1.4826 compensates for the bias of the median estimator in Gaussian noise. Note that (17) is a robust estimator of σ and can tolerate up to half the data being corrupted. The minimax weight function has the expression

    w_{H,i} = 1 if |r_i| ≤ c_H σ̂,  w_{H,i} = c_H σ̂ / |r_i| if |r_i| > c_H σ̂      (18)

and is shown in figure 1b together with its ρ function (figure 1a). Note that the constant weight corresponds to the central (least squares) region, and in the least-absolute-values region the large residuals have hyperbolically decreasing weights. To achieve 95% asymptotic efficiency for Gaussian noise, Holland and Welsch (1977) recommended c_H = 1.345.

In the class of redescending M-estimators, the large residuals have zero weights, thus improving the outlier rejection properties. The biweight

    w_{B,i} = [1 − (r_i / c_B σ̂)²]² if |r_i| ≤ c_B σ̂,  w_{B,i} = 0 if |r_i| > c_B σ̂      (19)

proposed by Beaton and Tukey (1974) is a well-known example of this class. The weight function is shown in figure 2b and the ρ function it was derived from in figure 2a. Holland and Welsch (1977) recommended c_B = 4.685 to assure superior performance for Gaussian noise. This tuning constant value was obtained from a Monte Carlo study employing homogeneous data (i.e., from one model). In computer vision we are often dealing with piecewise models (i.e., data with discontinuities) and smaller tuning constants should be used to discriminate the transitions. In our experiments we took c_B = 2.

The weights cannot be computed without a standard deviation estimate, which in turn requires the estimation of an initial fit. The quality of the initial fit is of paramount importance, especially for redescending M-estimators. The use of robust estimators like least absolute deviations or LMedS is recommended (Hampel et al. 1986). We are interested in the outlier rejection capability of the M-estimators, since this property assures the undistorted recovery of the input. Therefore, to facilitate the comparison between M-estimators and LMedS, we employ an unweighted least squares procedure (all w_i = 1) to obtain the initial M-estimates. The residuals relative to this fit can be computed and the standard deviation estimate obtained. Each data point is then allocated a weight and in the next iteration the new set of parameters is obtained by weighted least squares. The standard deviation estimate σ̂ should not be updated during the iterations (Holland & Welsch 1977).
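The scale estimate (17) and the two weight functions (18) and (19) translate directly into code (our sketch; the small guards against division by zero are additions for numerical safety):

```python
import numpy as np

def mad_scale(r):
    # Equation (17): robust sigma from the median absolute deviation.
    return 1.4826 * np.median(np.abs(r - np.median(r)))

def huber_weights(r, sigma, c_h=1.345):
    # Equation (18): 1 in the least squares region, hyperbolically
    # decreasing beyond c_H * sigma.
    return np.minimum(1.0, c_h * sigma / np.maximum(np.abs(r), 1e-12))

def biweight_weights(r, sigma, c_b=4.685):
    # Equation (19): redescending, exactly zero beyond c_B * sigma.
    # (c_b = 2 is the value used for the piecewise data in this section.)
    u = r / max(c_b * sigma, 1e-12)
    return np.where(np.abs(u) <= 1.0, (1.0 - u ** 2) ** 2, 0.0)
```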
At consecutive iterations, samples yielding large residuals (outliers) have their weights reduced, and the estimates are computed mostly from values distributed around the true surface. Often for the latter samples a Gaussian distribution is assumed.
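The complete iteration just described can be sketched for the planar fits used below (our code; it assumes the weight functions from the previous sketch, uses only a fixed iteration cap, and assumes the weights do not all vanish; the convergence test (21) is added later in this section):

```python
import numpy as np

def irls_plane(XY, z, weight_fn, iters=25):
    # Planar fit z = b0 + b1*x + b2*y. Initial unweighted least squares
    # fit (all w_i = 1), a fixed robust scale (17), then repeated weighted
    # least squares steps. XY is an (n, 2) array of pixel coordinates;
    # weight_fn is, e.g., huber_weights or biweight_weights.
    A = np.column_stack([np.ones(len(z)), XY])
    beta = np.linalg.lstsq(A, z, rcond=None)[0]
    r0 = z - A @ beta
    sigma = max(1.4826 * np.median(np.abs(r0 - np.median(r0))), 1e-12)
    for _ in range(iters):                           # sigma is kept fixed
        w = weight_fn(z - A @ beta, sigma)
        Aw = A * w[:, None]
        beta = np.linalg.solve(A.T @ Aw, Aw.T @ z)   # weighted LS step
    return beta                                      # beta[0]: intercept
```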

Fig. 1. The Huber minimax M-estimator. (a) The ρ function. (b) The weight function.

Fig. 2. The biweight M-estimator. (a) The ρ function. (b) The weight function.

The following quality measure was used by us as a stopping criterion for the iterations:

    E^{(l)} = Σ_{i=1}^{n} w_i (r_i^{(l)})²      (20)

where the index l denotes the current iteration. Convergence is achieved if

    |E^{(l)} − E^{(l−1)}| < 0.001,  l = 1, 2, ...      (21)

The number of allowed iterations was bounded to 25.

A 100×100 synthetic image containing several polyhedral objects was used as an example. Only planar surfaces are present, joined at either step or roof edges. In figure 3a the gray-level and in figure 3b the wireframe representation of the image is shown. The image was smoothed with 5×5 window operators by replacing the value of the center pixel with the intercept of the estimated plane. Note that as an artifact of this simple smoothing procedure a systematic error is introduced at the corners, where the center pixel does not belong to the surface containing the majority.

We have used least squares (figure 4a), the minimax M-estimator (figure 4b), the biweight M-estimator (figure 4c), and the LMedS estimator (figure 4d) in the window. The M-estimators did not converge for less than one percent of the pixels. For each method the smoothing error, defined as the square root of the sum of the squared differences between the input and output values of the pixels, was computed. (See table 1.)

Table 1. Root mean square smoothing errors.

    Method:  Least squares   Minimax   Biweight   LMedS
    Error:        944           922       855       392

As is well known, the least squares fit fails at discontinuities and significant blurring of the input occurs. This fit is also the initial estimate for the M-estimators. It yields an incorrect, high standard deviation estimate and the iterative procedure cannot completely recover the input. The two M-estimators also blur edges, although to a lesser extent. The minimax M-estimator has nonzero weights for all the pixels and thus introduces slightly more blurring at edges than the biweight estimator. All the errors made by the LMedS estimator are due to the artifact at the corners. This artifact can be eliminated by cooperative processes (Meer et al. 1990; Mintz et al. 1990).

Thus due to its high breakdown point only the LMedS window operator succeeds in preserving a piecewise polynomial image. This property is useful when the operator is used as a nonlinear interpolator. We have compared in this section the raw performances of the estimators. Their properties can be improved by using them as building blocks in more complex algorithms. The smoothing performance of the M-estimators is corrected when several weight functions are employed at successive iterations and the model order is sequentially increased under the control of a quality measure (Besl et al. 1988). Similarly, better tolerance of the LMedS smoothing operators to noise corrupting all the samples can be achieved (Mintz et al. 1990).


Fig. 3. A synthetic 100×100 image. (a) Gray-level representation. (b) Wireframe representation.

Fig. 4. Window smoothing of the image in figure 3. (a) Least squares operator. (b) Minimax M-estimator. (c) Biweight M-estimator. (d) LMedS estimator.

6 Conclusion

We have presented a review of the different robust regression methods with special emphasis on the least median of squares estimators. We did not present applications since most of them involve technical details beyond the goal of this article. A literature survey of computer vision algorithms employing robust techniques was given in section 2.1.

The desirable high breakdown point of the LMedS estimators is contingent upon the sought model being represented by weakly corrupted data. Using a probabilistic speed-up technique, the computation of LMedS estimates is feasible, although more demanding than that of M-estimates. The latter, however, requires a reliable initial estimate, possibly the output of an LMedS algorithm. We conclude that application of robust estimators to real computer vision problems, with noisy data not perfectly representing the assumed model, should be preceded by a careful analysis of the problem. If a clear dichotomy exists between the "good" and "bad" data (as is usually the case in statistics) methods like LMedS are extremely useful. If good and bad data cannot be discriminated, additional precautions should be taken.

Acknowledgments

We would like to thank the anonymous reviewers whose observations substantially improved the presentation of the paper. The support of the Defense Advanced Research Projects Agency and U.S. Army Night Vision and Electro-Optics Laboratory under Grant DAAB07-86-K-F073, and of the National Science Foundation under Grant DCR-86-03723, is gratefully acknowledged.

References

Andrews, D.F. 1974. A robust method for multiple linear regression. Technometrics 16:523-531.
Beaton, A.E., and Tukey, J.W. 1974. The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics 16:147-185.
Besl, P.J., Birch, J.B., and Watson, L.T. 1988. Robust window operators. Proc. 2nd Intern. Conf. Comput. Vision, Tampa, FL, pp. 591-600. See also Mach. Vision Appl. 2:179-214, 1989.
Bolles, R.C., and Fischler, M.A. 1981. A RANSAC-based approach to model fitting and its application to finding cylinders in range data. Proc. 7th Intern. Joint Conf. Artif. Intell., Vancouver, Canada, pp. 637-643.
Bovik, A.C., Huang, T.S., and Munson, Jr., D.C. 1987. The effect of median filtering on edge estimation and detection. IEEE Trans. Patt. Anal. Mach. Intell. PAMI-9:181-194.
Brown, G.W., and Mood, A.M. 1951. On median tests for linear hypotheses. Proc. 2nd Berkeley Symp. Math. Stat. Prob., J. Neyman (ed.), University of California Press: Berkeley and Los Angeles, pp. 159-166.
Cheng, K.S., and Hettmansperger, T.P. 1983. Weighted least-squares rank estimates. Commun. Stat. A12:1069-1086.
Coyle, E.J., Lin, J.H., and Gabbouj, M. 1989. Optimal stack filtering and the estimation and structural approaches to image processing. IEEE Trans. Acoust., Speech, Sig. Process. 37:2037-2066.
Edelsbrunner, H., and Souvaine, D.L. 1988. Computing median-of-squares regression lines and guided topological sweep. Report UIUCDCS-R-88-1483, Department of Computer Science, University of Illinois at Urbana-Champaign.
Efron, B. 1988. Computer-intensive methods in statistical regression. SIAM Review 30:421-449.
Fischler, M.A., and Bolles, R.C. 1981. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24:381-395.
Fitch, J.P., Coyle, E.J., and Gallagher, Jr., N.C. 1985. Root properties and convergence rates of median filters. IEEE Trans. Acoust., Speech, Sig. Process. 33:230-240.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., and Stahel, W.A. 1986. Robust Statistics: An Approach Based on Influence Functions. Wiley: New York.
Haralick, R.M., and Joo, H. 1988. 2D-3D pose estimation. Proc. 9th Intern. Conf. Patt. Recog., Rome, pp. 385-391. See also IEEE Trans. Systems, Man, Cybern. 19:1426-1446, 1989.
Haralick, R.M., and Watson, L. 1981. A facet model for image data. Comput. Graph. Image Process. 15:113-129.
Heiler, S. 1981. Robust estimates in linear regression--A simulation approach. In H. Büning and P. Naeve (eds.), Computational Statistics. De Gruyter: Berlin, pp. 115-136.
Holland, P.W., and Welsch, R.E. 1977. Robust regression using iteratively reweighted least squares. Commun. Stat. A6:813-828.
Huber, P.J. 1981. Robust Statistics. Wiley: New York.
Jaeckel, L.A. 1972. Estimating regression coefficients by minimizing the dispersion of residuals. Ann. Math. Stat. 43:1449-1458.
Johnstone, I.M., and Velleman, P.F. 1985. The resistant line and related regression methods. J. Amer. Stat. Assoc. 80:1041-1059.
Jolion, J.M., Meer, P., and Rosenfeld, A. 1990a. Generalized minimum volume ellipsoid clustering with applications in computer vision. Proc. Intern. Workshop Robust Comput. Vision, Seattle, WA, pp. 339-351.
Jolion, J.M., Meer, P., and Bataouche, S. 1990b. Range image segmentation by robust clustering. CAR-TR-500, Computer Vision Laboratory, University of Maryland, College Park.
Kamgar-Parsi, B., Kamgar-Parsi, B., and Netanyahu, N.S. 1989. A nonparametric method for fitting a straight line to a noisy image. IEEE Trans. Patt. Anal. Mach. Intell. PAMI-11:998-1001.
Kashyap, R.L., and Eom, K.B. 1988. Robust image models and their applications. In Advances in Electronics and Electron Physics, vol. 70, Academic Press: San Diego, CA, pp. 79-157.
Kim, D.Y., Kim, J.J., Meer, P., Mintz, D., and Rosenfeld, A. 1989. Robust computer vision: A least median of squares based approach. Proc. DARPA Image Understanding Workshop, Palo Alto, CA, pp. 1117-1134.
Koivo, A.J., and Kim, C.W. 1989. Robust image modeling for classification of surface defects on wood boards. IEEE Trans. Syst., Man, Cybern. 19:1659-1666.
Kumar, R., and Hanson, A.R. 1989. Robust estimation of camera location and orientation from noisy data having outliers. Proc. Workshop on Interpretation of 3D Scenes, Austin, TX, pp. 52-60.
Lee, C.N., Haralick, R.M., and Zhuang, X. 1989. Recovering 3-D motion parameters from image sequences with gross errors. Proc. Workshop on Visual Motion, Irvine, CA, pp. 46-53.
Li, G. 1985. Robust regression. In D.C. Hoaglin, F. Mosteller, and J.W. Tukey (eds.), Exploring Data Tables, Trends and Shapes. Wiley: New York, pp. 281-343.
Meer, P., Mintz, D., and Rosenfeld, A. 1990. Least median of squares based robust analysis of image structure. Proc. DARPA Image Understanding Workshop, Pittsburgh, PA, pp. 231-254.
Mintz, D., and Amir, A. 1989. Generalized random sampling. CAR-TR-441, Computer Vision Laboratory, University of Maryland, College Park.
Mintz, D., Meer, P., and Rosenfeld, A. 1990. A fast, high breakdown point robust estimator for computer vision applications. Proc. DARPA Image Understanding Workshop, Pittsburgh, PA, pp. 255-258.
Mosteller, F., and Tukey, J.W. 1977. Data Analysis and Regression. Addison-Wesley: Reading, MA.
Owens, R., Venkatesh, S., and Ross, J. 1989. Edge detection is a projection. Patt. Recog. Lett. 9:233-244.
Press, W.H., Flannery, B.P., Teukolsky, S.A., and Vetterling, W.T. 1988. Numerical Recipes. Cambridge University Press: Cambridge, England.
Rousseeuw, P.J. 1984. Least median of squares regression. J. Amer. Stat. Assoc. 79:871-880.
Rousseeuw, P.J., and Leroy, A.M. 1987. Robust Regression & Outlier Detection. Wiley: New York.
Siegel, A.F. 1982. Robust regression using repeated medians. Biometrika 69:242-244.
Steele, J.M., and Steiger, W.L. 1986. Algorithms and complexity for least median of squares regression. Discrete Appl. Math. 14:93-100.
Theil, H. 1950. A rank-invariant method of linear and polynomial regression analysis (parts 1-3). Ned. Akad. Wetensch. Proc. Ser. A 53:386-392, 521-525, 1397-1412.
Tirumalai, A., and Schunck, B.G. 1988. Robust surface approximation using least median squares regression. CSE-TR-13-89, Artificial Intelligence Laboratory, University of Michigan, Ann Arbor.