A SHORT COURSE ON ROBUST STATISTICS
David E. Tyler Rutgers The State University of New Jersey
Web-site: www.rci.rutgers.edu/~dtyler/ShortCourse.pdf

References
• Huber, P.J. (1981). Robust Statistics. Wiley, New York.
• Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. and Stahel, W.A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.
• Maronna, R.A., Martin, R.D. and Yohai, V.J. (2006). Robust Statistics: Theory and Methods. Wiley, New York.

PART 1 CONCEPTS AND BASIC METHODS

MOTIVATION
• Data Set: X1,X2,...,Xn
• Parametric Model: F(x_1, ..., x_n | θ), where θ is an unknown parameter and F is a known function.
• e.g. X_1, X_2, ..., X_n i.i.d. Normal(µ, σ²)

Q: Is it realistic to believe we don't know (µ, σ²), but we know, e.g., the shape of the tails of the distribution?
A: The model is assumed to be approximately true, e.g. symmetric and unimodal (past experience).
Q: Are statistical methods which are good under the model reasonably good if the model is only approximately true?

ROBUST STATISTICS formally addresses this issue.

CLASSIC EXAMPLE: MEAN vs. MEDIAN
• Symmetric distributions: population mean µ = population median
• Sample mean: X̄ ≈ Normal(µ, σ²/n)
• Sample median: Median ≈ Normal(µ, 1/(4 n f(µ)²))
• At the normal: Median ≈ Normal(µ, (π/2)(σ²/n))
• Asymptotic Relative Efficiency of the Median to the Mean:
  ARE(Median, X̄) = avar(X̄)/avar(Median) = 2/π = 0.6366

CAUCHY DISTRIBUTION

X ∼ Cauchy(µ, σ²)
f(x; µ, σ) = (1/(πσ)) [1 + ((x − µ)/σ)²]⁻¹
• Mean: X̄ ∼ Cauchy(µ, σ²)
• Median ≈ Normal(µ, π²σ²/(4n))
• ARE(Median, X̄) = ∞, or ARE(X̄, Median) = 0
• For t on ν degrees of freedom:
  ARE(Median, X̄) = [4/((ν − 2)π)] [Γ((ν + 1)/2)/Γ(ν/2)]²
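The t-distribution ARE formula above is easy to check numerically. A minimal sketch (the function name is mine), using only the Python standard library:

```python
from math import gamma, pi

def are_median_vs_mean(nu):
    """ARE(Median, Mean) at the t distribution on nu > 2 degrees of freedom:
    4 / ((nu - 2) * pi) * (Gamma((nu + 1)/2) / Gamma(nu/2)) ** 2."""
    return 4.0 / ((nu - 2) * pi) * (gamma((nu + 1) / 2) / gamma(nu / 2)) ** 2

for nu in (3, 4, 5):
    print(nu, round(are_median_vs_mean(nu), 3))
```

The printed values reproduce the table below up to rounding: the median beats the mean for heavy-tailed t distributions and loses only once the tails become light enough.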
ν:               ≤ 2    3      4      5
ARE(Median, X̄):  ∞      1.621  1.125  0.960
ARE(X̄, Median):  0      0.617  0.888  1.041

MIXTURE OF NORMALS

Theory of errors: the Central Limit Theorem gives plausibility to normality.

X ∼ Normal(µ, σ²) with probability 1 − ε
X ∼ Normal(µ, (3σ)²) with probability ε

i.e. not all measurements are equally precise.
For ε = 0.10:
X ∼ (1 − ε) Normal(µ, σ²) + ε Normal(µ, (3σ)²)
• Classic paper: Tukey (1960), A survey of sampling from contaminated distributions.
  ◦ For ε > 0.10 ⇒ ARE(Median, X̄) > 1.
  ◦ The mean absolute deviation is more efficient than the sample standard deviation for ε > 0.01.

PRINCETON ROBUSTNESS STUDIES

Andrews et al. (1972). Other estimates of location.
• α-trimmed mean: Trim a proportion of α from both ends of the data set and then take the mean. (Throwing away data?)
• α-Winsorized mean: Replace a proportion of α from both ends of the data set by the next closest observation and then take the mean.
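Both estimates can be sketched in a few lines (function names are mine; NumPy assumed):

```python
import numpy as np

def trimmed_mean(x, alpha):
    """alpha-trimmed mean: drop a proportion alpha from each end, average the rest."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(np.floor(alpha * len(x)))
    return x[k:len(x) - k].mean()

def winsorized_mean(x, alpha):
    """alpha-Winsorized mean: replace a proportion alpha at each end
    by the nearest retained observation, then average."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(np.floor(alpha * len(x)))
    x[:k] = x[k]
    x[len(x) - k:] = x[len(x) - k - 1]
    return x.mean()

data = [2, 4, 5, 10, 200]
print(np.mean(data))               # 44.2
print(np.median(data))             # 5.0
print(trimmed_mean(data, 0.2))     # 6.33...
print(winsorized_mean(data, 0.2))  # 6.6
```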
• Example: 2, 4, 5, 10, 200
  Mean = 44.2
  Median = 5
  20% trimmed mean = (4 + 5 + 10)/3 = 6.33
  20% Winsorized mean = (4 + 4 + 5 + 10 + 10)/5 = 6.6

Measuring the robustness of a statistic
• Relative Efficiency over a range of distributional models. – There exist estimates of location which are asymptotically most efficient for the center of any symmetric distribution. (Adaptive estimation, semi-parametrics). – Robust?
• Influence Function over a range of distributional models.
• Maximum Bias Function and the Breakdown Point.

Measuring the effect of an outlier (not modeled)
• Good Data Set: x1, . . . , xn−1
Statistic: Tn−1 = T (x1, . . . , xn−1)
• Contaminated Data Set: x1, . . . , xn−1, x
Contaminated Value: Tn = T (x1, . . . , xn−1, x)
THE SENSITIVITY CURVE (Tukey, 1970)
SC_n(x) = n(T_n − T_{n−1})  ⇔  T_n = T_{n−1} + (1/n) SC_n(x)
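The sensitivity curve can be computed directly; a small simulated illustration (the helper name is mine; NumPy assumed) contrasting the mean and the median:

```python
import numpy as np

def sensitivity_curve(stat, good_data, x):
    """SC_n(x) = n * (T(x_1,...,x_{n-1}, x) - T(x_1,...,x_{n-1}))."""
    n = len(good_data) + 1
    return n * (stat(np.append(good_data, x)) - stat(good_data))

rng = np.random.default_rng(0)
good = rng.normal(size=19)

# The mean's SC grows linearly in x; the median's SC is bounded
# (constant once x passes the largest good observation).
for x in (0.0, 10.0, 1000.0):
    print(x, sensitivity_curve(np.mean, good, x), sensitivity_curve(np.median, good, x))
```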
THE INFLUENCE FUNCTION (Hampel, 1969, 1974)
Population version of the sensitivity curve.
• Statistic T_n = T(F_n) estimates T(F).
• Consider F and the ε-contaminated distribution
  F_ε = (1 − ε)F + ε δ_x,
  where δ_x is the point mass distribution at x.
• Compare functional values: T(F) vs. T(F_ε)
• Given Qualitative Robustness (Continuity):
  T(F_ε) → T(F) as ε → 0
  (e.g. the mode is not qualitatively robust)
• Influence Function (infinitesimal perturbation: Gâteaux derivative)
IF(x; T, F) = lim_{ε→0} [T(F_ε) − T(F)] / ε = (∂/∂ε) T(F_ε) |_{ε=0}

EXAMPLES
• Mean: T(F) = E_F[X].
  T(F_ε) = E_{F_ε}[X] = (1 − ε) E_F[X] + ε E_{δ_x}[X] = (1 − ε) T(F) + ε x
  IF(x; T, F) = lim_{ε→0} [(1 − ε) T(F) + ε x − T(F)] / ε = x − T(F)
• Median: T(F) = F⁻¹(1/2)
  IF(x; T, F) = {2 f(T(F))}⁻¹ sign(x − T(F))

Plots of Influence Functions

Gives insight into the behavior of a statistic:
  T_n ≈ T_{n−1} + (1/n) IF(x; T, F)
[Influence function plots: Mean; Median; α-trimmed mean; α-Winsorized mean (somewhat unexpected?)]

Desirable robustness properties for the influence function

• SMALL
  – Gross Error Sensitivity: GES(T; F) = sup_x |IF(x; T, F)|
    GES < ∞ ⇒ B-robust (Bias-robust)
  – Asymptotic Variance: AV(T; F) = E_F[IF(X; T, F)²]
    (Note: E_F[IF(X; T, F)] = 0.)
• Under general conditions, e.g. Fréchet differentiability,
  √n (T(X_1, ..., X_n) − T(F)) → Normal(0, AV(T; F))
• Trade-off at the normal model: smaller AV ↔ larger GES
• SMOOTH (local shift sensitivity): protects e.g. against rounding error.
• REDESCENDING to 0.

REDESCENDING INFLUENCE FUNCTION
• Example: Data set of male heights in cm
  180, 175, 192, ..., 185, 2020, 190, ...
• Redescender = Automatic Outlier Detector

CLASSES OF ESTIMATES

L-statistics: linear combinations of order statistics
• Let X_(1) ≤ ... ≤ X_(n) represent the order statistics.
  T(X_1, ..., X_n) = Σ_{i=1}^n a_{i,n} X_(i),
  where the a_{i,n} are constants.
• Examples:
  ◦ Mean: a_{i,n} = 1/n.
  ◦ Median: for n odd, a_{i,n} = 1 if i = (n+1)/2 and 0 otherwise;
    for n even, a_{i,n} = 1/2 if i = n/2 or n/2 + 1 and 0 otherwise.
  ◦ α-trimmed mean.
  ◦ α-Winsorized mean.
• A general form for the influence function exists.
• Can obtain any desirable monotonic shape, but not redescending.
• Do not readily generalize to other settings.

M-ESTIMATES
Huber (1964, 1967). Maximum likelihood type estimates under non-standard conditions.
• One-Parameter Case. X1,...,Xn i.i.d. f(x; θ), θ ∈ Θ
• Maximum likelihood estimates
  – Likelihood function: L(θ | x_1, ..., x_n) = Π_{i=1}^n f(x_i; θ)
  – Minimize the negative log-likelihood:
    min_θ Σ_{i=1}^n ρ(x_i; θ), where ρ(x_i; θ) = − log f(x_i; θ).
  – Solve the likelihood equations:
    Σ_{i=1}^n ψ(x_i; θ) = 0, where ψ(x_i; θ) = ∂ρ(x_i; θ)/∂θ.

DEFINITIONS OF M-ESTIMATES
• Objective function approach: θ̂ = arg min_θ Σ_{i=1}^n ρ(x_i; θ)
• M-estimating equation approach: Σ_{i=1}^n ψ(x_i; θ̂) = 0.
  – Note: unique solution when ψ(x; θ) is strictly monotone in θ.
• Basic examples.
  – Mean. MLE for the Normal: f(x) = (2π)^{−1/2} e^{−(x−θ)²/2} for x ∈ ℝ.
    ρ(x; θ) = (x − θ)²  or  ψ(x; θ) = x − θ
  – Median. MLE for the Double Exponential: f(x) = (1/2) e^{−|x−θ|} for x ∈ ℝ.
    ρ(x; θ) = |x − θ|  or  ψ(x; θ) = sign(x − θ)
• ρ and ψ need not be related to any density, or to each other.
• Estimates can be evaluated under various distributions.

M-ESTIMATES OF LOCATION

A symmetric and translation equivariant M-estimate.

Translation equivariance
X_i → X_i + a ⇒ T_n → T_n + a, which gives ρ(x; t) = ρ(x − t) and ψ(x; t) = ψ(x − t)
Symmetric
X_i → −X_i ⇒ T_n → −T_n, which gives ρ(−r) = ρ(r) or ψ(−r) = −ψ(r).
Alternative derivation: a generalization of the MLE for the center of symmetry, for a given family of symmetric distributions.
f(x; θ) = g(|x − θ|) INFLUENCE FUNCTION OF M-ESTIMATES
M-FUNCTIONAL: T(F) is the solution to E_F[ψ(X; T(F))] = 0.

IF(x; T, F) = c(T, F) ψ(x; T(F)),

where c(T, F) = −1 / E_F[∂ψ(X; θ)/∂θ], evaluated at θ = T(F).
Note: E_F[IF(X; T, F)] = 0.

One can decide what shape is desired for the influence function and then construct an appropriate M-estimate.
[Sketches of ψ shapes: linear ⇒ Mean; sign function ⇒ Median]

EXAMPLE
Choose
  ψ(r) = c for r ≥ c,  r for |r| < c,  −c for r ≤ −c,
where c is a tuning constant: Huber's M-estimate.
An adaptively trimmed mean, i.e. the proportion trimmed depends upon the data.

DERIVATION OF THE INFLUENCE FUNCTION FOR M-ESTIMATES (Sketch)
Let T_ε = T(F_ε), so that
  0 = E_{F_ε}[ψ(X; T_ε)] = (1 − ε) E_F[ψ(X; T_ε)] + ε ψ(x; T_ε).
Taking the derivative with respect to ε ⇒
  0 = −E_F[ψ(X; T_ε)] + (1 − ε) (∂/∂ε) E_F[ψ(X; T_ε)] + ψ(x; T_ε) + ε (∂/∂ε) ψ(x; T_ε).
Let ψ′(x; θ) = ∂ψ(x; θ)/∂θ. Using the chain rule ⇒
  0 = −E_F[ψ(X; T_ε)] + (1 − ε) E_F[ψ′(X; T_ε)] (∂T_ε/∂ε) + ψ(x; T_ε) + ε ψ′(x; T_ε) (∂T_ε/∂ε).
Letting ε → 0 and using qualitative robustness, i.e. T(F_ε) → T(F), then gives
(the first term vanishes since E_F[ψ(X; T(F))] = 0)
  0 = E_F[ψ′(X; T(F))] IF(x; T, F) + ψ(x; T(F))  ⇒ the influence function given above.

ASYMPTOTIC NORMALITY OF M-ESTIMATES (Sketch)

Let θ = T(F). Using a Taylor series expansion of ψ(x_i; θ̂) about θ gives
  0 = Σ_{i=1}^n ψ(x_i; θ̂) = Σ_{i=1}^n ψ(x_i; θ) + (θ̂ − θ) Σ_{i=1}^n ψ′(x_i; θ) + ...,
or
  0 = √n {(1/n) Σ ψ(x_i; θ)} + √n (θ̂ − θ) {(1/n) Σ ψ′(x_i; θ)} + O_p(1/√n).
By the CLT, √n {(1/n) Σ ψ(x_i; θ)} →_d Z ∼ Normal(0, E_F[ψ(X; θ)²]).
By the WLLN, (1/n) Σ_{i=1}^n ψ′(x_i; θ) →_p E_F[ψ′(X; θ)].
Thus, by Slutsky's theorem,
  √n (θ̂ − θ) →_d −Z / E_F[ψ′(X; θ)] ∼ Normal(0, σ²),
where σ² = E_F[ψ(X; θ)²] / E_F[ψ′(X; θ)]² = E_F[IF(X; T, F)²].

• NOTE: Proving Fréchet differentiability is not necessary.

M-estimates of location: adaptively weighted means
• Recall translation equivariance and symmetry implies ψ(x; t) = ψ(x − t) and ψ(−r) = −ψ(r).
• Express ψ(r) = r u(r) and let w_i = u(x_i − θ); then
  0 = Σ_{i=1}^n ψ(x_i − θ) = Σ_{i=1}^n (x_i − θ) u(x_i − θ) = Σ_{i=1}^n w_i (x_i − θ)
  ⇒ θ̂ = Σ w_i x_i / Σ w_i
The weights are determined by the data cloud: observations far from θ̂ are heavily downweighted.

SOME COMMON M-ESTIMATES OF LOCATION

Huber's M-estimate
ψ(r) = c for r ≥ c,  r for |r| < c,  −c for r ≤ −c
• Given a bound on the GES, it has maximum efficiency at the normal model.
• MLE for the "least favorable distribution", i.e. the symmetric unimodal model with smallest Fisher Information within a "neighborhood" of the normal.
  LFD = normal in the middle and double exponential in the tails.

Tukey's Bi-weight M-estimate (or bi-square)
u(r) = {(1 − r²/c²)+}²,  where a+ = max{0, a}
• Linear near zero
• Smooth (continuous second derivatives)
• Strongly redescending to 0
• NOT AN MLE.

CAUCHY MLE

ψ(r) = (r/c) / (1 + r²/c²)

NOT A STRONG REDESCENDER.

COMPUTATIONS
• IRLS: Iteratively Re-weighted Least Squares algorithm:
  θ̂_{k+1} = Σ w_{i,k} x_i / Σ w_{i,k},
  where w_{i,k} = u(x_i − θ̂_k) and θ̂_0 is any initial value.
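The IRLS iteration for a Huber M-estimate of location can be sketched as follows (a sketch only: the median starting value, the MADAM auxiliary scale, and the tuning constant c = 1.345, a conventional choice at the normal, are my own choices; NumPy assumed):

```python
import numpy as np

def huber_location(x, c=1.345, tol=1e-10, max_iter=200):
    """Huber M-estimate of location via IRLS, starting from the median,
    with MADAM (1.4826 * MAD) as a fixed auxiliary scale."""
    x = np.asarray(x, dtype=float)
    theta = np.median(x)
    s = 1.4826 * np.median(np.abs(x - theta))
    if s == 0:
        return theta
    for _ in range(max_iter):
        r = np.abs(x - theta) / (c * s)
        w = np.where(r <= 1.0, 1.0, 1.0 / np.maximum(r, 1e-300))  # u(r) = min(1, 1/|r|)
        theta_new = np.sum(w * x) / np.sum(w)
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

print(huber_location([2, 4, 5, 10, 200]))  # stays inside the good data, unlike the mean (44.2)
```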
• Re-weighted mean = one-step M-estimate:
  θ̂_1 = Σ w_{i,0} x_i / Σ w_{i,0},
  where θ̂_0 is a preliminary robust estimate of location, such as the median.

PROOF OF CONVERGENCE (Sketch)
θ̂_{k+1} − θ̂_k = Σ w_{i,k} x_i / Σ w_{i,k} − θ̂_k = Σ w_{i,k} (x_i − θ̂_k) / Σ w_{i,k} = Σ ψ(x_i − θ̂_k) / Σ u(x_i − θ̂_k)

Note: if θ̂_k > θ̂, then θ̂_{k+1} < θ̂_k; and if θ̂_k < θ̂, then θ̂_{k+1} > θ̂_k.
Decreasing objective function
Let ρ(r) = ρ_o(r²), and suppose ρ_o′(s) ≥ 0 and ρ_o″(s) ≤ 0. Then
  Σ_{i=1}^n ρ(x_i − θ̂_{k+1}) < Σ_{i=1}^n ρ(x_i − θ̂_k).
• Examples:
  – Mean ⇒ ρ_o(s) = s
  – Median ⇒ ρ_o(s) = √s
  – Cauchy MLE ⇒ ρ_o(s) = log(1 + s).
• Includes redescending M-estimates of location.
• Generalization of the EM algorithm for mixtures of normals.

PROOF OF MONOTONE CONVERGENCE (Sketch)
Let r_{i,k} = x_i − θ̂_k. By a Taylor series with remainder term,
  ρ_o(r²_{i,k+1}) = ρ_o(r²_{i,k}) + (r²_{i,k+1} − r²_{i,k}) ρ_o′(r²_{i,k}) + (1/2)(r²_{i,k+1} − r²_{i,k})² ρ_o″(r²_{i,*}).
So,
  Σ ρ_o(r²_{i,k+1}) ≤ Σ ρ_o(r²_{i,k}) + Σ (r²_{i,k+1} − r²_{i,k}) ρ_o′(r²_{i,k}).
Now, r u(r) = ψ(r) = ρ′(r) = 2r ρ_o′(r²), and so ρ_o′(r²) = u(r)/2. Also,
  r²_{i,k+1} − r²_{i,k} = (θ̂_{k+1} − θ̂_k)² − 2(θ̂_{k+1} − θ̂_k)(x_i − θ̂_k).
Thus,
  Σ ρ_o(r²_{i,k+1}) ≤ Σ ρ_o(r²_{i,k}) − (1/2)(θ̂_{k+1} − θ̂_k)² Σ u(r_{i,k}) < Σ ρ_o(r²_{i,k}).
• Slow but sure ⇒ switch to Newton-Raphson after a few iterations.

PART 2 MORE ADVANCED CONCEPTS AND METHODS

SCALE EQUIVARIANCE

• M-estimates of location alone are not scale equivariant in general, i.e.
X_i → X_i + a ⇒ θ̂ → θ̂ + a  (location equivariant)
X_i → b X_i ⇏ θ̂ → b θ̂  (not scale equivariant)
(Exceptions: the mean and the median.)
• Thus, the adaptive weights are not dependent on the spread of the data:
[Sketch: two data clouds with different spreads; the same extreme points are equally downweighted regardless of the spread.]

SCALE STATISTICS: s_n
X_i → b X_i + a ⇒ s_n → |b| s_n
• Sample standard deviation:
  s_n = √[ (1/(n−1)) Σ_{i=1}^n (x_i − x̄)² ]
• MAD (or, more appropriately, MADAM): the Median Absolute Deviation About the Median.
s_n* = Median |x_i − Median|
s_n = 1.4826 s_n* →_p σ at Normal(µ, σ²)

Example: 2, 4, 5, 10, 12, 14, 200
• Median = 10
• Absolute deviations: 8, 6, 5, 0, 2, 4, 190
• MADAM = 5 ⇒ s_n = 7.413
• (Standard deviation = 72.7661)

M-estimates of location with auxiliary scale
θ̂ = arg min_θ Σ_{i=1}^n ρ((x_i − θ)/(c s_n))
or
θ̂ solves: Σ_{i=1}^n ψ((x_i − θ)/(c s_n)) = 0
• c: tuning constant
• s_n: consistent for σ at the normal

Tuning an M-estimate

• Given a ψ-function, define a class of M-estimates via
ψc(r) = ψ(r/c)
• Tukey’s Bi-weight.
ψ(r) = r {(1 − r²)+}²  ⇒  ψ_c(r) = (r/c) {(1 − r²/c²)+}²
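The tuned bi-weight ψ_c is a one-liner (a sketch; c = 4.685 is a common default aiming at high efficiency at the normal; NumPy assumed):

```python
import numpy as np

def psi_biweight(r, c=4.685):
    """Tukey bi-weight: psi_c(r) = (r/c) * ((1 - (r/c)^2)_+)^2.
    Redescends smoothly to 0: observations with |r| >= c get psi = 0."""
    z = np.asarray(r, dtype=float) / c
    return z * np.maximum(0.0, 1.0 - z * z) ** 2
```

Gross outliers (|r| ≥ c) receive exactly zero influence, which is the "automatic outlier detector" behavior described earlier.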
• c → ∞ ⇒ ≈ Mean
• c → 0 ⇒ locally very unstable
[Plots: tuned ψ-functions at the normal distribution, with auxiliary scale.]

MORE THAN ONE OUTLIER

• GES: a measure of local robustness.
• Measures of global robustness?
X_n = {X_1, ..., X_n}: n "good" data points
Y_m = {Y_1, ..., Y_m}: m "bad" data points
Z_{n+m} = X_n ∪ Y_m: an ε_m-contaminated sample, with ε_m = m/(n + m)
Bias of a statistic: |T(X_n ∪ Y_m) − T(X_n)|

MAX-BIAS and THE BREAKDOWN POINT
• Max-Bias under ε_m-contamination:
  B(ε_m; T, X_n) = sup_{Y_m} |T(X_n ∪ Y_m) − T(X_n)|
• Finite sample contamination breakdown point: Donoho and Huber (1983)
  ε_c*(T; X_n) = inf{ε_m | B(ε_m; T, X_n) = ∞}
• Other concepts of breakdown (e.g. replacement).
• The definition of BIAS can be modified so that BIAS → ∞ as T (Xn ∪ Ym) goes to the boundary of the parameter space. Example: If T represents scale, then choose
BIAS = | log{T (Xn ∪ Ym)} − log{T (Xn)} |. EXAMPLES
• Mean: ε_c* = 1/(n + 1)
• Median: ε_c* = 1/2
• α-trimmed mean: ε_c* = α
• α-Winsorized mean: ε_c* = α
• M-estimate of location with a monotone and bounded ψ function: ε_c* = 1/2

PROOF (sketch of lower bound)

Let K = sup_r |ψ(r)|.
0 = Σ_{i=1}^{n+m} ψ(z_i − T_{n+m}) = Σ_{i=1}^{n} ψ(x_i − T_{n+m}) + Σ_{i=1}^{m} ψ(y_i − T_{n+m})
⇒ |Σ_{i=1}^{n} ψ(x_i − T_{n+m})| = |Σ_{i=1}^{m} ψ(y_i − T_{n+m})| ≤ mK

Breakdown occurs ⇒ |T_{n+m}| → ∞; say T_{n+m} → −∞
⇒ Σ_{i=1}^{n} |ψ(x_i − T_{n+m})| → Σ_{i=1}^{n} |ψ(∞)| = nK

Therefore mK ≥ nK, i.e. m ≥ n, and so ε_c* ≥ 1/2. □

Population Version of the Breakdown Point, under contamination neighborhoods

• Model distribution: F
• Contaminating distribution: H
• ε-contaminated distribution: F_ε = (1 − ε)F + εH
• Max-Bias under ε-contamination:
  B(ε; T, F) = sup_H |T(F_ε) − T(F)|
• Breakdown point: Hampel (1968)
  ε*(T; F) = inf{ε | B(ε; T, F) = ∞}
• Examples:
  – Mean: T(F) = E_F(X) ⇒ ε*(T; F) = 0
  – Median: T(F) = F⁻¹(1/2) ⇒ ε*(T; F) = 1/2

ILLUSTRATION
[Plot of the max-bias curve B(ε; T, F): GES ≈ ∂B(ε; T, F)/∂ε |_{ε=0}; ε*(T; F) is the asymptote.]

Heuristic Interpretation of the Breakdown Point (subject to debate)
• Proportion of bad data a statistic can tolerate before becoming arbitrary or meaningless.
• If 1/2 of the data is bad, then one cannot distinguish between the good data and the bad data? ⇒ ε* ≤ 1/2?
• Discussion: Davies and Gather (2005), Annals of Statistics. Example: Redescending M-estimates of location with fixed scale
T(x_1, ..., x_n) = arg min_t Σ_{i=1}^n ρ(|x_i − t|/c)
Breakdown point: Huber (1984). For bounded increasing ρ with sup ρ(r) = 1,
  ε_c*(T; X_n) = [1 − A(X_n; c)/n] / [2 − A(X_n; c)/n],
where A(X_n; c) = min_t Σ_{i=1}^n ρ((x_i − t)/c).
• The breakdown point depends on X_n and c.
• ε_c* : 0 → 1/2 as c : 0 → ∞.
• For large c, T(x_1, ..., x_n) ≈ Mean!!

Explanation: relationship between redescending M-estimates of location and kernel density estimates
Chu, Glad, Godtliebsen and Marron (1998)
• Objective function: Σ_{i=1}^n ρ((x_i − t)/c)
• Kernel density estimate: f̂_h(x) ∝ Σ_{i=1}^n κ((x_i − x)/h)
• Relationship: κ ∝ 1 − ρ, and c = the window width h
• Example: κ(r) = (1/√(2π)) e^{−r²/2} and ρ(r) = 1 − e^{−r²/2}
  ⇒ Normal kernel ↔ Welsch's M-estimate
• Example: Epanechnikov kernel ⇒ skipped mean
• "Outliers" less compact than the "good" data ⇒ breakdown will not occur.
• Not true for monotonic M-estimates.

PART 3 ROBUST REGRESSION AND MULTIVARIATE STATISTICS

REGRESSION SETTING
• Data: (Y_i, X_i), i = 1, ..., n
  – Y_i ∈ ℝ: response
  – X_i ∈ ℝ^p: predictors
• Predict Y by X′β.
• Residual for a given β: r_i(β) = y_i − x_i′β
• M-estimates of regression
  – Generalization of the MLE for a symmetric error term:
    Y_i = X_i′β + ε_i, where the ε_i are i.i.d. symmetric.

M-ESTIMATES OF REGRESSION

• Objective function approach:
  min_β Σ_{i=1}^n ρ(r_i(β)), where ρ(r) = ρ(−r) and ρ ↑ for r ≥ 0.
• M-estimating equation approach:
  Σ_{i=1}^n ψ(r_i(β)) x_i = 0, e.g. ψ(r) = ρ′(r).
• Interpretation: adaptively weighted least squares.
◦ Express ψ(r) = r u(r) and let w_i = u(r_i(β̂)); then
  β̂ = [Σ_{i=1}^n w_i x_i x_i′]⁻¹ [Σ_{i=1}^n w_i y_i x_i]
• Computations via the IRLS algorithm.

INFLUENCE FUNCTIONS FOR M-ESTIMATES OF REGRESSION
IF(y, x; T, F) ∝ ψ(r) x, where r = y − x′T(F)
• Due to the residual: ψ(r)
• Due to the design: x
• GES = ∞, i.e. unbounded influence.
• Outlier 1: highly influential for any ψ function.
• Outlier 2: highly influential for monotonic, but not for redescending, ψ.

BOUNDED INFLUENCE REGRESSION

• GM-estimates (Generalized M-estimates).
  ◦ Mallows-type (Mallows, 1975):
    Σ_{i=1}^n w(x_i) ψ(r_i(β)) x_i = 0
    Downweights outlying design points (leverage points), even if they are good leverage points.
  ◦ General form (e.g. Maronna and Yohai, 1981):
    Σ_{i=1}^n ψ(x_i, r_i(β)) x_i = 0
• Breakdown points:
  – M-estimates: ε* = 0
  – GM-estimates: ε* ≤ 1/(p + 1)

LATE 70's - EARLY 80's
• Open problem: Is high breakdown point regression possible?
• Yes. Repeated Median. Siegel (1982).
  – Not regression equivariant
• Regression equivariance: for a ∈ ℝ and A nonsingular,
  (Y_i, X_i) → (a Y_i, A′X_i) ⇒ β̂ → a A⁻¹ β̂, i.e. Ŷ_i → a Ŷ_i
• Open problem: Is high breakdown point equivariant regression possible?
• Yes. Least Median of Squares.

Least Median of Squares (LMS)
Hampel (1984), Rousseeuw (1984)
min_β Median{ r_i(β)² | i = 1, ..., n }
• Alternatively: min_β MAD{ r_i(β) }
• Breakdown point: ε* = 1/2
• Location version: the SHORTH (Princeton Robustness Study), the midpoint of the SHORTest half.
• LMS: the mid-line of the shortest strip containing 1/2 of the data.
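In practice LMS is computed approximately via random elemental subsets (described later in the course). A minimal sketch for a simple line y = a + bx, with simulated data (all names and data are illustrative; NumPy assumed):

```python
import numpy as np

def lms_line(x, y, n_subsets=500, seed=0):
    """Approximate Least Median of Squares fit of y = a + b*x:
    each random pair of points gives an exact elemental fit, and we keep
    the fit minimizing the median squared residual (the LMS criterion)."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    best, best_crit = (0.0, 0.0), np.inf
    for _ in range(n_subsets):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue
        b = (y[j] - y[i]) / (x[j] - x[i])   # exact fit through two points
        a = y[i] - b * x[i]
        crit = np.median((y - a - b * x) ** 2)
        if crit < best_crit:
            best_crit, best = crit, (a, b)
    return best

# Line y = 2 + 3x with 40% gross outliers: LMS still recovers the line.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 30)
y = 2.0 + 3.0 * x + 0.01 * rng.normal(size=30)
y[:12] = 50.0
a, b = lms_line(x, y)
```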
• Problem: not √n-consistent, but only n^{1/3}-consistent:
  √n ||β̂ − β|| →_p ∞,  n^{1/3} ||β̂ − β|| = O_p(1)
• Not locally stable, e.g. the example is pure noise.

S-ESTIMATES OF REGRESSION
Rousseeuw and Yohai (1984)
• For S(·) an estimate of scale (about zero): min_β S(r_i(β))
• S = MAD ⇒ LMS
• S = sample standard deviation about 0 ⇒ Least Squares
• Bounded monotonic M-estimates of scale (about zero):
  Σ_{i=1}^n χ(|r_i|/s) = 0,
  for χ ↑ and bounded above and below. Alternatively,
  (1/n) Σ_{i=1}^n ρ(|r_i|/s) = ε,
  for ρ ↑ and 0 ≤ ρ ≤ 1.
• For LMS: ρ is a 0-1 jump function and ε = 1/2.
• Breakdown point: ε* = min(ε, 1 − ε)

S-estimates of Regression
• √n-consistent and asymptotically normal.
• Trade-off: a higher breakdown point ⇒ lower efficiency and higher residual gross error sensitivity at normal errors.
• One resolution: one-step M-estimate via IRLS. (Note: R, S-plus, Matlab.)

M-ESTIMATES OF REGRESSION WITH GENERAL SCALE
Martin, Yohai, and Zamar (1989)
min_β Σ_{i=1}^n ρ(|y_i − x_i′β| / (c s_n))
• Parameter of interest: β
• Scale statistic: sn (consistent for σ at normal errors)
• Tuning constant: c
• Monotonic bounded ρ ⇒ redescending M-estimate
• High breakdown point examples:
  – LMS and S-estimates
  – MM-estimates, Yohai (1987).
  – CM-estimates, Mendes and Tyler (1994).

s_n  vs.  Σ_{i=1}^n ρ(|y_i − x_i′β| / (c s_n))
• Each β gives one curve
• S minimizes horizontally
• M minimizes vertically

MM-ESTIMATES OF REGRESSION
Default robust regression estimate in S-plus (also SAS?)

• Given a preliminary estimate of regression with breakdown point 1/2; usually an S-estimate of regression.
• Compute a monotone M-estimate of scale about zero for the residuals, s_n: usually from the S-estimate.
• Compute an M-estimate of regression using the scale statistic s_n and any desired tuning constant c.
• Tune to obtain a reasonable ARE and residual GES.
• Breakdown point: ε* = 1/2

TUNING
• High breakdown point S-estimates are badly tuned M-estimates.
• Tuning an S-estimate affects its breakdown point.
• MM-estimates can be tuned without affecting the breakdown point.
• All known high breakdown point regression estimates are computationally intensive: a non-convex optimization problem.
• Approximate or stochastic algorithms.
• Random Elemental Subset Selection. Rousseeuw (1984).
  – Consider exact fits of lines to p of the data points ⇒ (n choose p) such lines.
  – Optimize the criterion over such lines.
  – For large n and p, randomly sample from such lines so that there is a good chance, say e.g. a 95% chance, that at least one elemental subset contains only good data, even if half the data is contaminated.

MULTIVARIATE DATA

Robust Multivariate Location and Covariance Estimates: parallel developments

• Multivariate M-estimates. Maronna (1976), Huber (1977).
  ◦ Adaptively weighted means and covariances.
  ◦ IRLS algorithms.
  ◦ Breakdown point: ε* ≤ 1/(d + 1), where d is the dimension of the data.
• Minimum Volume Ellipsoid Estimates. Rousseeuw (1985). ◦ MVE is a multivariate version of LMS.
• Multivariate S-estimates. Davies (1987), Lopuhaä (1989).
• Multivariate CM-estimates. Kent and Tyler (1996).
• Multivariate MM-estimates. Tatsuoka and Tyler (2000), Tyler (2002).

MULTIVARIATE DATA
• d-dimensional data set: X1,... Xn
• Classical summary statistics:
– Sample mean vector: X̄ = (1/n) Σ_{i=1}^n X_i
– Sample variance-covariance matrix:
  S_n = (1/n) Σ_{i=1}^n (X_i − X̄)(X_i − X̄)′ = {s_ij},
  where the s_ii are the sample variances and the s_ij the sample covariances.
• Maximum likelihood estimates under normality.

[Figure "Sample Mean and Covariance": scatter plot of the Hertzsprung−Russell Galaxy Data (Rousseeuw).]

Visualization ⇒ Ellipse containing half the data.
(X − X̄)′ S_n⁻¹ (X − X̄) ≤ c

Mahalanobis distances: d_i² = (X_i − X̄)′ S_n⁻¹ (X_i − X̄)
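Computing classical Mahalanobis distances is direct; a sketch with simulated data (all names mine; NumPy assumed). The same computation with robust estimates substituted for X̄ and S_n flags outliers more reliably:

```python
import numpy as np

def mahalanobis_sq(X, mu, S):
    """Squared Mahalanobis distances d_i^2 = (X_i - mu)' S^{-1} (X_i - mu)."""
    Z = np.asarray(X, dtype=float) - mu
    return np.einsum('ij,jk,ik->i', Z, np.linalg.inv(S), Z)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
X[0] = [6.0, 6.0]                                   # one gross outlier
d2 = mahalanobis_sq(X, X.mean(axis=0), np.cov(X, rowvar=False))
print(int(np.argmax(d2)))                           # index of the most outlying point
```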
[Plot: Mahalanobis distances ("MahDis") vs. Index.]
Mahalanobis angles: θ_{i,j} = cos⁻¹[ (X_i − X̄)′ S_n⁻¹ (X_j − X̄) / (d_i d_j) ]

[Plot: PC1 vs. PC2.]
Principal Components: Do not always reveal outliers. ENGINEERING APPLICATIONS
• Noisy signal arrays, radar clutter. Test for signal, i.e. detector, based on the Mahalanobis angle between the signal and the observed data.
• Hyperspectral Data. Adaptive cosine estimator = Mahalanobis angle.
• BIG DATA
• Cannot clean data by inspecting the observations; automatic methods are needed.

ROBUST MULTIVARIATE ESTIMATES
Location and Scatter (Pseudo-Covariance)
• Trimmed versions: complicated
• Weighted mean and covariance matrices
μ̂ = Σ_{i=1}^n w_i X_i / Σ_{i=1}^n w_i,   Σ̂ = (1/n) Σ_{i=1}^n w_i (X_i − μ̂)(X_i − μ̂)′,
where w_i = u(d_{i,0}) and d_{i,0}² = (X_i − μ̂_0)′ Σ̂_0⁻¹ (X_i − μ̂_0),
and with μ̂_0 and Σ̂_0 being initial estimates, e.g.
→ the sample mean vector and covariance matrix, or
→ high breakdown point estimates.

MULTIVARIATE M-ESTIMATES
• MLE-type estimates for elliptically symmetric distributions
• Adaptively weighted means and covariances:
  μ̂ = Σ_{i=1}^n w_i X_i / Σ_{i=1}^n w_i,   Σ̂ = (1/n) Σ_{i=1}^n w_i (X_i − μ̂)(X_i − μ̂)′
• w_i = u(d_i), where d_i² = (X_i − μ̂)′ Σ̂⁻¹ (X_i − μ̂)
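These defining equations are implicit in (μ̂, Σ̂) and are again solved by an IRLS fixed-point iteration. A sketch for the d-dimensional Cauchy MLE, whose weight function is u(d_i²) = (d + 1)/(1 + d_i²) (starting values, names, and the simulated demo are my own choices; NumPy assumed):

```python
import numpy as np

def cauchy_mle(X, tol=1e-9, max_iter=500):
    """Multivariate Cauchy MLE of location and scatter via IRLS:
    mu = sum w_i X_i / sum w_i,  Sigma = (1/n) sum w_i (X_i - mu)(X_i - mu)',
    with w_i = (d + 1) / (1 + d_i^2) recomputed at each iteration."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    mu, S = X.mean(axis=0), np.cov(X, rowvar=False)   # non-robust starting values
    for _ in range(max_iter):
        Z = X - mu
        d2 = np.einsum('ij,jk,ik->i', Z, np.linalg.inv(S), Z)  # squared Mahalanobis distances
        w = (d + 1.0) / (1.0 + d2)
        mu_new = w @ X / w.sum()
        Zn = X - mu_new
        S_new = (w[:, None] * Zn).T @ Zn / n
        done = np.abs(mu_new - mu).max() < tol and np.abs(S_new - S).max() < tol
        mu, S = mu_new, S_new
        if done:
            break
    return mu, S

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[0] = [1000.0, 1000.0]          # one gross outlier drags the sample mean to ~(5, 5)
mu_hat, S_hat = cauchy_mle(X)    # ...but barely moves the M-estimate of location
```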
• Implicit equations

PROPERTIES OF M-ESTIMATES
• Root-n consistent and asymptotically normal.
• Can be tuned to have desirable properties: – high efficiency over a broad class of distributions – smooth and bounded influence function
• Computationally simple (IRLS)
• Breakdown point: ε* ≤ 1/(d + 1)

HIGH BREAKDOWN POINT ESTIMATES
• MVE: Minimum Volume Ellipsoid ⇒ ε* = 0.5. The center and scatter matrix of the smallest ellipsoid covering at least half of the data.
• MCD, S-estimates, MM-estimates, CM-estimates ⇒ ε* = 0.5
• Reweighted versions based on high breakdown start.
• Computationally intensive: approximate/probabilistic algorithms; the elemental subset approach.

[Figures: the Hertzsprung−Russell Galaxy Data (Rousseeuw) scatter plot, shown first with the sample mean and covariance ellipse, then adding the MVE, then adding the Cauchy MLE.]
ELLIPTICAL DISTRIBUTIONS and AFFINE EQUIVARIANCE

SPHERICALLY SYMMETRIC DISTRIBUTIONS
Z = R U, where R ⊥ U, R > 0, and U ∼ Uniform on the unit sphere.

Density: f(z) = g(z′z)
[Plot: circular density contours in the (x, y) plane.]

ELLIPTICALLY SYMMETRIC DISTRIBUTIONS
X = A Z + µ ∼ E(µ, Γ; g), where Γ = A A′

⇒ f(x) = |Γ|^{−1/2} g((x − µ)′ Γ⁻¹ (x − µ))
[Plot: elliptical density contours.]

AFFINE EQUIVARIANCE
• Parameters of elliptical distributions:
  X ∼ E(µ, Γ; g) ⇒ X* = B X + b ∼ E(µ*, Γ*; g),
  where µ* = B µ + b and Γ* = B Γ B′
• Sample version:
  X_i → X_i* = B X_i + b, i = 1, ..., n
  ⇒ μ̂ → μ̂* = B μ̂ + b and V̂ → V̂* = B V̂ B′
• Examples:
  – Mean vector and covariance matrix
  – M-estimates
  – MVE

ELLIPTICALLY SYMMETRIC DISTRIBUTIONS
f(x; µ, Γ) = |Γ|^{−1/2} g((x − µ)′ Γ⁻¹ (x − µ)).

• If second moments exist, the shape matrix Γ ∝ the covariance matrix.
• For any affine equivariant scatter functional: Σ(F) ∝ Γ.
• That is, any affine equivariant scatter statistic estimates a matrix proportional to the shape matrix Γ.
NON-ELLIPTICAL DISTRIBUTIONS
• Different scatter matrices estimate different population quantities.
• Analogy: for non-symmetric distributions, the population mean ≠ the population median.

OTHER TOPICS
Projection-based multivariate approaches
• Tukey's data depth (or half-space depth). Tukey (1974).
• Stahel-Donoho’s Estimate. Stahel (1981). Donoho 1982.
• Projection-estimates. Maronna, Stahel and Yohai (1994). Tyler (1996).
• Computationally intensive.

TUKEY'S DEPTH

PART 5 ROBUSTNESS AND COMPUTER VISION
• Independent development of robust statistics within the computer vision / image understanding community.
• Hough transform: ρ = x cos(θ) + y sin(θ)
• y = a + b x, where a = ρ/sin(θ) and b = −cos(θ)/sin(θ)
• That is, an intersection in feature space refers to the line connecting two points.
• Hough: from estimation in data space to clustering in feature space. Find the centers of the clusters.
• Terminology:
  – Feature Space = Parameter Space
  – Accumulator = Elemental Fit
• Computation: RANSAC (RANdom SAmple Consensus)
  – Randomly choose a subset from the accumulator (random elemental fits).
  – Check to see how many data points are within a fixed neighborhood of each fit.

Alternative Formulation of the Hough Transform / RANSAC
Σ_i ρ(r_i) should be small, where ρ(r) = 1 for |r| > R and 0 for |r| ≤ R.
That is, a redescending M-estimate of regression with known scale. Note: the relationship to LMS.

Vision: need e.g. a 90% breakdown point, i.e. tolerate 90% bad data.

Definition of Residuals?

The Hough transform approach does not distinguish between:
• the regression line for regressing Y on X,
• the regression line for regressing X on Y,
• the orthogonal regression line.
(Note: small stochastic errors ⇒ little difference.)

GENERAL PARADIGM

• Line in 2-D: exact fit to all pairs
• Quadratic in 2-D: exact fit to all triples
• Conic sections: ellipse fitting, (x − µ)′ A (x − µ) = 1
  ◦ Linearizing: let x* = (x_1, x_2, x_1², x_2², x_1 x_2, 1)′
  ◦ Ellipse: a′ x* = 0
  ◦ Exact fit to all subsets of size 5.
• Hyperboloid fitting in 4-D
• Epipolar geometry: exact fit to
  ◦ 5 pairs: calibrated cameras (location, rotation, scale).
  ◦ 8 pairs: uncalibrated cameras (focal point, image plane).
  ◦ A hyperboloid surface in 4-D.
• Applications: 2D → 3D, motion.
• OUTLIERS = MISMATCHES