
A SHORT COURSE ON ROBUST STATISTICS

David E. Tyler, Rutgers, The State University of New Jersey

Web-Site: www.rci.rutgers.edu/~dtyler/ShortCourse.pdf

References

• Huber, P.J. (1981). Robust Statistics. Wiley, New York.

• Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. and Stahel, W.A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.

• Maronna, R.A., Martin, R.D. and Yohai, V.J. (2006). Robust Statistics: Theory and Methods. Wiley, New York.

PART 1: CONCEPTS AND BASIC METHODS

MOTIVATION

• Data Set: $X_1, X_2, \ldots, X_n$

• Statistical Model: $F(x_1, \ldots, x_n \mid \theta)$, where $\theta$ is an unknown parameter and $F$ is a known function.

• e.g. $X_1, X_2, \ldots, X_n$ i.i.d. Normal$(\mu, \sigma^2)$.

Q: Is it realistic to believe we don't know $(\mu, \sigma^2)$, but we know, e.g., the shape of the tails of the distribution?

A: The model is assumed to be approximately true, e.g. symmetric and unimodal (past experience).

Q: Are statistical methods which are good under the model reasonably good if the model is only approximately true?

ROBUST STATISTICS: formally addresses this issue.

CLASSIC EXAMPLE: MEAN vs. MEDIAN

• Symmetric distributions: $\mu$ = population mean = population median.

• Sample mean: $\bar{X} \approx \text{Normal}\left(\mu, \frac{\sigma^2}{n}\right)$

• Sample median: $\text{Median} \approx \text{Normal}\left(\mu, \frac{1}{n}\,\frac{1}{4f(\mu)^2}\right)$

• At the normal: $\text{Median} \approx \text{Normal}\left(\mu, \frac{\sigma^2}{n}\,\frac{\pi}{2}\right)$

• Asymptotic Relative Efficiency of the Median to the Mean:
$$\text{ARE}(\text{Median}, \bar{X}) = \frac{\text{avar}(\bar{X})}{\text{avar}(\text{Median})} = \frac{2}{\pi} = 0.6366$$

CAUCHY DISTRIBUTION: $X \sim \text{Cauchy}(\mu, \sigma^2)$

1  x − µ2−1 f(x; µ, σ) = 1 +    πσ  σ 

• Mean: $\bar{X} \sim \text{Cauchy}(\mu, \sigma^2)$

• Median $\approx \text{Normal}\left(\mu, \frac{\pi^2\sigma^2}{4n}\right)$

• $\text{ARE}(\text{Median}, \bar{X}) = \infty$, or $\text{ARE}(\bar{X}, \text{Median}) = 0$

• For $t$ on $\nu$ degrees of freedom:
$$\text{ARE}(\text{Median}, \bar{X}) = \frac{4}{(\nu - 2)\pi}\,\frac{\Gamma((\nu+1)/2)^2}{\Gamma(\nu/2)^2}$$

ν                  ≤ 2     3       4       5
ARE(Median, X̄)     ∞       1.621   1.125   0.960
ARE(X̄, Median)     0       0.617   0.888   1.041

MIXTURE OF NORMALS

Theory of errors: gives plausibility to normality.

$$X \sim \begin{cases} \text{Normal}(\mu, \sigma^2) & \text{with probability } 1 - \epsilon \\ \text{Normal}(\mu, (3\sigma)^2) & \text{with probability } \epsilon \end{cases}$$

i.e. not all measurements are equally precise.

$\epsilon = 0.10$:

$$X \sim (1-\epsilon)\,\text{Normal}(\mu, \sigma^2) + \epsilon\,\text{Normal}(\mu, (3\sigma)^2)$$
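The efficiency reversal under this contamination is easy to check by simulation. Below is a minimal Monte Carlo sketch in Python (the sample size, seed, and replication count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
eps, mu, sigma, n, reps = 0.10, 0.0, 1.0, 100, 10_000

means = np.empty(reps)
medians = np.empty(reps)
for r in range(reps):
    wide = rng.random(n) < eps                       # contaminated cases
    x = rng.normal(mu, np.where(wide, 3.0 * sigma, sigma))
    means[r] = x.mean()
    medians[r] = np.median(x)

# Efficiency of the median relative to the mean (ratio of variances)
print("ARE(Median, Mean) approx:", means.var() / medians.var())
```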

• Classic paper: Tukey (1960), "A survey of sampling from contaminated distributions."

◦ For ε > 0.10 ⇒ ARE(Median, X̄) > 1.

◦ The mean absolute deviation is more efficient than the sample standard deviation for ε > 0.01.

PRINCETON ROBUSTNESS STUDIES

Andrews, et al. (1972). Other estimates of location.

• α-trimmed mean: Trim a proportion α from both ends of the data set and then take the mean. (Throwing away data?)

• α-Winsorized mean: Replace a proportion α from both ends of the data set by the next closest observation and then take the mean.

• Example: 2, 4, 5, 10, 200

Mean = 44.2    Median = 5
20% trimmed mean = (4 + 5 + 10)/3 = 6.33
20% Winsorized mean = (4 + 4 + 5 + 10 + 10)/5 = 6.6
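Both estimates are straightforward to compute. The following Python sketch reproduces the numbers above; for the trimmed mean, scipy.stats.trim_mean offers a library equivalent.

```python
import numpy as np

def trimmed_mean(x, alpha):
    x = np.sort(np.asarray(x, dtype=float))
    k = int(alpha * len(x))              # number trimmed from each end
    return x[k:len(x) - k].mean()

def winsorized_mean(x, alpha):
    x = np.sort(np.asarray(x, dtype=float))
    k = int(alpha * len(x))
    x[:k] = x[k]                         # replace low end by next closest value
    x[len(x) - k:] = x[len(x) - k - 1]   # replace high end similarly
    return x.mean()

data = [2, 4, 5, 10, 200]
print(trimmed_mean(data, 0.20))      # (4 + 5 + 10)/3 = 6.33...
print(winsorized_mean(data, 0.20))   # (4 + 4 + 5 + 10 + 10)/5 = 6.6
```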

Measuring the robustness of a statistic

• Relative Efficiency over a range of distributional models.
– There exist estimates of location which are asymptotically most efficient for the center of any symmetric distribution. (Adaptive estimation, semi-parametrics.)
– Robust?

• Influence Function over a range of distributional models.

• Maximum Bias Function and the Breakdown Point.

Measuring the effect of an outlier (not modeled)

• Good Data Set: x1, . . . , xn−1

Statistic: Tn−1 = T (x1, . . . , xn−1)

• Contaminated Data Set: x1, . . . , xn−1, x

Contaminated Value: Tn = T (x1, . . . , xn−1, x)

THE SENSITIVITY CURVE (Tukey, 1970)

$$SC_n(x) = n(T_n - T_{n-1}) \quad\Longleftrightarrow\quad T_n = T_{n-1} + \frac{1}{n}\,SC_n(x)$$
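For example, the sensitivity curves of the mean and the median can be traced directly from this definition; the "good" data below are purely illustrative.

```python
import numpy as np

def sensitivity_curve(T, good, x):
    """SC_n(x) = n * (T(x_1,...,x_{n-1}, x) - T(x_1,...,x_{n-1}))."""
    n = len(good) + 1
    return n * (T(np.append(good, x)) - T(good))

good = np.array([2.0, 4.0, 5.0, 10.0, 12.0, 14.0])   # hypothetical good sample
for x in (0.0, 20.0, 1000.0):
    print(x, sensitivity_curve(np.mean, good, x),
          sensitivity_curve(np.median, good, x))
# The mean's curve is unbounded in x; the median's stays bounded.
```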

THE INFLUENCE FUNCTION (Hampel, 1969, 1974)

Population version of the sensitivity curve.

THE INFLUENCE FUNCTION

$T_n = T(F_n)$ estimates $T(F)$.

• Consider $F$ and the ε-contaminated distribution

$$F_\epsilon = (1-\epsilon)F + \epsilon\,\delta_x$$

where δx is the point mass distribution at x.

[Plot: the cdfs $F$ and $F_\epsilon$]

• Compare functional values: $T(F)$ vs. $T(F_\epsilon)$.

• Given qualitative robustness (continuity):

$$T(F_\epsilon) \to T(F) \text{ as } \epsilon \to 0$$

(e.g. the mean is not qualitatively robust)

• Influence Function (infinitesimal perturbation: Gâteaux derivative)

$$IF(x; T, F) = \lim_{\epsilon \to 0} \frac{T(F_\epsilon) - T(F)}{\epsilon} = \left.\frac{\partial}{\partial\epsilon}\,T(F_\epsilon)\right|_{\epsilon=0}$$

EXAMPLES

• Mean: T (F ) = EF [X].

$$T(F_\epsilon) = E_{F_\epsilon}[X] = (1-\epsilon)E_F[X] + \epsilon\,E[\delta_x] = (1-\epsilon)T(F) + \epsilon x$$

$$IF(x; T, F) = \lim_{\epsilon \to 0} \frac{(1-\epsilon)T(F) + \epsilon x - T(F)}{\epsilon} = x - T(F)$$

• Median: T (F ) = F −1(1/2)

$$IF(x; T, F) = \{2 f(T(F))\}^{-1}\,\mathrm{sign}(x - T(F))$$

Plots of Influence Functions

Gives insight into the behavior of a statistic:
$$T_n \approx T_{n-1} + \frac{1}{n}\,IF(x; T, F)$$

[Plots: influence functions of the Mean, Median, α-trimmed mean, and α-Winsorized mean (somewhat unexpected?)]

Desirable robustness properties for the influence function

• SMALL
– Gross Error Sensitivity: $GES(T; F) = \sup_x |IF(x; T, F)|$. $GES < \infty$ ⇒ B-robust (Bias-robust).
– Asymptotic Variance:
$$\text{Note: } E_F[IF(X; T, F)] = 0, \qquad AV(T; F) = E_F[IF(X; T, F)^2]$$

• Under general conditions, e.g. Fréchet differentiability,
$$\sqrt{n}\,(T(X_1, \ldots, X_n) - T(F)) \to \text{Normal}(0, AV(T; F))$$

• Trade-off at the normal model: smaller AV ↔ larger GES.
• SMOOTH (local shift sensitivity): protects e.g. against rounding errors.
• REDESCENDING to 0.

REDESCENDING INFLUENCE FUNCTION

• Example: a data set of male heights in cm: 180, 175, 192, ..., 185, 2020, 190, ...
• Redescender = automatic outlier detector.

CLASSES OF ESTIMATES

L-statistics: linear combinations of order statistics

• Let $X_{(1)} \le \ldots \le X_{(n)}$ represent the order statistics.
$$T(X_1, \ldots, X_n) = \sum_{i=1}^n a_{i,n} X_{(i)}$$
where the $a_{i,n}$ are constants.

• Examples:
◦ Mean: $a_{i,n} = 1/n$.
◦ Median: for $n$ odd, $a_{i,n} = 1$ for $i = \frac{n+1}{2}$ and $0$ otherwise; for $n$ even, $a_{i,n} = \frac{1}{2}$ for $i = \frac{n}{2}, \frac{n}{2}+1$ and $0$ otherwise.
◦ α-trimmed mean.  ◦ α-Winsorized mean.

• A general form for the influence function exists.
• Can obtain any desirable monotonic shape, but not redescending.
• Do not readily generalize to other settings.

M-ESTIMATES

Huber (1964, 1967). Maximum likelihood type estimates under non-standard conditions.

• One-Parameter Case. X1,...,Xn i.i.d. f(x; θ), θ ∈ Θ

• Maximum likelihood estimates
– Likelihood: $L(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^n f(x_i; \theta)$
– Minimize the negative log-likelihood:
$$\min_\theta \sum_{i=1}^n \rho(x_i; \theta) \quad\text{where}\quad \rho(x_i; \theta) = -\log f(x_i; \theta).$$
– Solve the likelihood equations:
$$\sum_{i=1}^n \psi(x_i; \theta) = 0 \quad\text{where}\quad \psi(x_i; \theta) = \frac{\partial \rho(x_i; \theta)}{\partial\theta}.$$

DEFINITIONS OF M-ESTIMATES

• Objective function approach: $\hat\theta = \arg\min_\theta \sum_{i=1}^n \rho(x_i; \theta)$

• M-estimating equation approach: $\sum_{i=1}^n \psi(x_i; \hat\theta) = 0$.

– Note: Unique solution when ψ(x; θ) is strictly monotone in θ.

• Basic examples.

– Mean. MLE for the Normal: $f(x) = (2\pi)^{-1/2} e^{-\frac{1}{2}(x-\theta)^2}$ for $x \in \mathbb{R}$.

$$\rho(x; \theta) = (x - \theta)^2 \quad\text{or}\quad \psi(x; \theta) = x - \theta$$

– Median. MLE for the Double Exponential: $f(x) = \frac{1}{2} e^{-|x-\theta|}$ for $x \in \mathbb{R}$.

$$\rho(x; \theta) = |x - \theta| \quad\text{or}\quad \psi(x; \theta) = \mathrm{sign}(x - \theta)$$

• ρ and ψ need not be related to any density, or to each other.
• Estimates can be evaluated under various distributions.

M-ESTIMATES OF LOCATION

A symmetric and translation equivariant M-estimate.

Translation equivariance

$X_i \to X_i + a \;\Rightarrow\; T_n \to T_n + a$, which gives $\rho(x; t) = \rho(x - t)$ and $\psi(x; t) = \psi(x - t)$.

Symmetric

$X_i \to -X_i \;\Rightarrow\; T_n \to -T_n$, which gives $\rho(-r) = \rho(r)$ or $\psi(-r) = -\psi(r)$.

Alternative derivation: a generalization of the MLE for the center of symmetry for a given family of symmetric distributions:

$$f(x; \theta) = g(|x - \theta|)$$

INFLUENCE FUNCTION OF M-ESTIMATES

M-FUNCTIONAL: T (F ) is the solution to EF [ψ(X; T (F ))] = 0.

IF (x; T,F ) = c(T,F )ψ(x; T (F ))

where $c(T, F) = -1\big/E_F\!\left[\frac{\partial\psi(X;\theta)}{\partial\theta}\right]$, evaluated at $\theta = T(F)$.

Note: EF [ IF (X; T,F ) ] = 0. One can decide what shape is desired for the Influence Function and then construct an appropriate M-estimate.

[Plots of ψ-shapes: $\psi(x) = x$ ⇒ Mean; $\psi(x) = \mathrm{sign}(x)$ ⇒ Median]

EXAMPLE

Choose
$$\psi(r) = \begin{cases} c & r \ge c \\ r & |r| < c \\ -c & r \le -c \end{cases}$$
where $c$ is a tuning constant.

Huber's M-estimate: an adaptively trimmed mean, i.e. the proportion trimmed depends upon the data.

DERIVATION OF THE INFLUENCE FUNCTION FOR M-ESTIMATES (Sketch)

Let $T_\epsilon = T(F_\epsilon)$, and so

$$0 = E_{F_\epsilon}[\psi(X; T_\epsilon)] = (1-\epsilon)\,E_F[\psi(X; T_\epsilon)] + \epsilon\,\psi(x; T_\epsilon).$$

Taking the derivative with respect to ε ⇒
$$0 = -E_F[\psi(X; T_\epsilon)] + (1-\epsilon)\frac{\partial}{\partial\epsilon}E_F[\psi(X; T_\epsilon)] + \psi(x; T_\epsilon) + \epsilon\,\frac{\partial}{\partial\epsilon}\psi(x; T_\epsilon).$$

Let $\psi'(x; \theta) = \partial\psi(x; \theta)/\partial\theta$. Using the chain rule ⇒
$$0 = -E_F[\psi(X; T_\epsilon)] + (1-\epsilon)\,E_F[\psi'(X; T_\epsilon)]\,\frac{\partial T_\epsilon}{\partial\epsilon} + \psi(x; T_\epsilon) + \epsilon\,\psi'(x; T_\epsilon)\,\frac{\partial T_\epsilon}{\partial\epsilon}.$$

Letting ε → 0 and using qualitative robustness, i.e. $T(F_\epsilon) \to T(F)$, then gives

$$0 = E_F[\psi'(X; T)]\,IF(x; T, F) + \psi(x; T(F)) \quad\Rightarrow\quad \text{the results above.}$$

ASYMPTOTIC NORMALITY OF M-ESTIMATES (Sketch)

Let $\theta = T(F)$. A Taylor series expansion of $\psi(x; \hat\theta)$ about $\theta$ gives
$$0 = \sum_{i=1}^n \psi(x_i; \hat\theta) = \sum_{i=1}^n \psi(x_i; \theta) + (\hat\theta - \theta)\sum_{i=1}^n \psi'(x_i; \theta) + \ldots,$$
or
$$0 = \sqrt{n}\Big\{\frac{1}{n}\sum_{i=1}^n \psi(x_i; \theta)\Big\} + \sqrt{n}(\hat\theta - \theta)\Big\{\frac{1}{n}\sum_{i=1}^n \psi'(x_i; \theta)\Big\} + O_p(1/\sqrt{n}).$$

By the CLT, $\sqrt{n}\big\{\frac{1}{n}\sum_{i=1}^n \psi(x_i; \theta)\big\} \to_d Z \sim \text{Normal}(0, E_F[\psi(X; \theta)^2])$.

By the WLLN, $\frac{1}{n}\sum_{i=1}^n \psi'(x_i; \theta) \to_p E_F[\psi'(X; \theta)]$.

Thus, by Slutsky's theorem,
$$\sqrt{n}(\hat\theta - \theta) \to_d -Z\big/E_F[\psi'(X; \theta)] \sim \text{Normal}(0, \sigma^2),$$
where
$$\sigma^2 = E_F[\psi(X; \theta)^2]\big/E_F[\psi'(X; \theta)]^2 = E_F[IF(X; T, F)^2]$$

• NOTE: Proving Fréchet differentiability is not necessary.

M-estimates of location: adaptively weighted means

• Recall translation equivariance and symmetry implies ψ(x; t) = ψ(x − t) and ψ(−r) = −ψ(r).

• Express $\psi(r) = r\,u(r)$ and let $w_i = u(x_i - \theta)$; then
$$0 = \sum_{i=1}^n \psi(x_i - \theta) = \sum_{i=1}^n (x_i - \theta)\,u(x_i - \theta) = \sum_{i=1}^n w_i\{x_i - \theta\}
\quad\Rightarrow\quad \hat\theta = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}$$

The weights are determined by the data cloud.

[Plot: a one-dimensional data cloud around $\hat\theta$, with the outlying points heavily downweighted]

SOME COMMON M-ESTIMATES OF LOCATION

Huber's M-estimate

$$\psi(r) = \begin{cases} c & r \ge c \\ r & |r| < c \\ -c & r \le -c \end{cases}$$

• Given a bound on the GES, it has maximum efficiency at the normal model.

• MLE for the "least favorable distribution", i.e. the symmetric unimodal model with the smallest Fisher information within a "neighborhood" of the normal. LFD = normal in the middle and double exponential in the tails.

Tukey's Bi-weight M-estimate (or bi-square)

 2  2  r    u(r) = 1 −  where a+ = max{0, a}  2   c +

• Linear near zero
• Smooth (continuous second derivatives)
• Strongly redescending to 0
• NOT AN MLE.

CAUCHY MLE

$$\psi(r) = \frac{r/c}{1 + r^2/c^2}$$

NOT A STRONG REDESCENDER.

COMPUTATIONS

• IRLS: Iteratively Re-weighted Least Squares algorithm.
$$\hat\theta_{k+1} = \frac{\sum_{i=1}^n w_{i,k}\,x_i}{\sum_{i=1}^n w_{i,k}},$$

where $w_{i,k} = u(x_i - \hat\theta_k)$ and $\hat\theta_0$ is any initial value.
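Below is a minimal Python sketch of this iteration for Huber's ψ, with $u(r) = \psi(r)/r$. The tuning constant c = 1.345 and the stopping rule are conventional but arbitrary choices, and the scale is treated as fixed here (auxiliary scale is discussed later).

```python
import numpy as np

def huber_weight(r, c=1.345):
    """u(r) = psi(r)/r for Huber's psi; equals 1 inside [-c, c]."""
    a = np.abs(r)
    return np.where(a <= c, 1.0, c / np.maximum(a, 1e-12))

def irls_location(x, c=1.345, tol=1e-8, max_iter=100):
    x = np.asarray(x, dtype=float)
    theta = np.median(x)                 # robust starting value
    for _ in range(max_iter):
        w = huber_weight(x - theta, c)
        new = np.sum(w * x) / np.sum(w)  # adaptively weighted mean update
        if abs(new - theta) < tol:
            break
        theta = new
    return theta

print(irls_location([2, 4, 5, 10, 200]))  # pulled far less than the mean
```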

• Re-weighted mean = one-step M-estimate:
$$\hat\theta_1 = \frac{\sum_{i=1}^n w_{i,0}\,x_i}{\sum_{i=1}^n w_{i,0}},$$

where $\hat\theta_0$ is a preliminary robust estimate of location, such as the median.

PROOF OF CONVERGENCE (Sketch)

$$\hat\theta_{k+1} - \hat\theta_k = \frac{\sum_{i=1}^n w_{i,k}\,x_i}{\sum_{i=1}^n w_{i,k}} - \hat\theta_k = \frac{\sum_{i=1}^n w_{i,k}(x_i - \hat\theta_k)}{\sum_{i=1}^n w_{i,k}} = \frac{\sum_{i=1}^n \psi(x_i - \hat\theta_k)}{\sum_{i=1}^n u(x_i - \hat\theta_k)}$$

Note: if $\hat\theta_k > \hat\theta$, then $\hat\theta_k > \hat\theta_{k+1}$; and if $\hat\theta_k < \hat\theta$, then $\hat\theta_k < \hat\theta_{k+1}$.

Decreasing objective function

Let $\rho(r) = \rho_o(r^2)$ and suppose $\rho_o'(s) \ge 0$ and $\rho_o''(s) \le 0$. Then
$$\sum_{i=1}^n \rho(x_i - \hat\theta_{k+1}) < \sum_{i=1}^n \rho(x_i - \hat\theta_k).$$

• Examples:
– Mean ⇒ $\rho_o(s) = s$
– Median ⇒ $\rho_o(s) = \sqrt{s}$

– Cauchy MLE ⇒ $\rho_o(s) = \log(1 + s)$.

• Includes redescending M-estimates of location.
• A generalization of the EM algorithm for a mixture of normals.

PROOF OF MONOTONE CONVERGENCE (Sketch)

Let $r_{i,k} = x_i - \hat\theta_k$. By a Taylor series with remainder term,
$$\rho_o(r_{i,k+1}^2) = \rho_o(r_{i,k}^2) + (r_{i,k+1}^2 - r_{i,k}^2)\,\rho_o'(r_{i,k}^2) + \frac{1}{2}(r_{i,k+1}^2 - r_{i,k}^2)^2\,\rho_o''(r_{i,*}^2).$$
So,
$$\sum_{i=1}^n \rho_o(r_{i,k+1}^2) < \sum_{i=1}^n \rho_o(r_{i,k}^2) + \sum_{i=1}^n (r_{i,k+1}^2 - r_{i,k}^2)\,\rho_o'(r_{i,k}^2).$$
Now,
$$r\,u(r) = \psi(r) = \rho'(r) = 2r\,\rho_o'(r^2), \quad\text{and so}\quad \rho_o'(r^2) = u(r)/2.$$
Also,
$$r_{i,k+1}^2 - r_{i,k}^2 = (\hat\theta_{k+1} - \hat\theta_k)^2 - 2(\hat\theta_{k+1} - \hat\theta_k)(x_i - \hat\theta_k).$$
Thus,
$$\sum_{i=1}^n \rho_o(r_{i,k+1}^2) < \sum_{i=1}^n \rho_o(r_{i,k}^2) - \frac{1}{2}(\hat\theta_{k+1} - \hat\theta_k)^2 \sum_{i=1}^n u(r_{i,k}) < \sum_{i=1}^n \rho_o(r_{i,k}^2)$$

• Slow but sure ⇒ switch to Newton-Raphson after a few iterations.

PART 2: MORE ADVANCED CONCEPTS AND METHODS

SCALE EQUIVARIANCE

• M-estimates of location alone are not scale equivariant in general, i.e.

$X_i \to X_i + a \;\Rightarrow\; \hat\theta \to \hat\theta + a$ (location equivariant)

$X_i \to b\,X_i \;\not\Rightarrow\; \hat\theta \to b\,\hat\theta$ (not scale equivariant)

(Exceptions: the mean and the median.)

• Thus, the adaptive weights are not dependent on the spread of the data.

[Plot: two data clouds with different spreads; without scale equivariance, the points are equally downweighted regardless of the spread]

SCALE STATISTICS: $s_n$

$$X_i \to b\,X_i + a \quad\Rightarrow\quad s_n \to |b|\,s_n$$

• Sample standard deviation:
$$s_n = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2}$$

• MAD (or, more appropriately, MADAM): the Median Absolute Deviation About the Median.

$$s_n^* = \text{Median}\,|x_i - \text{median}|, \qquad s_n = 1.4826\,s_n^* \to_p \sigma \;\text{at Normal}(\mu, \sigma^2)$$

Example: 2, 4, 5, 10, 12, 14, 200

• Median = 10
• Absolute deviations: 8, 6, 5, 0, 2, 4, 190

• MADAM = 5 ⇒ $s_n = 7.413$
• (Standard deviation = 72.7661)
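A short Python sketch of the MADAM, reproducing the example above; the factor 1.4826 makes it consistent for σ at the normal.

```python
import numpy as np

def madam(x, rescale=True):
    """Median absolute deviation about the median, optionally rescaled."""
    x = np.asarray(x, dtype=float)
    s = np.median(np.abs(x - np.median(x)))
    return 1.4826 * s if rescale else s

data = [2, 4, 5, 10, 12, 14, 200]
print(madam(data, rescale=False))  # 5
print(madam(data))                 # 7.413
print(np.std(data, ddof=1))        # 72.77..., dominated by the outlier
```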

  n x − θ  c X  i  θ = arg min ρ   i=1  c sn 

or

  n x − θ  c X  i  θ solves: ψ   = 0 i=1  c sn 

• c: tuning constant

• $s_n$: consistent for σ at the normal.

Tuning an M-estimate

• Given a ψ-function, define a class of M-estimates via

ψc(r) = ψ(r/c)

• Tukey’s Bi-weight.

 2  2 r  r  2 2   ψ(r) = r{(1 − r )+} ⇒ ψc(r) = 1 −   2  c  c +

• c → ∞ ⇒ ≈ Mean
• c → 0 ⇒ locally very unstable. (Tuning is applied to residuals standardized by the auxiliary scale.)

MORE THAN ONE OUTLIER

• GES: a measure of local robustness.
• Measures of global robustness?

Xn = {X1,...,Xn}: n “good” data points

Ym = {Y1,...,Ym}: m “bad” data points

$Z_{n+m} = X_n \cup Y_m$: an m-contaminated sample, with contamination fraction $\epsilon_m = \frac{m}{n+m}$.

Bias of a statistic: $|T(X_n \cup Y_m) - T(X_n)|$

MAX-BIAS and THE BREAKDOWN POINT

• Max-Bias under $\epsilon_m$-contamination:

$$B(\epsilon_m; T, X_n) = \sup_{Y_m} |T(X_n \cup Y_m) - T(X_n)|$$

• Finite sample contamination breakdown point: Donoho and Huber (1983)

$$\epsilon_c^*(T; X_n) = \inf\{\epsilon_m \mid B(\epsilon_m; T, X_n) = \infty\}$$

• Other concepts of breakdown (e.g. replacement).

• The definition of BIAS can be modified so that BIAS → ∞ as T (Xn ∪ Ym) goes to the boundary of the parameter space. Example: If T represents scale, then choose

$$\text{BIAS} = |\log\{T(X_n \cup Y_m)\} - \log\{T(X_n)\}|.$$

EXAMPLES

• Mean: $\epsilon_c^* = \frac{1}{n+1}$
• Median: $\epsilon_c^* = 1/2$
• α-trimmed mean: $\epsilon_c^* = \alpha$
• α-Winsorized mean: $\epsilon_c^* = \alpha$
• M-estimate of location with a monotone and bounded ψ function: $\epsilon_c^* = 1/2$

PROOF (sketch of the lower bound)

Let $K = \sup_r |\psi(r)|$.

$$0 = \sum_{i=1}^{n+m} \psi(z_i - T_{n+m}) = \sum_{i=1}^{n} \psi(x_i - T_{n+m}) + \sum_{i=1}^{m} \psi(y_i - T_{n+m})$$
$$\Big|\sum_{i=1}^{n} \psi(x_i - T_{n+m})\Big| = \Big|\sum_{i=1}^{m} \psi(y_i - T_{n+m})\Big| \le mK$$

Breakdown occurs ⇒ $|T_{n+m}| \to \infty$; say $T_{n+m} \to -\infty$. Then
$$\sum_{i=1}^{n} |\psi(x_i - T_{n+m})| \to \sum_{i=1}^{n} |\psi(\infty)| = nK.$$
Therefore $m \ge n$, and so $\epsilon_c^* \ge 1/2$. □

Population Version of the Breakdown Point (under contamination neighborhoods)

• Model Distribution: F
• Contaminating Distribution: H

• ε-contaminated Distribution: $F_\epsilon = (1-\epsilon)F + \epsilon H$

• Max-Bias under ε-contamination:

$$B(\epsilon; T, F) = \sup_H |T(F_\epsilon) - T(F)|$$

• Breakdown Point: Hampel (1968)
$$\epsilon^*(T; F) = \inf\{\epsilon \mid B(\epsilon; T, F) = \infty\}$$

• Examples
– Mean: $T(F) = E_F(X)$ ⇒ $\epsilon^*(T; F) = 0$
– Median: $T(F) = F^{-1}(1/2)$ ⇒ $\epsilon^*(T; F) = 1/2$

ILLUSTRATION

[Plot of the max-bias curve $B(\epsilon; T, F)$: $GES \approx \partial B(\epsilon; T, F)/\partial\epsilon\,|_{\epsilon=0}$; $\epsilon^*(T; F)$ = the asymptote]

Heuristic Interpretation of the Breakdown Point (subject to debate)

• Proportion of bad data a statistic can tolerate before becoming arbitrary or meaningless.

• If half the data is bad, then one cannot distinguish between the good data and the bad data? ⇒ $\epsilon^* \le 1/2$?

• Discussion: Davies and Gather (2005), Annals of Statistics.

Example: Redescending M-estimates of location with fixed scale

$$T(x_1, \ldots, x_n) = \arg\min_t \sum_{i=1}^n \rho\!\left(\frac{|x_i - t|}{c}\right)$$

Breakdown point. Huber (1984). For bounded increasing ρ,

$$\epsilon_c^*(T; X_n) = \frac{1 - A(X_n; c)/n}{2 - A(X_n; c)/n}$$

where $A(X_n; c) = \min_t \sum_{i=1}^n \rho\!\left(\frac{x_i - t}{c}\right)$ and $\sup_r \rho(r) = 1$.

• The breakdown point depends on $X_n$ and c.
• $\epsilon^*: 0 \to 1/2$ as $c: 0 \to \infty$.

• For large c, $T(x_1, \ldots, x_n)$ ≈ Mean!!

Explanation: the relationship between redescending M-estimates of location and kernel density estimates

Chu, Glad, Godtliebsen and Marron (1998)

• Objective function:
$$\sum_{i=1}^n \rho\!\left(\frac{x_i - t}{c}\right)$$

• Kernel density estimate:
$$\hat{f}_c(t) \propto \sum_{i=1}^n \kappa\!\left(\frac{x_i - t}{h}\right)$$

• Relationship: $\kappa \propto 1 - \rho$, with the tuning constant c playing the role of the window width h.

• Example: $\kappa(r) = \frac{1}{\sqrt{2\pi}}e^{-r^2/2}$, $\rho(r) = 1 - e^{-r^2/2}$: Normal kernel ⇒ Welsch's M-estimate.
• Example: Epanechnikov kernel ⇒ skipped mean.
• "Outliers" less compact than the "good" data ⇒ breakdown will not occur.
• Not true for monotonic M-estimates.

PART 3: ROBUSTNESS AND REGRESSION

SETTING

• Data: (Yi, Xi) i = 1, . . . , n

– $Y_i \in \mathbb{R}$: response
– $X_i \in \mathbb{R}^p$: predictors

• Predict Y by $X'\beta$.

• Residual for a given β: $r_i(\beta) = y_i - x_i'\beta$

• M-estimates of regression
– A generalization of the MLE for a symmetric error term:
$Y_i = X_i'\beta + \epsilon_i$, where the $\epsilon_i$ are i.i.d. symmetric.

M-ESTIMATES OF REGRESSION

• Objective function approach:
$$\min_\beta \sum_{i=1}^n \rho(r_i(\beta))$$
where $\rho(r) = \rho(-r)$ and $\rho \uparrow$ for $r \ge 0$.

• M-estimating equation approach:
$$\sum_{i=1}^n \psi(r_i(\beta))\,x_i = 0, \qquad\text{e.g. } \psi(r) = \rho'(r)$$

• Interpretation: adaptively weighted least squares.

◦ Express $\psi(r) = r\,u(r)$ and $w_i = u(r_i(\hat\beta))$:

$$\hat\beta = \Big[\sum_{i=1}^n w_i\,x_i x_i'\Big]^{-1}\Big[\sum_{i=1}^n w_i\,y_i x_i\Big]$$

• Computations via the IRLS algorithm (a sketch follows).
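A minimal Python sketch of this IRLS scheme, using Huber's ψ and treating the residual scale as fixed (a simplification; in practice the residuals would be standardized by a robust scale such as the MADAM):

```python
import numpy as np

def irls_regression(X, y, c=1.345, tol=1e-8, max_iter=100):
    """M-estimate of regression via iteratively reweighted least squares."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # LS start (not robust)
    for _ in range(max_iter):
        r = y - X @ beta
        a = np.abs(r)
        w = np.where(a <= c, 1.0, c / np.maximum(a, 1e-12))   # u(r)
        # Solve [sum w_i x_i x_i'] beta = [sum w_i y_i x_i]
        new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
        if np.max(np.abs(new - beta)) < tol:
            return new
        beta = new
    return beta
```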

INFLUENCE FUNCTIONS FOR M-ESTIMATES OF REGRESSION

$$IF(y, x; T, F) \propto \psi(r)\,x, \qquad r = y - x'T(F)$$

• Due to the residual: ψ(r)
• Due to the design: x
• GES = ∞, i.e. unbounded influence.

• Outlier 1: highly influential for any ψ function.
• Outlier 2: highly influential for monotonic but not redescending ψ.

BOUNDED INFLUENCE REGRESSION

• GM-estimates (Generalized M-estimates).

◦ Mallows-type (Mallows, 1975):
$$\sum_{i=1}^n w(x_i)\,\psi(r_i(\beta))\,x_i = 0$$
Downweights outlying design points (leverage points), even if they are good leverage points.

◦ General form (e.g. Maronna and Yohai, 1981):
$$\sum_{i=1}^n \psi(x_i, r_i(\beta))\,x_i = 0$$

• Breakdown points.
– M-estimates: $\epsilon^* = 0$
– GM-estimates: $\epsilon^* \le 1/(p+1)$

LATE 70's - EARLY 80's

• Open problem: Is high breakdown point regression possible?

• Yes: the Repeated Median, Siegel (1982).
– Not regression equivariant.

• Regression equivariance: for $a \in \mathbb{R}$ and A nonsingular,
$$(Y_i, X_i) \to (aY_i, A'X_i) \quad\Rightarrow\quad \hat\beta \to aA^{-1}\hat\beta, \quad\text{i.e. } \hat{Y}_i \to a\hat{Y}_i$$

• Open problem: Is high breakdown point equivariant regression possible?

• Yes: Least Median of Squares. Hampel (1984), Rousseeuw (1984).

Least Median of Squares (LMS)

Hampel (1984), Rousseeuw (1984)

$$\min_\beta \text{Median}\{r_i(\beta)^2 \mid i = 1, \ldots, n\}$$

• Alternatively: $\min_\beta \text{MAD}\{r_i(\beta)\}$
• Breakdown point: $\epsilon^* = 1/2$
• Location version: the SHORTH (Princeton Robustness Study),

the midpoint of the SHORTest Half. LMS: the mid-line of the shortest strip containing half of the data.
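A small Python sketch of the SHORTH, taking a "half" to be ⌊n/2⌋ + 1 consecutive order statistics (one common convention):

```python
import numpy as np

def shorth(x):
    """Midpoint of the shortest interval containing half of the data."""
    x = np.sort(np.asarray(x, dtype=float))
    h = len(x) // 2 + 1                       # size of a "half"
    widths = x[h - 1:] - x[:len(x) - h + 1]   # widths of contiguous halves
    i = np.argmin(widths)                     # shortest half starts at i
    return 0.5 * (x[i] + x[i + h - 1])        # its midpoint

print(shorth([2, 4, 5, 10, 12, 14, 200]))     # unaffected by the outlier
```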

• Problem: Not $\sqrt{n}$-consistent, but only $\sqrt[3]{n}$-consistent:
$$\sqrt{n}\,\|\hat\beta - \beta\| \to_p \infty, \qquad \sqrt[3]{n}\,\|\hat\beta - \beta\| = O_p(1)$$
• Not locally stable, e.g. when the example data is pure noise.

S-ESTIMATES OF REGRESSION

Rousseeuw and Yohai (1984)

• For $S(\cdot)$, an estimate of scale (about zero): $\min_\beta S(r_i(\beta))$
• S = MAD ⇒ LMS
• S = sample standard deviation about 0 ⇒ Least Squares
• Bounded monotonic M-estimates of scale (about zero):
$$\sum_{i=1}^n \chi(|r_i|/s) = 0$$
for $\chi \uparrow$, bounded above and below. Alternatively,
$$\frac{1}{n}\sum_{i=1}^n \rho(|r_i|/s) = \epsilon$$
for $\rho \uparrow$ and $0 \le \rho \le 1$.
• For LMS: ρ is a 0-1 jump function and ε = 1/2.
• Breakdown point: $\epsilon^* = \min(\epsilon, 1 - \epsilon)$

S-estimates of Regression

• $\sqrt{n}$-consistent and asymptotically normal.
• Trade-off: a higher breakdown point ⇒ lower efficiency and higher residual gross error sensitivity for normal errors.

• One resolution: a one-step M-estimate via IRLS. (Note: R, S-plus, Matlab.)

M-ESTIMATES OF REGRESSION WITH GENERAL SCALE

Martin, Yohai, and Zamar (1989)

 0  n | yi − x β | X  i  min ρ   β i=1 c sn

• Parameter of interest: β

• Scale statistic: sn (consistent for σ at normal errors)

• Tuning constant: c

• Monotonic bounded ρ ⇒ redescending M-estimate

• High breakdown point examples
– LMS and S-estimates
– MM-estimates, Yohai (1987).
– CM-estimates, Mendes and Tyler (1994).

[Plot: $\sum_{i=1}^n \rho\!\left(\frac{|y_i - x_i'\beta|}{c\,s_n}\right)$ versus $s_n$]

• Each β gives one curve

• S-estimates minimize horizontally • M-estimates minimize vertically

MM-ESTIMATES OF REGRESSION

The default robust regression estimate in S-plus (also SAS?)

• Given a preliminary estimate of regression with breakdown point 1/2, usually an S-estimate of regression:

• Compute a monotone M-estimate of scale about zero for the residuals, $s_n$: usually from the S-estimate.

• Compute an M-estimate of regression using the scale statistic $s_n$ and with any desired tuning constant c.

• Tune to obtain reasonable ARE and residual GES.

• Breakdown point: $\epsilon^* = 1/2$

TUNING

• High breakdown point S-estimates are badly tuned M-estimates.
• Tuning an S-estimate affects its breakdown point.
• MM-estimates can be tuned without affecting the breakdown point.

• All known high breakdown point regression estimates are computationally intensive: a non-convex optimization problem.

• Approximate or stochastic algorithms.

• Random Elemental Subset Selection. Rousseeuw (1984).
– Consider exact fits of lines to p of the data points ⇒ $\binom{n}{p}$ such lines.
– Optimize the criterion over such lines.
– For large n and p, randomly sample from such lines so that there is a good chance, say e.g. a 95% chance, that one of the elemental subsets contains only good data even if half the data is contaminated. (A sketch of this sample-size calculation follows.)
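The usual sample-size calculation behind the last point can be sketched as follows, with the 95% level and 50% contamination used as illustrative values:

```python
import math

def n_subsets(p, eps=0.5, prob=0.95):
    """Number of random elemental subsets of size p needed so that at
    least one is outlier-free with probability >= prob, when a fraction
    eps of the data is contaminated."""
    good = (1.0 - eps) ** p                    # P(one subset is all good)
    return math.ceil(math.log(1.0 - prob) / math.log(1.0 - good))

for p in (2, 3, 5, 8):
    print(p, n_subsets(p))                     # grows quickly with p
```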

MULTIVARIATE DATA

Robust Multivariate Location and Scatter Estimates: parallel developments

• Multivariate M-estimates. Maronna (1976). Huber (1977).
◦ Adaptively weighted mean and covariance.
◦ IRLS algorithms.
◦ Breakdown point: $\epsilon^* \le 1/(d+1)$, where d is the dimension of the data.

• Minimum Volume Ellipsoid Estimates. Rousseeuw (1985).
◦ The MVE is a multivariate version of LMS.

• Multivariate S-estimates. Davies (1987), Lopuhaä (1989).

• Multivariate CM-estimates. Kent and Tyler (1996).

• Multivariate MM-estimates. Tatsuoka and Tyler (2000). Tyler (2002).

MULTIVARIATE DATA

• d-dimensional data set: X1,... Xn

• Classical estimates:

– Sample mean vector: $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$
– Sample variance-covariance matrix:
$$S_n = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^T = \{s_{ij}\},$$
where $s_{ii}$ = sample variance and $s_{ij}$ = sample covariance.

• Maximum likelihood estimates under normality.

[Scatter plot: Hertzsprung-Russell Galaxy Data (Rousseeuw), with the sample mean and covariance ellipse]

Visualization ⇒ an ellipse containing half the data:

$$(X - \bar{X})^T S_n^{-1} (X - \bar{X}) \le c$$

Mahalanobis distances: $d_i^2 = (X_i - \bar{X})^T S_n^{-1} (X_i - \bar{X})$
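A short Python sketch of these squared distances from the classical estimates; large values flag candidate outliers, though the classical estimates themselves can be distorted by those same outliers.

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distances from the sample mean and covariance."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    Z = X - mu
    # Quadratic form z_i' S^{-1} z_i for each row z_i
    return np.einsum('ij,jk,ik->i', Z, np.linalg.inv(S), Z)
```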

[Plot: Mahalanobis distances $d_i$ versus case index for the galaxy data]

Mahalanobis angles: $\theta_{i,j} = \cos^{-1}\!\big((X_i - \bar{X})^T S_n^{-1} (X_j - \bar{X})/(d_i d_j)\big)$

[Plot: the galaxy data projected onto its first two principal components, PC1 and PC2]

Principal components do not always reveal outliers.

ENGINEERING APPLICATIONS

• Noisy signal arrays, radar clutter. Tests for signal, i.e. detectors, based on the Mahalanobis angle between the signal and the observed data.

• Hyperspectral Data. Adaptive cosine estimator = Mahalanobis angle.

• BIG DATA

• Cannot clean the data by inspecting individual observations. Need automatic methods.

ROBUST MULTIVARIATE ESTIMATES

Location and Scatter (Pseudo-Covariance)

• Trimmed versions: complicated

• Weighted mean and covariance matrices

$$\hat\mu = \frac{\sum_{i=1}^n w_i X_i}{\sum_{i=1}^n w_i}, \qquad \hat\Sigma = \frac{1}{n}\sum_{i=1}^n w_i (X_i - \hat\mu)(X_i - \hat\mu)^T$$

where $w_i = u(d_{i,o})$ and $d_{i,o}^2 = (X_i - \hat\mu_o)^T \hat\Sigma_o^{-1} (X_i - \hat\mu_o)$,

and with $\hat\mu_o$ and $\hat\Sigma_o$ being initial estimates, e.g.
→ the sample mean vector and covariance matrix, or
→ high breakdown point estimates.

MULTIVARIATE M-ESTIMATES

• MLE type for elliptically symmetric distributions

• Adaptively weighted mean and covariances

$$\hat\mu = \frac{\sum_{i=1}^n w_i X_i}{\sum_{i=1}^n w_i}, \qquad \hat\Sigma = \frac{1}{n}\sum_{i=1}^n w_i (X_i - \hat\mu)(X_i - \hat\mu)^T$$

• $w_i = u(d_i)$ and $d_i^2 = (X_i - \hat\mu)^T \hat\Sigma^{-1} (X_i - \hat\mu)$

• Implicit equations, solved by iteration (a sketch follows).
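Below is a minimal Python sketch of this fixed-point iteration. The weight function $u(s) = (d+1)/(1+s)$, applied to squared distances, is the one arising from the d-variate Cauchy MLE (one standard choice; the general form allows any suitable u).

```python
import numpy as np

def mv_m_estimate(X, tol=1e-8, max_iter=200):
    """Multivariate M-estimate of location and scatter via IRLS,
    using Cauchy MLE weights u(s) = (d + 1)/(1 + s)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    mu, V = X.mean(axis=0), np.cov(X, rowvar=False)    # classical start
    for _ in range(max_iter):
        Z = X - mu
        s = np.einsum('ij,jk,ik->i', Z, np.linalg.inv(V), Z)
        w = (d + 1.0) / (1.0 + s)                      # adaptive weights
        mu_new = (w[:, None] * X).sum(axis=0) / w.sum()
        Z = X - mu_new
        V_new = (w[:, None] * Z).T @ Z / n
        if (np.max(np.abs(V_new - V)) < tol
                and np.max(np.abs(mu_new - mu)) < tol):
            return mu_new, V_new
        mu, V = mu_new, V_new
    return mu, V
```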

PROPERTIES OF M-ESTIMATES

• Root-n consistent and asymptotically normal.

• Can be tuned to have desirable properties:
– high efficiency over a broad class of distributions
– smooth and bounded influence function

• Computationally simple (IRLS)

• Breakdown point: $\epsilon^* \le 1/(d+1)$

HIGH BREAKDOWN POINT ESTIMATES

• MVE: Minimum Volume Ellipsoid ⇒ $\epsilon^* = 0.5$. The center and scatter matrix of the smallest ellipsoid covering at least half of the data.

• MCD, S-estimates, MM-estimates, CM-estimates ⇒ $\epsilon^* = 0.5$

• Reweighted versions based on high breakdown start.

• Computationally intensive: approximate/probabilistic algorithms; the elemental subset approach.

[Scatter plots: Hertzsprung-Russell Galaxy Data (Rousseeuw), first with the sample mean and covariance ellipse, then adding the MVE ellipse, then adding the Cauchy MLE ellipse]

ELLIPTICAL DISTRIBUTIONS and AFFINE EQUIVARIANCE

SPHERICALLY SYMMETRIC DISTRIBUTIONS

Z = RU where R ⊥ U, R > 0,

and U ∼d Uniform on the unit sphere.

Density: $f(z) = g(z'z)$

[Plot: concentric circular density contours of a spherically symmetric distribution]

ELLIPTICALLY SYMMETRIC DISTRIBUTIONS

$X = AZ + \mu \sim \mathcal{E}(\mu, \Gamma; g)$, where $\Gamma = AA'$

$$\Rightarrow\quad f(x) = |\Gamma|^{-1/2}\,g\big((x - \mu)'\,\Gamma^{-1}(x - \mu)\big)$$

[Plot: concentric elliptical density contours of an elliptically symmetric distribution]

AFFINE EQUIVARIANCE

• Parameters of elliptical distributions: $X \sim \mathcal{E}(\mu, \Gamma; g)$ ⇒ $X^* = BX + b \sim \mathcal{E}(\mu^*, \Gamma^*; g)$, where $\mu^* = B\mu + b$ and $\Gamma^* = B\Gamma B'$.

• Sample version: $X_i \to X_i^* = BX_i + b$, $i = 1, \ldots, n$
$$\Rightarrow\quad \hat\mu \to \hat\mu^* = B\hat\mu + b \quad\text{and}\quad \hat{V} \to \hat{V}^* = B\hat{V}B'$$

• Examples
– Mean vector and covariance matrix
– M-estimates
– MVE

ELLIPTICALLY SYMMETRIC DISTRIBUTIONS

$$f(x; \mu, \Gamma) = |\Gamma|^{-1/2}\,g\big((x - \mu)'\,\Gamma^{-1}(x - \mu)\big).$$

• If second moments exist, the shape matrix Γ ∝ the covariance matrix.
• For any affine equivariant scatter functional: $\Sigma(F) \propto \Gamma$.
• That is, any affine equivariant scatter statistic estimates a matrix proportional to the shape matrix Γ.

NON-ELLIPTICAL DISTRIBUTIONS

• Different scatter matrices estimate different population quantities.
• Analogy: for non-symmetric distributions, population mean ≠ population median.

OTHER TOPICS

Projection-based multivariate approaches

• Tukey's data depth (or half-space depth). Tukey (1974).

• Stahel-Donoho's Estimate. Stahel (1981). Donoho (1982).

• Projection-estimates. Maronna, Stahel and Yohai (1994). Tyler (1996).

• Computationally intensive.

TUKEY'S DEPTH

PART 5: ROBUSTNESS AND COMPUTER VISION

• Independent development of robust statistics within the computer vision/image understanding community.

• Hough transform: $\rho = x\cos\theta + y\sin\theta$
• $y = a + bx$, where $a = \rho/\sin\theta$ and $b = -\cos\theta/\sin\theta$
• That is, an intersection in feature space refers to the line connecting two points.
• Hough: estimation in data space → clustering in feature space. Find the centers of the clusters.
• Terminology:
– Feature Space = Parameter Space
– Accumulator = Elemental Fit
• Computation: RANSAC (RANdom SAmple Consensus)
– Randomly choose a subset from the accumulator. (Random elemental fits.)
– Check to see how many data points are within a fixed neighborhood of the fit.

Alternative Formulation of the Hough Transform/RANSAC

$$\sum_i \rho(r_i) \text{ should be small, where}\quad \rho(r) = \begin{cases} 1, & |r| > R \\ 0, & |r| \le R \end{cases}$$

That is, a redescending M-estimate of regression with known scale. Note the relationship to LMS.

Vision: need e.g. a 90% breakdown point, i.e. tolerate 90% bad data.

Definition of Residuals?

The Hough transform approach does not distinguish between:

• Conic Sections: Ellipse Fitting (x − µ)0A(x − µ) = 1 0 2 2 ◦ Linearizing: Let x∗ = (x1, x2, x1, x2, x1x2, 1) 0 ◦ Ellipse: a x∗ = 0 ◦ Exact fit to all subsets of size 5. Hyperboloid Fitting in 4-D • Epipolar Geometry: Exact fit to ◦ 5 pairs: Calibrated cameras (location, rotation, scale). ◦ 8 pairs: Uncalibrated cameras (focal point, image plane). ◦ Hyperboloid surface in 4-D. • Applications: 2D → 3D, Motion. • OUTLIERS = MISMATCHES