
MINIMUM SQUARED ERROR ESTIMATION

by

Victor Charles Drastik

Submitted for the degree of Doctor of Philosophy, School of Mathematics, University of New South Wales.

August, 1983

Abstract

Let X_1, X_2, ..., X_n be independent and identically distributed random variables, each with distribution depending on one or more parameters. The random vector X usually depends on a parameter θ in a nonlinear fashion. We define a linearising transformation for θ to be a function g(X) such that the dependence of g(X) on θ is linear, in the sense that E g(X) is proportional to θ.

If we can find m such transformations g_1, g_2, ..., g_m, we may construct an estimator of θ as a compound linearising transformation

θ̂ = Σ_{i=1}^m c_i g_i(X).

The optimal estimator of this form may then be found by minimising the mean squared error of θ̂ with respect to the constants c_1, c_2, ..., c_m. We will call the combination of these two stages the Method of Minimum Mean Squared Error Estimation.

In this thesis, we discuss some approaches to finding linearising transformations. The resulting techniques are applied to several parametric problems, including a computer simulation study of estimation of scale parameters, in which a modification to the MLE is revealed to be almost optimal. Finally, we discuss distribution-free estimation of the centre of a symmetric distribution. In a computer simulation study, the estimators resulting from the MMSEE approach are found to be superior to the best estimators from the Princeton Robustness Study.

Acknowledgement

I would like to express my gratitude to Dr. Peter Cooke, my supervisor and friend, whose guidance and encouragement have been of immeasurable value to me in the course of my research.

Contents Page

Title Page 1

Abstract 2

Acknowledgement 3

Table of Contents 4

Notation 5

Chapter I     Introduction     8

Chapter II    Theoretical Development     14

Chapter III   Scale Parameter Simulation Results     39

Chapter IV    Distribution-free MMSEE     48

Appendix     74

Bibliography     76

Notation

exp(x), ln(x)   natural exponential and natural logarithm of x

Γ(n)   gamma function of n, defined as Γ(n) = ∫_0^∞ u^{n-1} e^{-u} du for n > 0

B(m,n)   beta function of m and n, defined as B(m,n) = ∫_0^1 u^{m-1} (1-u)^{n-1} du = Γ(m)Γ(n)/Γ(m+n) for m > 0, n > 0

n!   factorial n, defined as n! = Γ(n+1) for n ≥ 0. When n is integral, n! = n(n-1) ... 2.1

(n r)   binomial coefficient of n and r, defined as (n r) = n!/{r!(n-r)!} for 0 ≤ r ≤ n

[x]   integer part of x, defined as the largest integer less than or equal to x

h^{-1}(y)   the inverse function of h, defined as the value of x such that y = h(x)

x, A   vector x, matrix A

x^T, A^T   transpose of x, transpose of A

∫, Σ, Π   integration, summation, product operators

∂/∂x, d/dx   partial derivative, differentiation operators (f'(x) = d/dx f(x))

E, Var, Cov, MSE   expectation, variance, covariance, mean squared error operators

x_i   observed value of X_i

X_(1) ≤ X_(2) ≤ ... ≤ X_(n)   order statistics based on X_1, X_2, ..., X_n

x_(1) ≤ x_(2) ≤ ... ≤ x_(n)   ordered observations

X̄   sample mean, defined as X̄ = (1/n) Σ_{i=1}^n X_i

S²   sample variance, defined as S² = {1/(n-1)} Σ_{i=1}^n (X_i - X̄)²

F(x|θ)   cumulative distribution function of the random variable X, depending on the parameter vector θ

f(x|θ)   probability density function of the random variable X, depending on θ

N(μ,σ²)   Normal random variable with mean μ and variance σ²

U[a,b]   Uniform random variable over the interval [a,b]

α, β, γ   location, scale and shape parameters

μ, σ   mean, standard deviation

θ̂, θ̃   estimators of θ

c*   quantity c distinguished in some way (usually the optimal value of c)

n   sample size

LT   Linearising Transformation

MLE   Maximum Likelihood Estimation or Estimator or Estimate

MSE   Mean Squared Error

MMSEE   Minimum MSE Estimation or Estimator or Estimate

∀, ∈, ≈, ~   for all, belongs to, approximately equals, is distributed as

Character type   Use

Lower case Greek   scalars, parameters (Examples: ω, μ)

Lower case Roman   vectors, constants or coefficients, observations, functions (Examples: c, a, x, f)

Upper case Roman   matrices, random variables, functions (Examples: A^{-1}, X, F(x|θ))

Upper case Greek   spaces, operators (Examples: Θ, Σ)

Chapter I: Introduction

In the classical problem of estimation of a single parameter θ, the estimator is a function of independent and identically distributed random variables X_1, X_2, ..., X_n, each with distribution depending on θ (and possibly other parameters). We would like to find an estimator θ̂ which is "close" to θ. An ideal (or most concentrated) estimator is one which has its probability mass concentrated as closely as possible about the true parameter value θ. Unfortunately, although the property of being most concentrated is highly desirable, ideal estimators seldom exist. There are just too many possible estimators for any one of them to be most concentrated, so the concept of ideal estimation is not usable in practice. We need to find a criterion of closeness which also leads to estimators which are highly concentrated about θ.

The Mean Squared Error (MSE) of an estimator θ̂(X) of the parameter θ based on the sample X = (X_1, X_2, ..., X_n) is defined to be the average squared deviation of θ̂ from θ, and can be written as the variance of θ̂ plus the square of its bias; that is,

MSE(θ̂) = E{θ̂(X) - θ}² = Var(θ̂) + {E(θ̂ - θ)}².
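A tiny numerical sketch (my own addition, with an arbitrary Normal setting) of this decomposition, checking MSE = Var + bias² for a deliberately biased estimator of a Normal mean:

    import random

    random.seed(5)
    theta, n, reps = 2.0, 10, 100000
    estimates = []
    for _ in range(reps):
        xs = [random.gauss(theta, 1.0) for _ in range(n)]
        estimates.append(0.9 * sum(xs) / n)        # a biased estimator: 0.9 * sample mean

    mean_est = sum(estimates) / reps
    mse = sum((e - theta) ** 2 for e in estimates) / reps
    var = sum((e - mean_est) ** 2 for e in estimates) / reps
    bias = mean_est - theta
    print(mse, var + bias ** 2)    # the two numbers agree up to Monte Carlo error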

The principal advantage of the MSE criterion is that it severely penalises large deviations of θ̂ from θ; that is, the frequency of large errors is greatly reduced. Another advantage is that consistency in MSE implies asymptotic unbiasedness and consistency in probability. Asymptotically the estimator is concentrated arbitrarily closely to θ (but perhaps not optimally so).

Let Θ represent the set of all possible estimators θ̂ of θ.

A uniformly MSE-optimal estimator among all members of Θ cannot exist because Θ is too large. It includes estimators such as θ̂(X) = 0 for all X, which are extremely prejudiced in favour of a particular θ value. Therefore an estimator which has uniformly smallest MSE would necessarily have zero MSE for all θ. This is clearly impossible except for trivial cases. We must therefore restrict our attention to some suitable subset of Θ. The estimators in such a subset should possess some desirable property, usually by satisfying a constraint equation. The two most common constraints are

(1) unbiasedness, with constraint equation

E(θ̂) = θ   for all θ,

and (2) invariance, expressed as

location:  θ̂(X + c) = θ̂(X) + c ;   scale:  θ̂(cX) = c θ̂(X) ,   for all X and all c ∈ R¹.

The MSE-optimal estimators in the first subset are Uniformly Minimum Variance Unbiased Estimators and in the second are Pitman [16] estimators of location and scale.

In this thesis we consider estimators in a third subset of Θ, that defined by the constraint

θ̂(X) = Σ_{i=1}^m c_i g_i(X),

where {c_i} is a set of undetermined coefficients and {g_i} is a set of functions of the sample data. This constraint has the advantage that, unlike the other constraints, it immediately gives us a functional form for the estimator. Like the others it is not a strong restriction, and often the three methods develop formulae which differ only in constants. Another advantage is that once the functions {g_i} are chosen, the selection of the MSE-optimal member of the set is usually a matter of some elementary calculus, whereas the optimal estimators for the other methods are not at all obvious.

The Method of Minimum Mean Squared Error Estimation (MMSEE) is one of a class of methods of optimal algorithm selection generally known as "methods of undetermined coefficients" (see, for example, Gerald [8]). The principle behind these methods is that first we choose a class of formulae that are likely to be effective in solving a given problem, then we choose a particular member of the class by referring to some appropriate criterion of optimality. For example, a problem in numerical analysis is to find the area under a curve between two given points by approximate integration. The method of undetermined coefficients used by Gauss to solve this problem is as follows: first we select a class of formulae by considering the characteristics of the problem. We can write the value of the integral as

I = ∫_a^b f(x) dx = (b-a) f̄,

where f̄ is the average value of f over the domain [a,b]. One reasonable formula to approximate an average function value is a linear combination of several function values. Thus

Î = Σ_{i=1}^m w_i f(x_i)

is an approximation to I, where {x_i} is a set of points in [a,b] and {w_i} is a set of undetermined coefficients. Next, we select a criterion for deciding which Î is the best one for our purposes. There are several reasonable criteria available, but the one usually chosen is that Î should be exact for as many low-order polynomials as possible. Since we have 2m undetermined quantities, it must be possible to choose the {w_i} and {x_i} such that Î correctly evaluates the integrals of all polynomials of degree less than or equal to 2m-1.

Thus, in general, in order to select a class of formulae, we apply some prior knowledge about what sort of formula is likely to be suitable. To determine the values of arbitrary constants we apply a criterion which is an effective discriminant between good and bad members of the class.

MMSEE is not an entirely new idea: Kendall and Stuart [12] find the MMSEE of the Normal variance σ² based on the sample variance S², and Markowitz [15] finds the MMSEE of the Normal standard deviation σ based on the sample standard deviation S. Cox and Hinkley [5] discuss the idea of "estimation within a restricted class" and say "... in many contexts the criterion of unbiasedness is not particularly relevant; one reason is that an estimate of small bias and small variance will for most purposes be preferable to one with no bias and appreciable variance. This naturally raises the possibility of using mean squared error as the criterion to be minimised ... sometimes useful estimates can be obtained by minimising mean squared error within some natural family of estimates determined, for example, by invariance arguments."

Kendall and Stuart [13] conclude that having small MSE is a desirable property for an estimator and present the following example: given an estimator T which is unbiased for a parameter θ, we easily find that the multiple aT of T which estimates θ with smallest MSE is a*T, where

a* = θ² / (θ² + V)   and   V = Var(T).

Kendall and Stuart comment that "In general, V (and therefore a*) is a function of θ, so a*T is not a statistic usable for estimation." This is only true if no estimate for θ is available. However, Thompson [22] estimates a* by replacing all unknowns by estimates, and shows that a*T can sometimes have larger MSE than T itself for moderate values of V. Thompson lowers the MSE of the minimum variance unbiased linear estimator of θ by shrinking it toward some "natural" origin θ_0 in the parameter space. Unfortunately this does not result in a superior estimator, but merely buys decreased MSE near the "natural" origin at the cost of increased MSE elsewhere. Thompson concedes this point, but defends the method of shrinking toward a natural origin θ_0 because "... (1) we believe θ_0 is close to the true value of θ, or (2) we fear that θ_0 may be near the true value of θ; i.e. something bad happens if θ = θ_0 and we do not know about it."

This thesis is concerned with developing a two-stage method of estimation, which belongs to the class of optimal algorithm selection procedures known as methods of undetermined coefficients. In the first stage we choose, by some reasonable method, a subset of the set of all possible estimators, and in the second stage we select from this subset an estimator which has smallest MSE among all members of the subset. For this reason, we have called the method the Method of Minimum Mean Squared Error Estimation.

The inspiration for MMSEE came from the way in which we usually attack a new parametric estimation problem: we derive estimators using several common estimation methods such as the Methods of Moments, Maximum Likelihood, etc. We then compare the MSE's of these estimators and select the estimator which has smallest MSE overall. These estimation methods may have desirable features, but this does not guarantee that the resulting estimator will have small MSE. The idea behind the Method of MMSEE is that we should construct estimators which are explicitly designed to have small MSE.

Chapter II: Theoretical Development

The first stage of the Method of MMSEE is concerned with choosing an appropriate class of estimators within which to minimise MSE. An early solution to a problem of this type is given by Kendall and Stuart [12], where the MMSEE for the variance σ² of a Normal population is sought among the class of multiples aS² of the sample variance. The MSE-optimal value of a is found to be (n-1)/(n+1). Markowitz [15] solves the corresponding problem for the standard deviation σ of a Normal population in an analogous way, by searching in the class of multiples aS of the sample standard deviation. Details of the calculations are as follows: let σ̂ = aS be an estimator of σ. Then

MSE(σ̂) = E(aS - σ)²

and so

∂/∂a MSE(σ̂) = E{ 2(aS - σ) S }.

Setting the derivative to zero and solving for a, we obtain the optimal value

a* = σ E(S) / E(S²) = √2 Γ(n/2) / { √(n-1) Γ((n-1)/2) }.

In the first case, the data were not linearly related to the parameter of interest σ², since E X_(i) - μ = σ w_i, where X_(i) is the i-th order statistic. In the second case the observations were linearly related to σ, but a linear combination of order statistics was considered to be too difficult to use and a simpler transformation was found.
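A brief numerical check (my own sketch) of the optimal multiple aS for the Normal standard deviation: the closed form given above can be compared with σ E(S)/E(S²) estimated by Monte Carlo.

    import math, random

    def a_star_closed(n):
        # a* = sqrt(2) * Gamma(n/2) / (sqrt(n-1) * Gamma((n-1)/2))
        return math.sqrt(2.0) * math.gamma(n / 2.0) / (math.sqrt(n - 1.0) * math.gamma((n - 1.0) / 2.0))

    def a_star_mc(n, sigma=1.0, reps=200000):
        # Monte Carlo estimate of sigma * E(S) / E(S^2) for N(0, sigma^2) samples.
        random.seed(0)
        es = es2 = 0.0
        for _ in range(reps):
            xs = [random.gauss(0.0, sigma) for _ in range(n)]
            xbar = sum(xs) / n
            s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
            es += math.sqrt(s2)
            es2 += s2
        return sigma * (es / reps) / (es2 / reps)

    print(a_star_closed(10), a_star_mc(10))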

In most problems, the data X depend on the parameters in a nonlinear fashion. A linearising transformation (LT) for θ_r is a function g_r(X) such that the dependence of g_r(X) on θ_r is linear in the sense that

E g_r(X) = k_r θ_r,

where the constant k_r is nonzero and does not depend on θ_r. An unbiased estimator of θ_r is a LT with k_r identically 1.

The relationship between the data X and the parameters θ can be examined using the Probability Integral Transform (PIT): if X is a random variable with continuous distribution function F(x|θ), then U = F(X|θ) has the standard uniform distribution and we write U ~ U[0,1]. This provides us with a convenient way of generating variates from any given distribution. If U_i ~ U[0,1], then X_i = F^{-1}(U_i|θ) is a random variable with distribution function F(x|θ); that is, each X_i can be related through θ and F to a hypothetical U_i.

Initially we will assume that we are working with a one-parameter problem; that is, all but one element of the parameter vector θ are known and we wish to develop an estimator for that element.

For the usual location and scale model, the distribution function and density function can be written

F(x|μ, σ) = F( (x-μ)/σ ) ;   f(x|μ, σ) = (1/σ) f( (x-μ)/σ ).

From the PIT we can write the generating equation of X_i as

F( (X_i - μ)/σ ) = U_i,

where U_i ~ U[0,1], and so

X_i = μ + σ F^{-1}(U_i).   (1)

Given that μ is known, it is easy to show that the Maximum Likelihood Estimator (MLE) σ̂ of σ is the solution of

σ̂ = -(1/n) Σ_{i=1}^n { f'((X_i - μ)/σ̂) / f((X_i - μ)/σ̂) } (X_i - μ).   (2)

Now, from (1) we have

(X_i - μ)/σ̂ = (σ/σ̂) F^{-1}(U_i),

and so, letting

r = σ/σ̂   and   W_i = F^{-1}(U_i),   (3)

we can rewrite (2) as

-(1/n) Σ_{i=1}^n { f'(r W_i)/f(r W_i) } r W_i - 1 = 0.

This equation may be solved for r, which is an implicit function of n and U = (U_1, U_2, ..., U_n), say

1/r = h(U, n).

Then, from (3) we have

σ̂ = (1/r) σ = σ h(U, n).

Thus the MLE of σ separates into the product of the unknown parameter σ and a function of n standard uniform variates.

A similar separation is observed with the Bayes estimator of σ under a uniform prior,

σ̃ = { ∫_0^∞ σ L(μ,σ) dσ } / { ∫_0^∞ L(μ,σ) dσ },   (4)

where L(μ,σ) is the likelihood function. From (1), we have

(X_i - μ)/z = s W_i ,   where s = σ/z.

Changing the dummy variable in (4) from σ to z and transforming to s, we may write σ̃ in the form

σ̃ = σ { ∫_0^∞ s^{n-3} Π_{i=1}^n f(s W_i) ds } / { ∫_0^∞ s^{n-2} Π_{i=1}^n f(s W_i) ds } = σ h(U, n).

The separation is not always into multiplicative parts: the Bayes estimator for the location parameter α in the model F(x|α) = F(x-α) can be resolved into the form

α̃ = α + k(U, n).

Suppose we now consider the general location, scale and shape parameter model

F(x|α, β, γ) = F( {(x-α)/β}^{1/γ} ).

It is easy to show, using the above ideas, that it is possible to write the estimators of α, β and γ derived by the methods of Maximum Likelihood, Bayes, Pitman, Minimum Distance (for the distance function d(θ) = Σ_{i=1}^n {F_n(X_(i)) - F(X_(i)|θ)}², where F_n is the empirical distribution function), and indeed any estimation method which involves using the parameters and data together only in the form {(X_i - α)/β}^{1/γ}, as a LT

g(X) = θ h(U, n) + k(U, n),   (5)

where g(X) is α̂, β̂ or γ̂; θ is α, β or γ; neither h nor k depend on θ; E k(U, n) = 0 and E h(U, n) ≠ 0. The preceding examples suggest that usually either k = 0 or h = 1.

As a further example, consider the usual linear model

y = X β + ε,

where E(ε) = 0 and X is the n × p design matrix. The Least Squares estimator for β is

β̂* = (X^T X)^{-1} X^T y
    = (X^T X)^{-1} X^T (X β + ε)
    = β + (X^T X)^{-1} X^T ε.

The Least Squares estimator is not the only LT for β. Consider the transformation

β̂ = A y.

Now,

β̂_r = Σ_{j=1}^n a_rj y_j
     = Σ_{j=1}^n a_rj Σ_{i=1}^p x_ji β_i + Σ_{j=1}^n a_rj ε_j
     = β_r Σ_{j=1}^n a_rj x_jr + Σ_{i≠r} β_i Σ_{j=1}^n a_rj x_ji + Σ_{j=1}^n a_rj ε_j.

Hence, for β̂ to be a LT for β, we must have the following p² conditions on the matrix A:

Σ_{j=1}^n a_rj x_jr = 1 ;   r = 1,2,...,p,

Σ_{j=1}^n a_rj x_ji = 0 ;   i ≠ r ;   i, r = 1,2,...,p.

Linearising transformations may be thought of as crude estimators, the building blocks from which one can construct an optimal estimator. No one method of finding LT's always works well and it may be necessary to try more than one of the ideas below.

(A) Intuition : this method involves making an educated guess based on our experience with similar problems; for example, it is reasonable to try a linear combination of two or more order statistics to estimate a location parameter.

(B) Generalisation of other estimators: in most cases estimators for a parameter would already be known; for example, moment estimators, maximum likelihood estimators, minimum distance estimators, minimum chi-square estimators, etc. We consider them singly or together to see whether they suggest a functional form which we can use as a LT, or we can try to extend the estimators in some logical way to allow a more flexible transformation.

Consider the problem of estimation in the Uniform distribution U[0,θ]. The moment estimator for θ is twice the sample mean and the MLE is the largest order statistic. These are both linear combinations of the order statistics and so a logical generalisation of both is

θ̂ = Σ_{j=1}^n a_j X_(j),

particularly since X_(j) is itself a LT, as E X_(j) = {j/(n+1)} θ.

Restricting our attention to the class of linear combinations of order statistics, we can go to the second stage in the Method of MMSEE and find the element in that class with smallest MSE. In this case the problem involves only elementary calculus, as follows: differentiating MSE(θ̂) with respect to a_i, we have

∂ MSE(θ̂)/∂a_i = E{ 2 ( Σ_{j=1}^n a_j X_(j) - θ ) X_(i) }.

Equating the partial derivative to zero for i = 1,2,...,n, we obtain a system of equations to be solved for the optimal coefficients:

Σ_{j=1}^n a_j E{X_(i) X_(j)} = θ E X_(i) ;   i = 1,2,...,n.

If X ~ U[0,θ], then

E X_(i) = {i/(n+1)} θ   and   E X_(i) X_(j) = {i(j+1)/((n+1)(n+2))} θ² ,   i ≤ j.

Thus we obtain the solution

a_1 = a_2 = ... = a_{n-1} = 0 ,   a_n = (n+2)/(n+1).

Hence the MSE-optimal estimator in the class of linear combinations of order statistics is

θ̂* = {(n+2)/(n+1)} X_(n).
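A quick Monte Carlo sketch (my own) comparing this estimator with the two crude LT's it was built from, twice the sample mean and the largest order statistic:

    import random

    def mse(estimator, theta=1.0, n=10, reps=50000, seed=4):
        # Monte Carlo MSE of an estimator of theta for U[0, theta] samples of size n.
        random.seed(seed)
        total = 0.0
        for _ in range(reps):
            xs = [theta * random.random() for _ in range(n)]
            total += (estimator(xs) - theta) ** 2
        return total / reps

    n = 10
    print(mse(lambda xs: 2 * sum(xs) / len(xs)))          # moment estimator
    print(mse(lambda xs: max(xs)))                        # MLE
    print(mse(lambda xs: (n + 2) / (n + 1) * max(xs)))    # MMSEE, smallest MSE of the three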

(C) Inverse PIT: this method arises from the PIT, which we can use to examine the way in which a random variable is generated and how the generating process depends on the parameter of interest. We have already shown that any continuous random variable X (and hence the i-th order statistic X_(i)) may be considered to have been generated from a standard uniform variate by the generating equation

X_(i) = h(U_(i) | θ),

where θ is the parameter of interest, U_(i) is the i-th order statistic from a sample of size n from U[0,1] and all other parameters are assumed to be known. Interchanging argument and parameter, we can write

X_(i) = h(θ | U_(i)),

and finally

θ = h^{-1}( X_(i) | U_(i) ).   (6)

Replacing U_(i) by its expected value i/(n+1), we obtain the estimating equation

θ̂ = h^{-1}( X_(i) | i/(n+1) ).

This estimator is equivalent to that obtained by solving F(X_(i)|θ) = i/(n+1) for θ. Essentially, we are determining how the relationship between X and U depends on θ and using this information to construct an estimator. This idea is best illustrated by a familiar example: if X is a 3-parameter Weibull variate with distribution function

F(x|α, β, γ) = 1 - exp{ -((x-α)/β)^{1/γ} } ;   α < x < ∞,

then

X = α + β ( ln{(1-U)^{-1}} )^γ.   (7)

This generating equation shows us how the variability of the Weibull variate X is related to the variability of a standard uniform variate U through the parameters α, β and γ.

The Inverse PIT method is a one-parameter method in that we assume all parameters except one are known. If we are dealing with a model containing more than one unknown parameter, we must successively make each parameter the parameter of interest.

Suppose we assume first that β and γ are known and α is unknown. Making α the subject of the generating equation (7), we have

α = X - β ( ln{(1-U)^{-1}} )^γ.

This clearly implies that X - k is a LT for α. We can generalise this idea and propose several possible forms for an estimator of α:

(a)  α̂ = c_1 (X_(1) - d),

(b)  α̂ = Σ_{i=1}^n c_i (X_(i) - d),

(c)  α̂ = Σ_{i=1}^n c_i (X_(i) - d_i).

The MMSEE in the class of location-invariant estimators of α is Pitman's estimator, which is a special case of each of the above forms.

Now suppose we assume that α and γ are known and β is unknown. Rewriting the generating equation to make β the subject, we obtain

β = (X - α) / ( ln{(1-U)^{-1}} )^γ.

It is clear that X - α is a LT for β, so we can generalise to the following estimators:

(d)  β̂ = c_i (X_(i) - α),

(e)  β̂ = Σ_{i=1}^n c_i (X_(i) - α).

Determination of the optimal values of the arbitrary coefficients {c_i} is explored numerically by extensive Monte Carlo simulation in Chapter III.

Finally, assume that α and β are known and γ is unknown. We can rewrite the generating equation to make γ the subject:

γ = ln{(X-α)/β} / ln[ ln{(1-U)^{-1}} ],   (8)

which suggests the following as estimators:

(f)  γ̂ = c_i ln{ (X_(i) - α)/β },

(g)  γ̂ = Σ_{i=1}^n c_i ln{ (X_(i) - α)/β }.

Equation (8) clearly shows that ln{ (X_(i) - α)/β } is a LT for γ.

Once we have several LT's, we can combine them to form a compound LT, to which we can apply our optimality criterion (minimum MSE) to derive an optimal estimator. An obvious and most mathematically tractable way to combine them is a linear combination, say

θ̂ = Σ_{i=1}^m c_i g_i(X).   (9)

Thus, we are constraining our estimator to be in the class of linear combinations of given LT's. This is not as severe a restriction as may be supposed. Since each g_i is a LT, we can write

E g_i(X) = k_i θ   (k_i ≠ 0).

If g_i*(X) = g_i(X)/k_i, then g_i*(X) is an unbiased estimator of θ and we can write (9) as

θ̂ = Σ_{i=1}^m a_i g_i*(X)   (a_i = c_i k_i);

that is, our estimator is a linear combination of unbiased estimators of θ and is itself unbiased if the a_i's sum to 1.

When we combine several estimators, we are hoping to produce a better one. If the a_i's sum to 1, then we are minimising MSE(θ̂) subject to E(θ̂) = θ, and we are really seeking a minimum variance unbiased estimator in this class of estimators. For example, if k is identically zero in (5), then

g_i(X) = θ h_i(U, n)

and

θ̂ = Σ_{i=1}^m c_i g_i(X) = θ Σ_{i=1}^m c_i h_i(U, n) = θ H(c, U, n, m);

that is, θ̂ is itself just a LT, but one whose distribution depends on the arbitrary coefficients c as well as on U, n and m. Thus

MSE(θ̂) = E(θ̂ - θ)² = θ² E{ H(c, U, n, m) - 1 }²,

and so the MSE-optimal coefficients c* for the estimator θ̂ are those which minimise the mean squared deviation of H from 1.

We will now apply these ideas to several familiar examples.

Example (1) The Exponential distribution: the distribution function is

F(x|φ, β) = 1 - exp{ -(x-φ)/β } ;   φ ≤ x < ∞.

The MLE's for φ and β are φ̂ = X_(1) and β̂ = X̄ - X_(1).

The generating equation for an Exponential random variable X is

X = φ + β ln{(1-U)^{-1}}.

Suppose β is known. Making φ the subject of the generating equation implies that X - k is a LT for φ, as well as being location-invariant for φ. The Pitman estimator for φ is

φ̂* = X_(1) - β/n,

and this is the MMSEE in the class of location-invariant estimators.

Now assume that φ is known. Making β the subject of the generating equation, we obtain

β = (X - φ) / ln{(1-U)^{-1}},

which we can generalise to

β̂ = Σ_{j=1}^n c_j (X_(j) - φ).

Minimising the MSE of β̂ with respect to c = (c_1, c_2, ..., c_n), we obtain the following system of equations for the optimal c:

Σ_{j=1}^n c_j E{(X_(i) - φ)(X_(j) - φ)} = β E(X_(i) - φ) ;   i = 1,2,...,n.

From Sarhan [19], we obtain

E(X_(r) - φ) = β Σ_{i=1}^r (n+1-i)^{-1} ,

E{(X_(r) - φ)(X_(s) - φ)} = β² { Σ_{i=1}^m (n+1-i)^{-2} + Σ_{i=1}^r (n+1-i)^{-1} Σ_{i=1}^s (n+1-i)^{-1} },

where m is the minimum of r and s. Substituting and solving, we obtain

c_1 = c_2 = ... = c_n = 1/(n+1),

so the MMSEE for β in the class of linear combinations of order statistics minus a constant is

β̂* = {n/(n+1)} (X̄ - φ).

Substituting best estimates for unknown parameters and solving for φ̂* and β̂* simultaneously, we obtain

β̂* = X̄ - X_(1) ;   φ̂* = X_(1) - (X̄ - X_(1))/n.

The MMSEE's may be compared with Sarhan's UMVUE's,

β̃ = {n/(n-1)} (X̄ - X_(1)) ;   φ̃ = X_(1) - (X̄ - X_(1))/(n-1).

The comparison shows that the two forms are essentially identical and that we have decreased the MSE by introducing a little bias. These estimators for β are, however, uniformly dominated (in MSE) by the non-location-invariant estimators of Arnold [2] and Brewster [3]. This shows that the method of Inverse PIT does not necessarily lead to a class of estimators whose optimal member cannot be dominated by an estimator from another class.

Example (2) The Power Function distribution: this is a generalisation of the U[0,θ] distribution and has distribution function

F(x|θ, γ) = (x/θ)^{1/γ} ;   0 < x ≤ θ ,   0 < γ.

First we estimate θ on the assumption that γ is known. The MLE for θ is X_(n). The generating equation is X = θ U^γ, which indicates that X_(i) is a LT for θ and hence a generalised estimator is

θ̂ = Σ_{i=1}^n c_i X_(i).

Minimising the MSE of θ̂ with respect to c, we find the optimal value for c to be

c_1 = c_2 = ... = c_{n-1} = 0   and   c_n = 1 + γ/(n+γ).

This is not unexpected, since X_(n) is sufficient for θ.

Now we estimate γ on the assumption that θ is known. The MLE for γ is

γ̂ = (1/n) Σ_{i=1}^n ln( θ/X_(i) ).

Making γ the subject of the generating equation, we find

γ = ln(X/θ)/ln(U) = ln(θ/X)/ln(U^{-1}).

Considering this together with the MLE, it seems reasonable to generalise to

γ̂ = Σ_{i=1}^n a_i ln( θ/X_(i) ).

We could minimise the MSE of γ̂ directly, but in this case it is simpler to consider the transformation Y = ln(θ/X). It can easily be shown that Y has an Exponential distribution with truncation parameter zero and scale parameter γ. From the previous example, we see that the MMSEE for γ is

γ̂ = {n/(n+1)} Ȳ = {1/(n+1)} Σ_{i=1}^n ln( θ/X_(i) ),

so the estimating equations are

θ̂* = { 1 + γ̂*/(n + γ̂*) } X_(n)

and

γ̂* = {1/(n+1)} Σ_{i=1}^n ln( θ̂*/X_(i) ),

which we solve simultaneously for θ̂* and γ̂*. This is most easily done by successive approximation, starting with one of the MLE's, then substituting and resubstituting in one equation then the other until the estimates converge.
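A minimal sketch (my own, with arbitrary simulated data) of the successive-approximation scheme just described for the Power Function distribution:

    import math, random

    def power_function_mmsee(xs, tol=1e-10, max_iter=200):
        # Solve theta = (1 + g/(n+g)) * X_(n) and g = (1/(n+1)) * sum(ln(theta/x_i))
        # by substituting one equation into the other until convergence.
        n = len(xs)
        x_max = max(xs)
        gamma = sum(math.log(x_max / x) for x in xs) / n   # start from the MLE of gamma
        theta = x_max
        for _ in range(max_iter):
            theta_new = (1.0 + gamma / (n + gamma)) * x_max
            gamma_new = sum(math.log(theta_new / x) for x in xs) / (n + 1)
            if abs(theta_new - theta) < tol and abs(gamma_new - gamma) < tol:
                return theta_new, gamma_new
            theta, gamma = theta_new, gamma_new
        return theta, gamma

    # Example with data simulated from F(x|theta, gamma) = (x/theta)^(1/gamma), i.e. X = theta * U^gamma.
    random.seed(2)
    theta_true, gamma_true = 4.0, 1.5
    sample = [theta_true * random.random() ** gamma_true for _ in range(20)]
    print(power_function_mmsee(sample))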

Example (3) The Pareto distribution: the distribution function is

F(x|β, γ) = 1 - (β/x)^{1/γ} ;   β ≤ x < ∞.

The MLE's for β and γ are

β̂ = X_(1) ,   γ̂ = (1/n) Σ_{i=1}^n ln( X_(i)/β̂ ).

The Inverse PIT Method suggests estimators for β and γ of the form

β̂ = Σ_{i=1}^n a_i X_(i) ,   γ̂ = Σ_{i=1}^n c_i ln( X_(i)/β ).

Minimising the MSE of β̂ with respect to a, we obtain

a_1 = 1 - γ/(n-γ) ,   a_2 = ... = a_n = 0.

It is easily shown that Y = ln(X/β) has the Exponential distribution with truncation parameter zero and scale parameter γ, so, from Example (1), we know that the MMSEE of γ is

γ̂ = {n/(n+1)} Ȳ = {1/(n+1)} Σ_{i=1}^n ln( X_(i)/β ).

Hence, the estimators satisfy

β̂* = { 1 - γ̂*/(n - γ̂*) } X_(1)   and   γ̂* = {1/(n+1)} Σ_{i=1}^n ln( X_(i)/β̂* ),

which may be solved simultaneously for β̂* and γ̂* as in Example (2).

(D) Generalised Likelihood Analysis: this is a method in which we try to produce LT's from sample likelihoods. The whole-sample methods are Bayesian; that is, we assume a prior distribution for the unknown parameter θ, then obtain a posterior distribution by using the whole-sample likelihood in the usual way. We can then apply measures of central tendency to the posterior distribution of θ to find estimators; for example

(a) the mean of the posterior is the usual Bayes estimator and may be called quasi-Bayes when the prior is arbitrary.

(b) the mode of the posterior when the prior is uniform is just the usual MLE. When the prior is not uniform, the mode of the posterior may be called the Generalised MLE.

(c) the median of the posterior is not used much because of the difficulty of computing it. It usually lies between the mean and the mode, but may differ in form from both and so may suggest an appropriate functional form for an estimator.

The link between Generalised Likelihood Analysis and the previous work on selection of estimator classes lies in the choice of the prior f(θ|a), where a is a vector of arbitrary coefficients. The posterior, and hence all of the above estimators, will depend on a, which may then be selected to minimise MSE. This is an appealing idea theoretically, but it will usually be found that minimisation of the MSE of any of the whole-sample estimators above will not be possible analytically and will probably involve a lot of difficult numerical work. In principle, however, each prior f(θ|a) defines a class of estimators from which the MSE-optimal element may be chosen.

It is not clear how we should select the prior f(θ|a) when in fact there is no prior knowledge available about θ. It may be desirable to select a form which becomes uniform for some value of a, but it need not have moments and in fact need not even be a proper probability distribution (see Example (4)). In general, we are using the Bayes procedure not in the usual way, but merely as an artifice to find a class of formulae which depend on arbitrary constants a. These constants are then determined for the MSE-optimal member of the class.

The alternative to whole-sample methods is part-sample methods; in particular, methods based on the i-th order statistic X_(i) from a sample of size n. This may also be analysed in a Bayesian way, starting with a prior for θ and using the density of X_(i) as a sample likelihood to obtain a posterior distribution for θ given X_(i). As with the whole-sample case we may use the mean of the posterior (quasi-Bayes), the mode (Generalised MLE) or the median to produce an estimator based on X_(i). Finally, as above, we can analyse the posterior distribution of θ given X when the sample size is 1, again obtaining mean, mode and median estimators.

Once we have one or more estimators or LT's based on X_(i), we can generalise to an estimator for a full sample, as in the examples to follow.

Example (4): for the Pareto distribution the whole-sample likelihood is

L(β|x) = Π_{i=1}^n (1/γ) β^{1/γ} x_i^{-1/γ - 1} ;   0 < β ≤ X_(1).

Suppose the shape parameter γ is known and we assume a quasi-prior for β of the form f(β|r) = β^{-r}, where r is an undetermined constant. Then the posterior for β given x is

f(β|x) = (n/γ - r + 1) X_(1)^{-(n/γ - r + 1)} β^{n/γ - r} ;   0 < β ≤ X_(1).

Therefore the quasi-Bayes estimator for β is

β̂ = E(β|x) = { 1 - (n/γ - r + 2)^{-1} } X_(1).

This describes the class of estimators generated by the quasi-prior β^{-r}. From Example (3), we see that the MSE-optimal member of this class is

β̂* = { 1 - (n/γ - 1)^{-1} } X_(1).

Thus the optimal value of r is 3, which means that the prior for β is not a proper probability density. This value for r is also that used in Pitman's estimator for scale. In general, the optimal quasi-Bayes estimators for location and scale for the prior f(θ|r) = θ^{-r} coincide with the Pitman estimators for location (r = 0) and scale (r = 3). For comparison, the mode and median of the posterior of β are X_(1) and exp{ -(n/γ - r + 1)^{-1} ln(2) } X_(1) respectively.

Example (5): suppose we wish to estimate the Weibull shape parameter γ using quasi-Bayes estimation based on the density of X_(i), with an arbitrary power function quasi-prior for γ. The distribution function for the Weibull is

F(x|γ) = 1 - exp{ -((x-α)/β)^{1/γ} }.

The density of the i-th order statistic is

p_i(x) = {B(i, n+1-i)}^{-1} F_i^{i-1} (1 - F_i)^{n-i} f_i ,

where F_i = F(x_(i)|γ) and f_i = f(x_(i)|γ). Let the quasi-prior be f(γ|r) = γ^{-r}, where r is arbitrary. Then the mean of the posterior distribution of γ is

γ̂ = C(r, i, n) ln{ (X_(i) - α)/β },

where C(r, i, n) is a constant depending on r, i and n. This clearly generalises to

γ̂ = Σ_i c_i ln{ (X_(i) - α)/β }.

An advantage of this method is that we can use as few or as many order statistics as we like in the estimator, allowing for missing values, truncation, etc.

Example (6) Generalised Likelihood Analysis for the shape parameter of the Power Function distribution (see Example (2)): assuming θ is known, the MLE for γ is

γ̂ = (1/n) Σ_{i=1}^n ln( θ/X_(i) ).

Let the prior for γ be f(γ|r) = γ^{-r}, where 0 < γ < ∞. Then the posterior for γ is

f(γ|x) = { (n γ̂)^{n+r-1} / Γ(n+r-1) } γ^{-(n+r)} exp( -n γ̂/γ ),

and it is easy to show that the mode, median and mean of the posterior are all constant multiples of γ̂. From Example (2), the MMSEE of γ is {n/(n+1)} γ̂.

Example (7): the density of the one-parameter Gamma distribution is

f(x|α) = x^{α-1} e^{-x} / Γ(α) ;   α > 0.

The MLE for α is

α̂ = ψ^{-1}( (1/n) Σ_{i=1}^n ln{X_(i)} ),

where ψ(α) = (d/dα) ln{Γ(α)}.

When n = 1, α̂ = ψ^{-1}(ln{X}). It is easy to show that E(ln{X}) = ψ(α), which implies that ψ^{-1}(ln{X}) may be a LT for α. A generalised estimator for α is

α̂ = Σ_{i=1}^n c_i ψ^{-1}( ln{X_(i)} ).

In general, when no obvious estimator exists for a parameter θ, we may "manufacture" one by starting with some convenient transformation g(X), finding its expectation as a function of θ, then inverting the function to produce an estimator. Thus, if E g(X) = h(θ), then θ = h^{-1}{E g(X)} and an estimator for θ is θ̂ = h^{-1}{g(X)}. This may be generalised in some appropriate way to an estimator based on the whole sample. This idea is illustrated in the next example.

Example (8): the density of the Beta distribution is

f(x|p,q) = x^{p-1} (1-x)^{q-1} / B(p,q) ;   0 < x < 1 ,   p, q > 0.

Thus, if p+r > 0 and q+s > 0,

E{ X^r (1-X)^s } = B(p+r, q+s) / B(p, q),

and so

E(X) = p/(p+q).   (10)

Suppose that q is known and we wish to estimate p. Making p the subject of (10), we obtain

p = q E(X) / {1 - E(X)}.

This indicates that a logical estimator for p is

p̂ = q X / (1-X).

Now, if q > 1,

E(p̂) = q p/(q-1),

and hence an unbiased estimator for p is

p̂* = (q-1) X/(1-X).

Therefore a reasonable generalisation to the whole sample is

p̂ = Σ_{i=1}^n c_i X_(i)/(1 - X_(i)).

Similar results follow for q.

Example (9) The scale parameter σ of the N(μ,σ²) distribution, where μ is known: if g(X) = X², then E g(X) = μ² + σ². Making σ the subject, we obtain

σ = { E(X²) - μ² }^{1/2}.

This indicates that two "manufactured" generalisations are

(a)  σ̂ = Σ_{i=1}^n a_i | X_(i)² - μ² |^{1/2} ,

(b)  σ̂ = { (1/n) Σ_{i=1}^n | X_(i)² - μ² | }^{1/2}.

A variation on the idea of "manufacturing" estimators is to use the expectations of functions of order statistics: the expectation of a suitable function of X_(i), inverted in the same way, leads to a corresponding estimator.

Example (10) The scale parameter σ of the N(μ,σ²) distribution, where μ is known: if we write E X_(i) = μ + σ w_i, where w_i is a known constant, and make σ the subject, we obtain σ = (E X_(i) - μ)/w_i when w_i is nonzero. This suggests generalisations such as

σ̂ = Σ_{i=1}^n a_i (X_(i) - μ).

Example (11) The location parameter μ of the N(μ,σ²) distribution, where σ is known: an obvious generalisation of X̄, the MLE for μ, is

μ̂ = Σ_{j=1}^n a_j X_(j).

Minimising the MSE of μ̂ with respect to a, we obtain the following set of equations:

Σ_{j=1}^n a_j E{X_(i) X_(j)} = μ E X_(i) ;   i = 1,2,...,n.

Now, E X_(i) = μ + σ w_i and E{X_(i) X_(j)} = Cov(X_(i), X_(j)) + (μ + σ w_i)(μ + σ w_j).

Clearly the optimal coefficients in the MMSEE of μ depend on the unknown μ itself. This problem is overcome by substituting the MLE X̄ for μ. This makes the coefficients random variables. The MSE of this estimator was found by computer simulation. It had the same shape as that of Thompson's [22] "shrinkage" estimator for μ, of which it is a generalisation. The MSE is slightly smaller than that of X̄ for |μ| small and slightly larger for |μ| large.

Thompson's estimator for μ is μ̂ = c X̄, with the MSE-optimal value of c being

c* = { 1 + σ²/(n μ²) }^{-1}.

Thompson estimates c* by substituting X̄ for μ, but in the classical theory we choose c = 1, which, in effect, is equivalent to using c* with |μ| replaced by ∞.
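A brief sketch (my own choice of Normal setting) of this plug-in version of Thompson's estimator, with c* = {1 + σ²/(nμ²)}^{-1} estimated by replacing μ with X̄:

    import random

    def thompson_estimate(xs, sigma):
        # mu_hat = c_hat * xbar with c* = 1 / (1 + sigma^2 / (n * mu^2)), mu replaced by xbar.
        n = len(xs)
        xbar = sum(xs) / n
        c_hat = 1.0 / (1.0 + sigma ** 2 / (n * xbar ** 2))
        return c_hat * xbar

    # MSE comparison with the MLE xbar for a mu fairly close to the origin.
    random.seed(6)
    mu, sigma, n, reps = 0.5, 1.0, 10, 20000
    mse_mle = mse_thom = 0.0
    for _ in range(reps):
        xs = [random.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        mse_mle += (xbar - mu) ** 2
        mse_thom += (thompson_estimate(xs, sigma) - mu) ** 2
    print(mse_mle / reps, mse_thom / reps)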

Fortunately, the problem of coefficients depending on unknown parameters can be solved by using the approach in the above example.

In all cases examined by the author, the coefficients were fairly insensitive to errors in the estimates of the parameters.

Chapter III: Scale Parameter Simulation Results

In this chapter the ideas in Chapter II are implemented numerically using a large Monte Carlo simulation. The MLE's and MMSEE's of the scale parameter β of the Generalised Pareto, Weibull and Type I Extreme Value distributions are compared for several sample sizes and shape parameter values. The distribution functions are, respectively,

F(x|β) = 1 - { 1 + (x-φ)/β }^{-1/γ} ;   φ ≤ x < ∞,

F(x|β) = 1 - exp{ -((x-φ)/β)^{1/γ} } ;   φ ≤ x < ∞,

F(x|β) = exp{ -exp( -(x-φ)/β ) } ;   -∞ < x < ∞.

When φ is known, good estimators for the shape parameter γ may be found as follows: in the Weibull distribution, let Y = -ln{(X-φ)/β}. Then Y has distribution function

F(y|γ) = exp{ -exp(-y/γ) } ;   -∞ < y < ∞.

Thus estimation of the shape parameter in the Weibull is equivalent to estimation of the scale parameter in the Type I Extreme Value distribution. Similarly, we may show that estimation of the shape parameter in the Pareto is equivalent to estimation of the scale parameter in the Exponential by using the transformation Y = ln{ 1 + (X-φ)/β }.

The usual location and scale model may be written

F(x|φ, β) = F( (x-φ)/β ),   (11)

and hence, from the Probability Integral Transform, the generating equation for the random variable X is

X = φ + β F^{-1}(U),   (12)

where U ~ U[0,1]. Making β the subject of the generating equation, we have

β = (X - φ) / F^{-1}(U),   (13)

which implies that a generalised estimator of scale based on a sample of size n is

β̂ = Σ_{i=1}^n c_i (X_(i) - φ),   (14)

where {c_i} are undetermined coefficients.

The likelihood function is

L(β|x) = β^{-n} Π_{i=1}^n f( (x_i - φ)/β ).

The MLE for β is therefore the solution of

Σ_{i=1}^n { f'((X_i - φ)/β̂) / f((X_i - φ)/β̂) } (X_i - φ)/β̂ + n = 0.

This equation may be rewritten as

β̂ = -(1/n) Σ_{i=1}^n { f'((X_i - φ)/β̂) / f((X_i - φ)/β̂) } (X_i - φ) = Σ_{i=1}^n d_i (X_i - φ),   (15)

where the quantities {d_i} depend on {X_i}, except when the underlying distribution is Exponential. For example, in the Pareto distribution

d_i = (1/n)(1/γ + 1) β̂ / (X_i - φ + β̂).

In the simulation described below, it was found that the MMSEE coefficients for the more extreme order statistics X_(i) are smaller than the other coefficients, as might be expected. The above example shows that with Maximum Likelihood something similar occurs, but adaptively; rather than calculating the coefficient in advance (based on the expectations and covariances of the order statistics), the coefficient of X_i is directly determined by X_i itself.

Since we can easily find good estimators for φ, suppose we assume it to be known and hence, without loss of generality, let it be zero. Then the optimal values of the MMSEE coefficients {c_i} are the solutions of

Σ_{i=1}^n c_i E{X_(i) X_(j)} = β E X_(j) ;   j = 1,2,...,n.

It is possible to find the optimal coefficients directly, using the joint distribution of X_(i) and X_(j), followed by approximate integration of the defining equations for E X_(i) X_(j) and E X_(i). This involves many numerical difficulties, and the approach used in this simulation was to run a preliminary simulation of size equal to the main simulation and use this to find the required expectations empirically, by averaging over many samples. The coefficients may not be estimated accurately (for the larger sample sizes, some coefficients are occasionally negative and often not monotonic), but the relative behaviour of the MLE and the MMSEE is estimated very accurately. In fact, the simulation was first conducted with 1000, 2000, 5000 and 10000 samples per simulation run. The difference between the results (except for MMSEE coefficients) for 5000 and 10000 was almost negligible, so it was decided that 20000 samples per simulation would be accurate enough for our purposes.

The random number generator used was the so-called HP-25 algorithm, u_{i+1} = fractional part of (u_i + π)^5, which is known to be unbiased and generally well-behaved, except for a slight serial correlation. The random variables were generated using the PIT method.
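A minimal sketch (my own; the seed and parameter values are arbitrary, and this is not the thesis's actual routine) of this generation scheme: the HP-25-style recursion supplies uniform variates, and the Weibull generating equation X = φ + β(ln{(1-U)^{-1}})^γ turns them into Weibull data.

    import math

    def hp25_uniforms(seed, count):
        # u_{i+1} = fractional part of (u_i + pi)^5 ; seed must lie in (0, 1).
        u = seed
        out = []
        for _ in range(count):
            u = math.modf((u + math.pi) ** 5)[0]
            out.append(u)
        return out

    def weibull_sample(n, phi, beta, gamma, seed=0.12345):
        # PIT: X = phi + beta * (ln(1/(1-U)))**gamma
        return [phi + beta * (math.log(1.0 / (1.0 - u))) ** gamma
                for u in hp25_uniforms(seed, n)]

    print(weibull_sample(5, phi=0.0, beta=1.0, gamma=0.5))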

Several adjustments were necessary to make the numerical work easier. When the ML estimating procedures converged to a negative value, zero was substituted. In the Pareto routine, when convergence was to a value greater than 10000, zero was substituted. This means that the MSE estimates for MLE are slightly optimistic (improper convergence generally occurred in much less than 1% of samples). When n/γ is small, some extreme Pareto order statistics have infinite variances, so we automatically set their corresponding coefficients to zero, without using the preliminary simulation to estimate them.

The input data to the simulation program were

(a) sample size n (5, 10, 15, 20, 25 or 30),

(b) number of samples m (20000),

(c) scale parameter β,

(d) shape parameter γ: for the Weibull, γ = i/2, i = 1,2,...,6; for the Pareto, γ = ni/20, i = 1,2,...,9. The shape parameter is taken to be known, since estimation of γ may interfere with an accurate comparison of the two scale parameter estimators.

(e) initial random seed in the range (0,1),

(f) distribution type (Pareto, Weibull or Extreme Value).

Output consisted of

(a) bias (divided by β) and MSE (divided by β²) of the MLE (β̂_1) and the MMSEE (β̂_2),

(b) optimal modifying constant k* for MLE and MMSEE. This was the constant k for which k β̂ had the minimum MSE. It is found as follows:

MSE(k β̂) = E(k β̂ - β)²,

∂/∂k MSE(k β̂) = E{ 2(k β̂ - β) β̂ } = 0.

Thus

k* = E(β̂/β) / E{(β̂/β)²}.

The minimum MSE (divided by β²) is therefore

MSE*/β² = 1 - {E(β̂/β)}² / E{(β̂/β)²}.

The bias (divided by β) of the optimal estimator is

bias*/β = { E(k* β̂) - β }/β = {E(β̂/β)}² / E{(β̂/β)²} - 1 = -MSE*/β².

(A numerical sketch of this calculation is given after the output list below.)

The constant k* can be accurately estimated from the simulation. Since the MMSEE coefficients were also obtained by minimising MSE, we would expect the optimal constant for MMSEE to be very near to one. In fact this does occur throughout the simulation and so the modified MMSEE is virtually the same as the unmodified MMSEE.

*" *" ( c) MSE for the modified estimators k1 B1 and k2 B2 ,

( d) MSE-optimal coefficients, estimated from the Monte Carlo simulation, of a linear combination of the MLE and MMSEE. These coefficients are found as follows: let

A A A 2 Then minimising MSE(B') = E(c1 s1 + c2 s2 - B) with respect to -c , we obtain "' ,... ,... E ( B~~2] E (B;)

C = ,... 2 - ,... E (Bs2J E (Bs2] 45.

Table 1: MSE-optimal constant k* for modified MLE

 n \ γ    0.5    1.0    1.5    2.0    2.5    3.0
  5      .976   .832   .638   .442   .287   .164
 10      .988   .904   .785   .636   .498   .351
 15      .993   .940   .848   .733   .607   .485
 20      .993   .954   .882   .791   .682   .574
 25      .995   .961   .905   .825   .734   .638
 30      .996   .966   .918   .851   .771   .690

Table 2: Bias of unmodified MLE

 n \ γ    0.5    1.0    1.5    2.0    2.5    3.0
  5     -.025   .007   .074   .194   .391   .686
 10     -.013   .003   .036   .106   .202   .328
 15     -.010  -.002   .024   .067   .137   .208
 20     -.006  -.001   .020   .048   .099   .160
 25     -.005   .001   .013   .041   .081   .125
 30     -.004   .001   .013   .035   .067   .102

Table 3: MSE efficiency (%) of unmodified MLE relative to the MSE-optimal linear combination of MLE and MMSEE

 n \ γ    0.5    1.0    1.5    2.0    2.5    3.0
  5     98.77  82.59  58.77  35.93  19.62   9.16
 10     99.44  90.10  75.33  56.22  39.74  25.09
 15     99.70  94.20  82.44  67.78  51.71  38.55
 20     99.58  95.47  86.19  74.73  60.57  47.61
 25     99.76  96.03  89.24  78.50  66.44  54.95
 30     99.82  96.55  90.45  81.60  71.04  60.92

Table 4: Bias of modified MLE (equal to minus the optimal MSE)

 n \ γ    0.5     1.0     1.5     2.0     2.5     3.0
  5     -.0489  -.1627  -.3155  -.4726  -.6005  -.7226
 10     -.0249  -.0930  -.1870  -.2965  -.4016  -.5333
 15     -.0166  -.0618  -.1323  -.2174  -.3102  -.4140
 20     -.0126  -.0471  -.1009  -.1717  -.2504  -.3345
 25     -.0100  -.0389  -.0830  -.1416  -.2071  -.2829
 30     -.0083  -.0327  -.0697  -.1194  -.1780  -.2401

(e) MSE of the optimal combination in (d). The relative efficiency of all estimators was computed in relation to this optimal MSE.
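A small sketch (my own) of the calculation in item (b) above: k*, the minimum MSE and the corresponding bias can all be estimated from the simulated ratios β̂/β.

    import random

    def modified_estimator_summary(ratios):
        # ratios: simulated values of beta_hat / beta from the Monte Carlo runs.
        m1 = sum(ratios) / len(ratios)                  # estimate of E(beta_hat / beta)
        m2 = sum(r * r for r in ratios) / len(ratios)   # estimate of E{(beta_hat / beta)^2}
        k_star = m1 / m2
        mse_star = 1.0 - m1 * m1 / m2                   # minimum MSE divided by beta^2
        bias_star = -mse_star                           # bias of k* beta_hat divided by beta
        return k_star, mse_star, bias_star

    # Example: exponential-sample MLE beta_hat = sample mean, true beta = 1, n = 10.
    random.seed(3)
    n, reps = 10, 20000
    ratios = [sum(random.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]
    print(modified_estimator_summary(ratios))   # k* should be close to n/(n+1)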

The simulation results are so similar for all of the distributions that we give, in Tables 1, 2, 3 and 4 above, only those for the Weibull distribution; the comments following apply to all three distributions.

The Weibull distribution with shape parameter 1 is just the Exponential distribution. This allows us to check the accuracy of the simulation. The bias of the unmodified MLE (Table 2, column 2) should be zero and the optimal modifying constant k* for MLE (Table 1, column 2) should be n/(n+1). The largest percentage error in k* is -0.56% and the largest error in bias is 0.007.

As sample size was increased, the efficiencies of all estimators (modified and unmodified) increased, all biases approached zero quickly (approximately as 1/n) and the optimal constant k* for MLE quickly approached 1 (the biases of the modified estimators always remained negative and k* was always less than 1).

Suppose we now consider the linear combination of MLE and MMSEE from part (d). Let r_i = E(β̂_i/β). Then β̂_i* = β̂_i/r_i is unbiased for β, and we can write (16) as β̂' = a_1 β̂_1* + a_2 β̂_2*, where a_i = c_i r_i. The coefficients a_1 and a_2 are more meaningful than c_1 and c_2 since, if a_1 + a_2 = 1, β̂' is unbiased for β. For all three distributions a_1 (the coefficient of the unbiased MLE) is always much larger than a_2 (the coefficient of the unbiased MMSEE). Typically a_1 ≈ .85 and a_2 ≈ .15, indicating that the MLE is much more important to the accuracy of the combined estimator than is the MMSEE.

For all three distributions, the unmodified MLE is quite inefficient for nearly all combinations of sample size and shape parameter (being much worse for large shape parameter values), while the efficiency of the modified MLE is nearly always better than 99%. In fact, the bias and MSE of both modified estimators are nearly the same everywhere, the modified MLE usually being slightly better. We conclude that the modified MLE alone is quite adequate for the estimation of β, the only advantage of MMSEE being that its coefficients may be tabulated and hence it is convenient for hand calculation, whereas MLE is inevitably an iterative procedure which is time-consuming for large samples.

Chapter IV: Distribution-free MMSEE

In this chapter, we apply the theory of MMSEE to the problem of the estimation of the centre of a symmetric distribution. The estimator developed is of quasi-linear form, with strong associations with the estimators of Takeuchi [21], Johns [11], Switzer [20] and Jaeckel (see [1]). It is compared in a Monte Carlo simulation study to the most robust estimators in the Princeton Robustness Study (Andrews et al. [1]) and found to be superior. Future improvements which are computationally impractical at present are discussed, as well as the development of an adaptive "super-estimator".

The problems of nonparametric location estimation and robust location estimation are distinct theoretically, but we will consider them to be essentially identical in practice: the aim is to derive an estimator which has high efficiency irrespective of the underlying distribution.

Robustness can be thought of as the first derivative, in some sense, of efficiency; it is a measure of how much the efficiency changes when the underlying distribution changes. Over the past thirty years, attempts to produce robust estimators have mainly come from two approaches:

(1) theoretical robustness/contamination: in this approach, used by Huber, Hampel and others, we assume a basic density f, with a small contamination (or source of outliers) coming from a contaminating density f*.

(2) adaptive estimation: in the adaptive approach, pioneered by Hogg ([9] and [10]), we attempt to use the sample to provide some idea of the characteristics of the underlying distribution. We then select the estimator most suitable for that distribution.

In the first approach we attempt to identify and exclude the outliers and base our estimate on the remaining data. In the second method we try to estimate the underlying distribution as accurately as possible. The connection between the two is that as the distribution becomes more extreme (that is, has higher kurtosis), the proportion of outliers increases. As might be expected, the first approach generally does better when there is a significant proportion of outliers and the second does better when extreme contamination is small.

For the general location and scale model

F(x|μ, σ) = F( (x-μ)/σ ),

the Inverse PIT Method leads us to consider a linear combination of order statistics as an estimator for μ, say

μ̂ = Σ_{j=1}^n c_j X_(j).

Ideally we would minimise the MSE of μ̂ with respect to c to obtain the best estimator. However, we have restricted attention to unbiased estimators to reduce the analytical and computational complexity of the problem. Since the distribution of X is symmetric, the conditions Σ_{j=1}^n c_j = 1 and c_j = c_{n+1-j} ensure that μ̂ is unbiased. The problem of minimising MSE then reduces to a problem in Lagrange multipliers: minimise

E( Σ_{j=1}^n c_j X_(j) - μ )² - λ ( Σ_{j=1}^n c_j - 1 ),

where λ is a Lagrange multiplier. Now,

E( Σ_{j=1}^n c_j X_(j) - μ )² = E{ Σ_{j=1}^n c_j (X_(j) - μ) }²,

since Σ_{j=1}^n c_j = 1. Differentiating with respect to c_i, we obtain

∂/∂c_i = 2 E{ Σ_{j=1}^n c_j (X_(j) - μ)(X_(i) - μ) } - 2λ.

Now, if E X_(i) = μ + σ w_i and σ_ij = Cov{X_(i), X_(j)}, then

∂/∂c_i = 2 Σ_{j=1}^n c_j ( σ_ij + σ² w_i w_j ) - 2λ ;   i = 1,2,...,n
        = 2 Σ_{j=1}^n c_j σ_ij - 2λ   (since w_j = -w_{n+1-j} and c_j = c_{n+1-j}).

Setting the derivative to zero and writing this together with Σ_{j=1}^n c_j = 1, we have

[ V     -1 ] [ c ]   =   [ 0 ]
[ 1^T    0 ] [ λ ]       [ 1 ],

where V = (σ_ij). The solution is

c* = V^{-1} 1 / (1^T V^{-1} 1)   and   λ* = 1 / (1^T V^{-1} 1).

The fact that V is doubly symmetric ensures that c_j* = c*_{n+1-j}. This solution is essentially identical to that for Best Linear Unbiased Estimation of a mean, as considered by Lloyd [14].
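A minimal sketch (my own) of the coefficient formula c* = V^{-1}1/(1^T V^{-1} 1), using a small covariance matrix supplied by the caller:

    def blue_weights(V):
        # c* = V^{-1} 1 / (1^T V^{-1} 1): solve V x = 1, then normalise x to sum to one.
        n = len(V)
        a = [row[:] + [1.0] for row in V]        # augmented system V x = 1
        for k in range(n):
            p = max(range(k, n), key=lambda r: abs(a[r][k]))
            a[k], a[p] = a[p], a[k]
            for r in range(n):
                if r != k:
                    f = a[r][k] / a[k][k]
                    for c in range(k, n + 1):
                        a[r][c] -= f * a[k][c]
        x = [a[i][n] / a[i][i] for i in range(n)]
        s = sum(x)
        return [xi / s for xi in x]

    # Toy example: covariance matrix of the three order statistics from U[0, 1], n = 3,
    # using Cov(U_(i), U_(j)) = i(n+1-j)/((n+1)^2 (n+2)) for i <= j.
    V = [[3/80, 2/80, 1/80],
         [2/80, 4/80, 2/80],
         [1/80, 2/80, 3/80]]
    print(blue_weights(V))   # symmetric weights, c_1 = c_3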

It now remains to find an estimate of V from the sample. We use the same method as is used for the corresponding problem in parametric estimation: that is, substitute estimates for unknowns wherever necessary. Now,

E{X_(i)} = ∫ x p_i(x) dx   and   E{X_(i)²} = ∫ x² p_i(x) dx,

where

p_i(x) = {B(i, n+1-i)}^{-1} F^{i-1}(x) {1 - F(x)}^{n-i} f(x).

The unknowns are the endpoints X_(0) and X_(n+1) of the range, together with F and f. We estimate F by the piecewise-linear distribution function

Ĉ(x) = i/(n+1) + (x - X_(i)) / {(n+1) Δ_i}   for X_(i) ≤ x ≤ X_(i+1),

and f by its derivative

Ĉ'(x) = {(n+1) Δ_i}^{-1},

where, for i = 0,1,2,...,n, Δ_i = X_(i+1) - X_(i), and X_(0) and X_(n+1) are defined to be extrapolated estimates of the endpoints of the distribution,

X_(0) = { h(n)(X_(1) - S_1) - h(1)(X_(n) - S_n) } / { h²(n) - h²(1) },

X_(n+1) = { h(n)(X_(n) - S_n) - h(1)(X_(1) - S_1) } / { h²(n) - h²(1) },

where h(i) = i/(n+1), a(i) = h(i+1) - 2h(i) + h(i-1), S_n = Σ_{i=1}^n a(i) X_(i) and S_1 = Σ_{i=1}^n a(i) X_(n+1-i).

We now evaluate the estimates of E{X_(r)} and E{X_(r) X_(s)}. Our estimate of E{X_(r)} is of the form

Ê{X_(r)} = {B(r, n+1-r)}^{-1} ∫ x F̂^{r-1}(x) {1 - F̂(x)}^{n-r} f̂(x) dx.

Using Ĉ to estimate F, Ĉ' to estimate f and, for convenience, writing X_(0) and X_(n+1) for the limits of integration, we expand {1 - Ĉ(x)}^{n-r} binomially and integrate piecewise:

Ê{X_(r)} = {B(r, n+1-r)}^{-1} Σ_{i=0}^n Σ_{j=0}^{n-r} (-1)^j (n-r choose j) ∫_{X_(i)}^{X_(i+1)} x Ĉ^{j+r-1}(x) Ĉ'(x) dx

         = {B(r, n+1-r)}^{-1} Σ_{i=0}^n Σ_{j=0}^{n-r} (-1)^j (n-r choose j) V_1(i, j+r-1),

where, for i = 0,1,...,n, m = 0,1,...,n-1 and a > 0,

V_a(i, m) = ∫_{X_(i)}^{X_(i+1)} x^a Ĉ^m(x) Ĉ'(x) dx.

Similarly,

Ê{X_(r)²} = {B(r, n+1-r)}^{-1} Σ_{i=0}^n Σ_{j=0}^{n-r} (-1)^j (n-r choose j) V_2(i, j+r-1).

For r < s, the joint density of X_(r) and X_(s) is

p_rs(x_r, x_s) = k_n(r,s) F^{r-1}(x_r) {F(x_s) - F(x_r)}^{s-r-1} {1 - F(x_s)}^{n-s} f(x_r) f(x_s)   for x_r < x_s,

where k_n(r,s) = {B(r, s-r) B(s, n+1-s)}^{-1}, and hence

E{X_(r) X_(s)} = ∫∫_{x_r < x_s} x_r x_s p_rs(x_r, x_s) dx_r dx_s.

With the same estimators as before, the inner integral over x_r (taken from X_(0) up to x_s) is evaluated piecewise in terms of the quantities V_1(i, m), and binomial expansions of {Ĉ(x_s) - Ĉ(x_r)}^{s-r-1} and {1 - Ĉ(x_s)}^{n-s} then reduce the double integral to a triple sum,

Ê{X_(r) X_(s)} = k_n(r,s) Σ_{i=0}^n Σ_{j=0}^{s-r-1} Σ_{k=0}^{n-s} (-1)^{k+s-r-1-j} (s-r-1 choose j) (n-s choose k) [ ... ],

in which each bracketed term is a combination of V_1(i, j+k), Δ_i and the powers {i/(n+1)}^{s-j} and {(i+1)/(n+1)}^{s-j} arising from the segment of Ĉ on [X_(i), X_(i+1)].

Integrating by parts the expressions for V_1 and V_2, we obtain

V_1(i, m) = {1/(m+1)} [ X_(i+1) {(i+1)/(n+1)}^{m+1} - X_(i) {i/(n+1)}^{m+1} - {(n+1) Δ_i/(m+2)} ( {(i+1)/(n+1)}^{m+2} - {i/(n+1)}^{m+2} ) ].

We now have all we need to implement the method in a computer simulation.
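A condensed sketch (my own; it follows the formulas above but is not the thesis's actual program) of the piecewise-linear Ĉ and the estimate of E{X_(r)} built from the V_1(i, m) integrals:

    import math

    def expected_order_stat(xs_ext, r):
        # xs_ext = [X_(0), X_(1), ..., X_(n), X_(n+1)] including the extrapolated endpoints.
        n = len(xs_ext) - 2

        def v1(i, m):
            # V_1(i, m) = integral of x * C^m(x) * C'(x) over [X_(i), X_(i+1)] (closed form above).
            lo, hi = i / (n + 1), (i + 1) / (n + 1)
            delta = xs_ext[i + 1] - xs_ext[i]
            return (xs_ext[i + 1] * hi ** (m + 1) - xs_ext[i] * lo ** (m + 1)
                    - (n + 1) * delta / (m + 2) * (hi ** (m + 2) - lo ** (m + 2))) / (m + 1)

        beta = math.gamma(r) * math.gamma(n + 1 - r) / math.gamma(n + 1)   # B(r, n+1-r)
        total = 0.0
        for i in range(n + 1):
            for j in range(n - r + 1):
                total += (-1) ** j * math.comb(n - r, j) * v1(i, j + r - 1)
        return total / beta

    # Sanity check: data taken as the expected U[0,1] order statistics for n = 5, endpoints 0 and 1,
    # so C becomes the U[0,1] distribution function and the estimates should be close to r/(n+1).
    n = 5
    data = [0.0] + [i / (n + 1) for i in range(1, n + 1)] + [1.0]
    print([round(expected_order_stat(data, r), 4) for r in range(1, n + 1)])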

Preliminary simulations revealed an interesting technical point. The relative efficiency of the new estimator seemed to decrease with increasing sample size, and the coefficients for larger sample sizes appeared to diverge from rather than converge to the theoretically correct values. When the data consisted of the expected values of the order statistics from the Uniform distribution, the method produced the correct coefficients (1/2, 0, ..., 0, 1/2) for small sample size n, but for moderate n showed apparently random errors which quickly became unacceptably large. The cause was found in the estimating equations for E{X_(r)} and E{X_(r) X_(s)}, which involved the difference of two numbers which were almost equal. The problem grew rapidly with sample size and soon exceeded the computer's maximum accuracy, even in double precision (16 decimal digits) mode. Our solution for large samples was to restrict the sample size for which the covariance matrix was estimated. To this end the idea of a "hypothetical" sample from the estimated distribution function was introduced. Instead of estimating the covariance matrix for a sample of size n, we estimate the covariance matrix for a sample of size z using multiple samples of size z from the estimated distribution function. Thus we are also forced to replace the original sample of size n by a hypothetical data set of size z which consists of estimates of the expected values of the order statistics from a sample of size z. Choosing z smaller than n was originally proposed to get around the accuracy limitations mentioned above, but allowing z to be arbitrary was later found to have great theoretical importance.

In our approach, the only use we are making of the original data is to estimate the distribution function. Determining the properties of statistics by sampling from an estimated distribution function is essentially Efron's [7] bootstrap method. Here, in preference to the usual empirical distribution function, we use the more tractable Ĉ(x), which enables us to estimate means and covariances of order statistics directly from the defining expressions without resorting to repeated sampling.

To evaluate the new method of estimating covariances, we used artificial data consisting of the expected values of the order statistics for several distributions. The estimated variances of the order statistics were compared to the known values. For moderate distributions, such as the Normal and Logistic, it was found that the variances of the central order statistics were slightly overestimated, but as we moved away from the central order statistics the amount of overestimation gradually increased. This was considered to be a good feature, for it would tend to emphasise the central order statistics at the expense of the extreme ones, thus contributing towards robustness at a small cost in efficiency. However, the rate of increase in overestimation of variance decreased as we neared the extreme order statistics and became negative at the second and second last order statistics, which thus had their variances only slightly overestimated. The first and last order statistics actually had their variances underestimated by 50% to 70%. This characteristic became more pronounced for more extreme distributions such as the Laplace, and very serious for the Cauchy. Table 5 gives some examples of percentage overestimation of the variance of X_(i).

Table 5: Percentage overestimation of the variance of X_(i)

   i    Normal (n=20)   Laplace (n=19)   Cauchy (n=20)
   1        -56.29          -68.31          -100.00
   2          4.25            6.04          -100.00
   3         26.37           46.54            -1.07
   4         25.43           48.71           162.11
   5         19.47           38.85           233.87
   6         14.75           30.95           198.97
   7         11.97           27.26           132.76
   8         10.26           26.97            82.73

The reason for this behaviour is clearly related to information density: near the centre of a distribution, the sample data are relatively plentiful and therefore the estimation accuracy is high. Near the tails, the data are relatively scarce, making estimation hazardous. The solution is to trim the sample at the extremes so that the faulty variance estimates do not affect the estimator. The amount of trim should depend on how extreme the distribution is; for the Normal, a trim of two at each end is sufficient, but for the Cauchy a trim of three or four would be preferred.

The improved estimator can now be written

\hat{\mu} = \sum_{i=t+1}^{z-t} c_i^{*} \, \hat{E}_z\{ X_{(i)} \}          (17)

where t is the trim amount (0, 1, 2, 3 or 4), z is the hypothetical sample size and Ê_z{X_(j)} is the estimated expected value of the j th order statistic of a sample of size z from the estimated distribution function of the original sample.

This is strongly reminiscent of Takeuchi's [21] idea, which, using a different initial approach, comes to basically the same formula. However, the present method has several theoretical advantages. The hypothetical sample size in Takeuchi's method should be small in relation to n, while z above is arbitrary. Takeuchi's method does not allow for any trimming; this could be a liability for the more extreme distributions. The main disadvantage of Takeuchi's method is that it is fixed, whereas the above approach leaves much scope for fine tuning, with many choices to be made optimally.

The new estimator used in the simulation study to follow had the basic form (17), with some enhancements meant to improve its performance at and near the Normal (and consequently possibly poorer performance for more extreme distributions). The enhancements were:

(1) trim was taken to be two for all distributions and sample sizes. (A larger trim for the more extreme distributions would have improved the performance of the estimator.)

(2) hypothetical sample size z was chosen as large as computationally possible. (On the VAX 11/750 minicomputer with 16 decimal digits of accuracy in double precision, this was found to be 17.)

(3) the optimal coefficients were adjusted to be nonnegative by setting to zero all negative coefficient estimates and rescaling the rest to sum to one. For some extreme distributions, such as the U-shaped distribution f(x) = 1.5x², |x| ≤ 1, which has low kurtosis, and the Cauchy, which has high kurtosis, some optimal coefficients really are negative, so this procedure loses some efficiency there, but hopefully gains at and near the Normal.

(4) for the Normal, Logistic and Cauchy distributions, it is found that the optimal coefficients increase from the outer to the inner order statistics, which is to be expected since the inner order statistics have smaller variances.

When we consider optimal trimmed estimators, the same pattern occurs, except that for the Normal and the Logistic the nonzero coefficients of the most extreme order statistics are much larger than their neighbours;

thus, if the trim is 2, c_3 is much larger than c_4. In order to gain further efficiency at the Normal and the Logistic, this structure was imposed upon the estimated coefficients. If a* were the original estimated coefficients, b* were these adjusted for negativity, and m = [(z+1)/2], then nothing further was done if b_4* ≤ b_5* ≤ ... ≤ b_m*. If the sequence of coefficients was not monotonic increasing, then it was made monotonic increasing first with respect to b_4*, then b_5*, and so on, giving (m-3) monotonic sequences c, each of which was then normalised to sum to one. From these, the one "closest" to b* (distance = Σ_i |c_i - b_i*|) was taken as the final coefficient estimate. This turned out to have only small effects on efficiency but was retained mainly for aesthetic reasons.
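One simplified reading of the adjustments in (3) and of the monotonicity step above is sketched below; the function name is ours, and a single cumulative-maximum pass stands in for the full family of (m-3) candidate monotone sequences described in the text.

```python
import numpy as np

def adjust_coefficients(c_raw):
    """Simplified sketch of the coefficient adjustment: (i) negative estimates are
    set to zero and the rest rescaled to sum to one; (ii) the retained coefficients
    are forced to be monotone nondecreasing from the outside inwards by one forward
    cumulative-maximum pass, mirrored so that c_i = c_{z+1-i}; (iii) renormalise."""
    b = np.clip(np.asarray(c_raw, dtype=float), 0.0, None)
    b /= b.sum()                                  # nonnegative, summing to one
    half = len(b) // 2 + len(b) % 2               # adjust the lower half, mirror the rest
    lower = np.maximum.accumulate(b[:half])       # monotone nondecreasing inwards
    c = np.concatenate([lower, lower[: len(b) - half][::-1]])
    return c / c.sum()

# illustrative raw coefficients for z = 9 with trim 2 already applied (zeros at the ends)
print(adjust_coefficients([0.0, 0.0, -0.02, 0.18, 0.30, 0.18, -0.02, 0.0, 0.0]))
```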

To test the robustness of the new estimator, which we can call BUQL for Best Unbiased Quasi-Linear, we ran a Monte Carlo simulation which compared the MSE's of BUQL and the best available alternatives. The Princeton Robustness Study used sixty-five estimators, from which we chose a subset. We wanted to include the best example of each estimation approach; in particular Takeuchi's method (#49), Johns' method (#51), Hampel's method (#33, recommended by the authors of the PRS as the best estimator overall) and Huber's method (#24 is a good representative). We then selected all estimators in the PRS which were better than the least efficient

of this group at the Normal and the Cauchy for sample sizes n = 20 and n = 40. This added the set consisting of #12 (JBT: 2-choice Jaeckel adaptive trim), #34 (ADA: adaptive Hampel), #46 (Skipped mean) and #47 (Skipped mean). However, #34 was omitted because #33 is also Hampel-type and #33 is regarded by the authors of the PRS to be superior. The skipped means #46 and #47 were omitted because their relative efficiencies at the Normal fall between n = 20 and n = 40, whereas all the other estimators gain relative efficiency. The final set of estimators used in the simulation were:

1 - #12   JBT    Jaeckel adaptive trim.
2 - #49   TAK    Takeuchi adaptive.
3 - #24   D07    One-step Huber.
4 - #33   12A    Hampel; a = 1.2, b = 3.5, c = 8.0.
5 - #51   JOH    Johns' adaptive.
6 -       BUQL   Best Unbiased Quasi-Linear; adaptive.

It was decided to use sample sizes 10, 20, 30 and 40 from the Normal, 10% Contaminated Normal, Logistic, Laplace and Cauchy distributions and to estimate the MSE of all six estimators by repetition over a large number of random samples. The random number generator used was the so-called HP-25 method, with u_{i+1} = fractional part of (π + u_i)^5. Normal variates were generated by the Box-Muller method; the 10% Contaminated Normal was 90% N(0,1) variates and 10% N(0,3²) variates; the others were generated by the Probability Integral Transform method. In each combination of sample size and distribution, the MSE's were transformed into percentage deficiencies, where, if MSE_min is the smallest MSE in the simulation, then the (relative) efficiency of the i th estimator is

MSE_min/MSE_i and the deficiency of the i th estimator is 1 - (efficiency)^{1/2}.

This transformation makes comparison easier because it considers ratios of standard errors, not MSE's. A 0% deficiency means the estimator was best for that combination of sample size and distribution; a 10% deficiency means a standard error roughly 11% larger than optimal; a 50% deficiency means a standard error twice that of the best estimator.
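For concreteness, a rough sketch of the variate generation and the deficiency transform described above is given below; the function names, the seed and the illustrative MSE values are ours, not taken from the thesis.

```python
import numpy as np

PI = np.pi

def hp25_uniforms(n, seed=0.5180339887):
    """U(0,1) stream via the 'HP-25 method': u_{i+1} = frac((pi + u_i)^5)."""
    u, out = seed, np.empty(n)
    for i in range(n):
        u = ((PI + u) ** 5) % 1.0
        out[i] = u
    return out

def box_muller(u1, u2):
    # standard Box-Muller transform: two independent uniforms -> two independent N(0,1)
    r = np.sqrt(-2.0 * np.log(u1))
    return r * np.cos(2 * PI * u2), r * np.sin(2 * PI * u2)

def contaminated_normal_sample(n):
    # 10% Contaminated Normal: 90% N(0,1) variates and 10% N(0, 3^2) variates
    u = hp25_uniforms(3 * n)
    z, _ = box_muller(u[:n], u[n:2 * n])
    scale = np.where(u[2 * n:] < 0.10, 3.0, 1.0)
    return scale * z

def percentage_deficiencies(mse):
    """Deficiency transform: efficiency_i = MSE_min / MSE_i,
    deficiency_i = 1 - sqrt(efficiency_i), reported as a percentage."""
    mse = np.asarray(mse, dtype=float)
    return 100.0 * (1.0 - np.sqrt(mse.min() / mse))

# e.g. MSEs of the six estimators in one cell of the study (illustrative numbers only)
print(percentage_deficiencies([0.112, 0.105, 0.118, 0.121, 0.117, 0.106]))
```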

A simulation study over the whole range of combinations was run using 1000 samples at each combination. When the ranking of the estimators was compared with the ranking in the PRS, there were a few minor differences, so the whole simulation was repeated using 2000 samples at each combination. This time the rankings corresponded exactly and the standardised deficiencies (defined below) were fairly close. This suggested that it would be useful to do an analysis of the rankings as well as the deficiencies.

Table 6 is a listing of the computer output, tabulating deficiencies by estimator, sample size and distribution type. Since the deficiencies generally decrease with sample size, but increase as the distribution becomes more extreme, we standardise by dividing each simulation (column) by its average. This enables us to make fair comparisons between sample sizes and between distributions, since each deficiency is now expressed as a multiple of its simulation average. Table 7 is a listing of the deficiencies in Table 6, standardised by simulation combination. Since what is really important is the performance of the estimators relative to one another, we evaluate the rankings in each simulation and these are listed in Table 8. To enable us to draw conclusions, these results are summarised in Tables 9 and 10. The standardised deficiencies are averaged over the five distributions in Table 9 and averaged over the four sample sizes in Table 10.
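A minimal sketch of the standardisation and ranking used to pass from Table 6 to Tables 7 and 8 (assuming the deficiencies are held in a matrix with one column per simulation combination; ties are broken arbitrarily, which the thesis does not discuss):

```python
import numpy as np

def standardise_and_rank(deficiencies):
    """Rows are estimators, columns are simulation combinations (sample size x
    distribution).  Standardise each column by its average (as in Table 7) and
    rank the estimators within each column, 1 = best (as in Table 8)."""
    d = np.asarray(deficiencies, dtype=float)
    standardised = d / d.mean(axis=0, keepdims=True)
    ranks = d.argsort(axis=0).argsort(axis=0) + 1
    return standardised, ranks
```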

[Graph: average standardised deficiency (Table 9) plotted against sample size, n = 10, 20, 30, 40, for the six estimators. Key: 1 = JBT, 2 = TAK, 3 = D07, 4 = 12A, 5 = JOH, 6 = BUQL.]

The overall performance of the estimators is summarised in the overall average standardised deficiency; BUQL is best with 0.65, JBT is second with 0.83 and JOH is worst with 1.42. However, these figures hide significant differences in behaviour. From Table 10 we can see that JBT, TAK and BUQL perform better at the more moderate distributions, whereas 12A and JOH perform better at the more extreme distributions. The relative behaviour of the estimators as shown in Table 9 is illustrated in the graph above. It can be seen that BUQL is uniformly best except for sample size 10. JBT and D07 start well, but become steadily worse for larger samples; JOH starts very badly, but rapidly improves. TAK and 12A have no definite trend; they are both slightly worse than average. From the graph, a reasonable recommendation would be to use D07 (one-step Huber) for sample sizes up to 15, BUQL between 15 and 50, and either BUQL or JOH over 50.

We may also consider the rankings; the important features here are the number of times that an estimator is best or second best, and last or second last. Table 11, which gives the frequency of fifth or sixth placing, is most important, since robustness may be thought of as avoidance of relatively bad behaviour over all distributions and sample sizes. BUQL is only worst or second worst once (for sample size 10 at the Cauchy); all of the others are worst or second worst at least five times. BUQL is also best or second best most often (Table 12).

Thus our results show that BUQL is the preferred estimator and that it would be vastly superior to the others if not for its unfortunate performance at the Cauchy for sample size 10. Since BUQL is an adaptive estimator, its efficiency would be expected to suffer when the distribution is extreme and the sample size is small. (TAK, the adaptive estimator most closely related to BUQL, also does very badly at the Cauchy for sample size 10.)

Table 6: % deficiencies of estimators in simulation

            Normal                           10% C.N.
        n=10   n=20   n=30   n=40        n=10   n=20   n=30   n=40
 JBT    1.29   2.37   2.71   3.77         .49    .63    .02    .66
 TAK     .00    .00    .00    .00         .64    .10    .16    .00
 D07    2.86   4.74   4.97   6.61         .00    .95    .95   2.24
 12A    7.34   6.02   6.49   7.28        2.15   2.05   1.29   2.09
 JOH    6.12   3.49   3.80   4.22        3.55   2.57   2.49   3.36
 BUQL    .90    .98   1.30   1.98        1.00    .00    .00    .60

            Logistic                         Laplace
        n=10   n=20   n=30   n=40        n=10   n=20   n=30   n=40
 JBT     .21    .64    .57    .38        4.50   2.33   4.48   2.60
 TAK    1.15   2.70   1.02   1.67        8.32   6.42   8.33   4.60
 D07     .00    .00    .71    .45        1.71    .14   2.25   1.08
 12A    3.06    .70   2.40   1.69         .00    .00    .00    .35
 JOH    4.39   2.41   2.10   1.21        6.92   4.62   1.53    .00
 BUQL    .76    .17    .00    .00        6.67   2.14   3.20   1.34

            Cauchy
        n=10   n=20   n=30   n=40
 JBT   23.95  12.61  12.49   7.16
 TAK   36.88  14.24   8.68   3.94
 D07   23.32  12.42  14.13   9.43
 12A     .00    .00   1.33    .29
 JOH    8.36   1.74    .00    .00
 BUQL  50.16   9.50   8.32   6.03

Table 7: Standardised deficiencies (from Table 6)

            Normal                           10% C.N.
        n=10   n=20   n=30   n=40        n=10   n=20   n=30   n=40
 JBT     .42    .81    .84    .95         .38    .60    .02    .44
 TAK     .00    .00    .00    .00         .49    .10    .20    .00
 D07     .93   1.62   1.55   1.66         .00    .90   1.16   1.50
 12A    2.38   2.05   2.02   1.83        1.65   1.95   1.58   1.40
 JOH    1.98   1.19   1.18   1.06        2.72   2.45   3.04   2.25
 BUQL    .29    .33    .40    .50         .77    .00    .00    .40

            Logistic                         Laplace
        n=10   n=20   n=30   n=40        n=10   n=20   n=30   n=40
 JBT     .13    .58    .50    .42         .96    .89   1.36   1.56
 TAK     .72   2.45    .90   1.86        1.78   2.46   2.53   2.77
 D07     .00    .00    .63    .50         .36    .05    .68    .65
 12A    1.92    .63   2.12   1.88         .00    .00    .00    .21
 JOH    2.75   2.18   1.85   1.34        1.48   1.77    .46    .00
 BUQL    .48    .15    .00    .00        1.42    .82    .97    .81

            Cauchy
        n=10   n=20   n=30   n=40
 JBT    1.01   1.49   1.67   1.60
 TAK    1.55   1.69   1.16    .88
 D07     .98   1.47   1.89   2.11
 12A     .00    .00    .18    .06
 JOH     .35    .21    .00    .00
 BUQL   2.11   1.14   1.11   1.35

Table 8: Ranking of estimators in simulation

            Normal                    10% C.N.
        n=10  n=20  n=30  n=40    n=10  n=20  n=30  n=40
 JBT      3     3     3     3       2     3     2     3
 TAK      1     1     1     1       3     2     3     1
 D07      4     5     5     5       1     4     4     5
 12A      6     6     6     6       5     5     5     4
 JOH      5     4     4     4       6     6     6     6
 BUQL     2     2     2     2       4     1     1     2

            Logistic                  Laplace
        n=10  n=20  n=30  n=40    n=10  n=20  n=30  n=40
 JBT      2     3     2     2       3     4     5     5
 TAK      4     6     4     5       6     6     6     6
 D07      1     1     3     3       2     2     3     3
 12A      5     4     6     6       1     1     1     2
 JOH      6     5     5     4       5     5     2     1
 BUQL     3     2     1     1       4     3     4     4

            Cauchy
        n=10  n=20  n=30  n=40
 JBT      4     5     5     5
 TAK      5     6     4     3
 D07      3     4     6     6
 12A      1     1     2     2
 JOH      2     2     1     1
 BUQL     6     3     3     4

Table 9: Average standardised deficiencies (over distributions)

                                       Overall
        n=10   n=20   n=30   n=40      average
 JBT     .58    .87    .88    .99        .83
 TAK     .91   1.34    .96   1.10       1.08
 D07     .45    .81   1.18   1.28        .93
 12A    1.19    .93   1.18   1.08       1.10
 JOH    1.86   1.56   1.31    .93       1.42
 BUQL   1.01    .49    .50    .61        .65

Table 10: Average standardised deficiencies (over sample sizes)

        Normal   10% C.N.   Logistic   Laplace   Cauchy
 JBT      .76      .36        .41        1.19      1.44
 TAK      .00      .20       1.48        2.39      1.32
 D07     1.44      .89        .28         .44      1.61
 12A     2.07     1.65       1.64         .05       .06
 JOH     1.35     2.62       2.03         .93       .14
 BUQL     .38      .29        .16        1.01      1.43

Table 11: Number of fifth or sixth rankings

        n=10  n=20  n=30  n=40  Overall
 JBT      0     1     2     2      5
 TAK      2     3     1     2      8
 D07      0     1     2     3      6
 12A      3     2     3     2     10
 JOH      4     3     2     1     10
 BUQL     1     0     0     0      1

Table 12: Number of first or second rankings

        n=10  n=20  n=30  n=40  Overall
 JBT      2     0     2     1      5
 TAK      1     2     1     2      6
 D07      3     2     0     0      5
 12A      2     2     2     2      8
 JOH      1     1     2     2      6
 BUQL     1     3     3     3     10

By way of contrast, 12A and JOH, estimators which assume the worst and guard against it, do relatively better at extreme distributions for small sample sizes. However, unlike TAK, BUQL can be improved.

There are many subjective choices one can make while deriving BUQL. In this study they were all made to suit the requirements of the computer simulation that followed. The sample distribution function Ĉ(x) was chosen because

(a) among the set of possible estimators of F(x) which had nontrivial derivatives almost everywhere, it required the least computer time to evaluate,
(b) it was piecewise linear, thus making many integrals involving F(x) more mathematically tractable,
(c) its inverse was also piecewise linear, therefore also easy to manipulate theoretically and computationally.

Its disadvantages were that

(a) it was not robust for F(x), being sensitive even in large samples to local fluctuations,
(b) it required the assumption of a finite support for F(x) and hence the need to estimate the endpoints α and θ, even when they are infinite, which is especially difficult for extreme distributions such as the Cauchy.

Therefore it is not unreasonable to suggest that the use of a better estimator of F(x) than Ĉ(x) (especially one more robust to outliers) might somewhat improve the performance of BUQL.

The trim for BUQL used in the simulation was invariably two, in the hope of boosting efficiency at the Normal. However, the simulation results showed that BUQL is naturally very efficient at the Normal, and using a trim of three or four might benefit its efficiency at the Cauchy without unduly affecting its efficiency at the Normal.

The theoretically correct method of selecting the optimal coefficients is to perform a constrained minimisation as follows:

minimise   MSE\left( \sum_{i=t+1}^{z-t} c_i \hat{E}_z\{ X_{(i)} \} \right)

subject to   c_{z+1-i} = c_i   and   \sum_{i=t+1}^{z-t} c_i = 1 .

In the simulation, only an approximation to this minimisation procedure was used in order to save computer time.
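A sketch of the exact constrained minimisation is given below, under the usual reduction that a symmetric, sum-to-one combination is unbiased for the centre of a symmetric distribution, so that minimising MSE amounts to minimising the quadratic form c'Vc. Here V stands for an estimated covariance matrix of the z order statistics of the hypothetical sample (for instance as computed by the moment sketch earlier); the function name and 0-based indexing are ours.

```python
import numpy as np

def symmetric_min_variance_weights(V, t):
    """Minimise c'Vc over c_{t+1},...,c_{z-t} subject to c_{z+1-i} = c_i and
    sum c_i = 1 (trimmed coefficients are fixed at zero)."""
    z = V.shape[0]
    keep = np.arange(t, z - t)                 # retained order statistics (0-based)
    Vk = V[np.ix_(keep, keep)]
    k = len(keep)
    npairs = (k + 1) // 2                      # free parameters after imposing symmetry
    A = np.zeros((k, npairs))                  # maps pair-weights to the full symmetric vector
    for j in range(npairs):
        A[j, j] = 1.0
        A[k - 1 - j, j] = 1.0
    M = A.T @ Vk @ A                           # objective becomes b' M b
    ones = A.T @ np.ones(k)                    # constraint becomes ones' b = 1
    b = np.linalg.solve(M, ones)               # Lagrangian solution up to scale
    b /= ones @ b
    c = np.zeros(z)
    c[keep] = A @ b
    return c

# the estimator (17) is then mu_hat = c @ Ez_means, where Ez_means holds the
# estimated expected order statistics of the hypothetical sample
```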

Finally, the hypothetical sample size was restricted to z = 17 for reasons of computational accuracy. This restriction was investigated in detail and the results showed that large increases in efficiency are possible if larger hypothetical samples are used. The same simulations were run using the Cauchy distribution with 2000 samples, trim of two at each end, sample sizes n = 10 and n = 20 and hypothetical sample sizes z = 5, z = 9, z = 13 and z = 17. The results, expressed as percentage deficiencies, are given in Table 13 below.

Table 13: % deficiencies for BUQL for various actual and hypothetical sample sizes

                         z
   n        5        9       13       17
  10     92.23    76.23    48.29    38.16
  20     78.42    24.71    11.93     9.24

In each case the improvements are dramatic and indicate much scope for further improvement. Similar results were encountered at the other

distributions, which showed the same trend but with less dramatic improvements. Overall it is clear that much efficiency can be gained by increasing the hypothetical sample size, especially when the actual sample size is small.

In summary, we have found that BUQL is the most robust estimator available. Its general performance can be improved by some of the above suggestions, especially for small sample sizes from the extreme distributions. Detailed examination of these potential improvements became unnecessary because of the development of the "super-estimator", which we will now introduce.

A linear estimator of µ can be written as

\hat{\mu} = \sum_{i=1}^{n} c_i X_{(i)}

When n is even and c_i = c_{n+1-i} ,

\hat{\mu} = \sum_{i=1}^{n/2} 2 c_i \, \frac{ X_{(i)} + X_{(n+1-i)} }{2}

Now, (X_{(i)} + X_{(n+1-i)})/2 is a LT for µ, since its expected value is µ, and hence µ̂ is just a compound LT for µ. We may also rewrite µ̂ as

\hat{\mu} = \sum_{i=1}^{m} a_i \hat{\mu}_{(i)}

where m = n/2, a_i = 2 c_i , and

\hat{\mu}_{(i)} = \frac{ X_{(i)} + X_{(n+1-i)} }{2}

This emphasises the fact that the estimator µ̂ is a linear combination of several simpler estimators µ̂_(i). The next step is to generalise to

"'* m "' µ = l a. µ. (18) i=l 1 1 where m need not be n/2 , {ai} are arbitrary constants and {µi}"' are any LT's or estimators for µ. In particular, a sensible choice of the {µi} would be the five estimators from the PRS used in the simulation; that is, JBT, TAK, 007, 12A and JOH. Because this new estimator usesas components not raw data but several sophisticated estimators, it may be labelled a "super-estimator" . This idea is essentially identical to the idea of a compound LT in parametric MMSEE. The optimal coefficients may be found as before: v- 1 1 a * = , - - where V is the covariance matrix of the vector of estimators There is no easy analytical way to find estimates for the components of V as we did before. It is necessary to run a sub-simulation of substantial size, drawing many samples of size z from the original sample distribution function and to estimate the required covariance components from their average values in the sub- simulation. As was necessary with BUQL, we replace µi"' in (18) by Ez{µi}"' "' , which is found from the sub-simulation.

It was intended to test the super-estimator in the same way that BUQL was tested, but it was found that enormous amounts of computer time were required, even for small simulations. Nevertheless, some small simulations (100 samples per simulation rather than 2000) showed that the super-estimator behaved much like BUQL, except that it was never among the worst behaved, even at the Cauchy, and that, as with BUQL, the efficiency of the super-estimator improved greatly for larger hypothetical sample sizes.

The limited simulation results suggest that the super-estimator is not strongly influenced by the worst features of its component estimators, and this is what is required of a robust estimator.

Our super-estimator is related to Switzer's [20] estimator, which arises from the idea "... that the sample itself should be used to distinguish which one of several competing estimators is most efficient for the unknown f from which the sample was drawn. To be able to use the sample in this way requires that the competing estimators be such that their standard errors can also be estimated without making use of the unknown shape f."

There are other improvements which may be made to the super-estimator apart from larger hypothetical sample sizes and more robust estimators for F(x). We should include among our component estimators not just the most robust estimators for the range of distributions from the Normal to the Cauchy, but also those such as the sample midrange which are good for distributions with low kurtosis, and even parametric estimators for all the common distributions. In short, we should average over as many unbiased estimators of the centre of a symmetric distribution as possible, since each may have some useful features. The sizes of the coefficients of the estimators are determined by their covariances, which are estimated using a sub-simulation from the estimated distribution function.

By applying the method of MMSEE to a super-estimator with as many component estimators as possible, we greatly reduce the incidence of extreme behaviour and this is the aim of robust methods.

The estimation of the centre of a symmetric distribution is not the only distribution-free problem which we can solve using the method

of MMSEE. We may write the model for the previous problem as

Y_i = \mu + \varepsilon_i ; \qquad i = 1, 2, \ldots, n

where Y_i is a data value, µ is the centre parameter and ε_i is

an error term with distribution symmetric about zero. By giving µ a different structure, we may solve more complex statistical problems.

If we let µ_i = α + β x_i, where α and β are unknown parameters and {x_i} is a set of known constants, we obtain the simple linear regression problem.

There are numerous estimators for α and β, including Least Squares, Huber's M-estimators, slice regression, standardised sum and difference, bivariate trimming, etc. We may treat each one of these estimators as a LT for the corresponding parameter and construct a "super-estimator" for each parameter as a linear combination of the LT's, just as we did for the super-estimator of µ in the original problem:

\hat{\alpha}^{*} = \sum_{i=1}^{m} c_i \hat{\alpha}_i ; \qquad \hat{\beta}^{*} = \sum_{i=1}^{m} d_i \hat{\beta}_i          (19)

The optimal values for c and d are, as before,

c^{*} = \frac{ V^{-1} 1 }{ 1^{T} V^{-1} 1 } \qquad and \qquad d^{*} = \frac{ W^{-1} 1 }{ 1^{T} W^{-1} 1 }          (20)

where V is the covariance matrix of the vector of estimators (α̂_1, α̂_2, ..., α̂_m) and W is the covariance matrix of (β̂_1, β̂_2, ..., β̂_m).

The estimation procedure for the components of V and W is rather more complex than before. Starting with robust estimators, say

Huber's M-estimators α̂ and β̂ of α and β, we compute the residuals

\hat{\varepsilon}_i = Y_i - ( \hat{\alpha} + \hat{\beta} x_i ) ; \qquad i = 1, 2, \ldots, n .

We now use these residuals to estimate the distribution function F(ε) of the errors {ε_i}, using Ĉ as before. We may then draw many samples of size n from this distribution estimate and use the values of the coefficient estimates computed from these samples to estimate

Cov{α̂_i, α̂_j} and Cov{β̂_i, β̂_j} in the usual empirical way. This is essentially Efron's bootstrap procedure ([7], pp. 17-18), but using an improved distribution estimator, and applied to find the joint sampling distribution of several estimators; that is, a multi-estimator bootstrap. The estimates of V and W may be used in (20) to estimate c* and d*, which may in turn be used in (19) as weightings for the original parameter estimates from the sample, (α̂_1, α̂_2, ..., α̂_m) and (β̂_1, β̂_2, ..., β̂_m), to produce optimal estimates α̂* and β̂*. Finally, these optimal estimates can be used to find improved residual estimates

\hat{\varepsilon}_i = Y_i - ( \hat{\alpha}^{*} + \hat{\beta}^{*} x_i ) ; \qquad i = 1, 2, \ldots, n ,

and this process repeated until convergence occurs.
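The following sketch illustrates one cycle of this scheme under simplifying assumptions: ordinary resampling of residuals stands in for sampling from the smoother estimate of F(ε), the list of component estimators is supplied by the user, and all names are illustrative rather than the thesis's own code.

```python
import numpy as np

def regression_super_estimator(x, y, estimators, n_boot=300, n_iter=3, rng=None):
    """Multi-estimator bootstrap for simple linear regression.  `estimators` is a
    list of functions (x, y) -> (alpha_hat, beta_hat), e.g. least squares and
    several robust fits.  Residuals from a current fit are resampled, the
    covariance matrices V and W of the component estimates are formed, and the
    weights (20) are applied as in (19); the cycle is repeated a few times."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = len(estimators), len(y)

    def weights(cov):
        w = np.linalg.solve(cov, np.ones(m))
        return w / w.sum()                      # V^{-1}1 / (1'V^{-1}1)

    alpha, beta = estimators[0](x, y)           # initial fit (ideally a robust one)
    for _ in range(n_iter):
        resid = y - (alpha + beta * x)
        ab = np.empty((n_boot, m, 2))
        for b in range(n_boot):
            y_star = alpha + beta * x + rng.choice(resid, size=n, replace=True)
            ab[b] = [est(x, y_star) for est in estimators]
        V = np.cov(ab[:, :, 0], rowvar=False)   # covariances of the alpha estimates
        W = np.cov(ab[:, :, 1], rowvar=False)   # covariances of the beta estimates
        a_hat = np.array([est(x, y)[0] for est in estimators])
        b_hat = np.array([est(x, y)[1] for est in estimators])
        alpha = weights(V) @ a_hat              # alpha* as in (19)
        beta = weights(W) @ b_hat               # beta*  as in (19)
    return alpha, beta
```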

This technique is easily generalised to multiple regression and other similar problems.

Essentially, this is not a new estimation method, but an optimal way to choose among the many existing estimators, adapting to the error structure indicated by the data.

Appendix: Estimation of Endpoint Parameters

The endpoint parameters of a distribution are the bounds (lower and upper) which delimit the range of the random variable defined by the distribution. If the form of the distribution is not known, it is not unreasonable to assume that its endpoint parameters are finite and can be estimated from a random sample from the distribution. The interval [α, θ] is sometimes referred to as the support of the distribution function F(x), since outside this range the density f(x) is assumed to be identically zero.

The method of MMSEE suggests that a reasonable form for an estimator of an endpoint parameter is a linear combination of order statistics, and indeed all previous work in this area centres on the choice of the appropriate coefficients for this linear combination (Robson and Whitlock [18], Cooke [4]). The present work is, fundamentally, an improvement on Cooke's method, using a different estimate for F(x).

MMSEE suggests that we should estimate α or θ by the appropriate extreme order statistic, minus an estimate of its bias.

Integrating by parts, we have

E\{ X_{(n)} \} = \int_{\alpha}^{\theta} x \, n F^{n-1}(x) f(x) \, dx

             = \theta - \int_{\alpha}^{\theta} F^{n}(x) \, dx          (A-1)

Cooke [4] uses the usual empirical distribution function as an estimator for F(x), but it is clear from Read [17] that the following estimator Ĉ(x) is a better, albeit more complicated, estimator:

for i = 0, 1, 2, ..., n, let Δ_i = X_(i+1) - X_(i), where X_(0) and X_(n+1) are defined to be α and θ respectively; then

\hat{C}(x) = \frac{1}{n+1} \left( i + \frac{ x - X_{(i)} }{ \Delta_i } \right) , \qquad X_{(i)} \le x \le X_{(i+1)} .
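For reference, a minimal construction of Ĉ(x) and its piecewise-linear inverse is sketched below, assuming estimates of the endpoints α and θ are available (they are derived at the end of this Appendix) and lie outside the range of the data; the function name is ours.

```python
import numpy as np

def make_C_hat(x, alpha_hat, theta_hat):
    """Piecewise-linear estimated distribution function C_hat and its inverse,
    built as defined above: knots at alpha_hat, X_(1), ..., X_(n), theta_hat,
    with C_hat values 0, 1/(n+1), ..., n/(n+1), 1.  Assumes alpha_hat < X_(1)
    and theta_hat > X_(n) so that the knots are strictly increasing."""
    xs = np.sort(np.asarray(x, dtype=float))
    knots = np.concatenate(([alpha_hat], xs, [theta_hat]))
    probs = np.arange(len(xs) + 2) / (len(xs) + 1)
    C_hat = lambda t: np.interp(t, knots, probs)       # linear on each [X_(i), X_(i+1)]
    C_hat_inv = lambda u: np.interp(u, probs, knots)   # piecewise-linear inverse
    return C_hat, C_hat_inv
```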

Substituting Ĉ(x) for F(x) in (A-1) and simplifying, we obtain

h(1)\,\hat{\alpha} + h(n)\,\hat{\theta} = X_{(n)} - S_n          (A-2)

where h(i) = \left( \frac{i}{n+1} \right)^{n+1} ,  a(i) = h(i+1) - 2h(i) + h(i-1)

and S_n = \sum_{i=1}^{n} a(i) X_{(i)} .

We now follow a similar procedure for α to obtain

h(n)\,\hat{\alpha} + h(1)\,\hat{\theta} = X_{(1)} - S_1          (A-3)

where S_1 = \sum_{i=1}^{n} a(i) X_{(n+1-i)} .

Solving (A-2) and (A-3) simultaneously for the estimates of α and θ, we have

\hat{\alpha} = \frac{ h(n) ( X_{(1)} - S_1 ) - h(1) ( X_{(n)} - S_n ) }{ h^{2}(n) - h^{2}(1) } ,

\hat{\theta} = \frac{ h(n) ( X_{(n)} - S_n ) - h(1) ( X_{(1)} - S_1 ) }{ h^{2}(n) - h^{2}(1) } .

It is easy to show that these estimators are location-invariant and, in the case where X has a Uniform distribution, they are unbiased.
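A direct implementation of these estimators, using the reconstruction of h(i) given above, might look as follows (the function name and the Uniform check are ours):

```python
import numpy as np

def endpoint_estimates(x):
    """Endpoint estimates from (A-2) and (A-3), with h(i) = [i/(n+1)]^(n+1),
    a(i) = h(i+1) - 2h(i) + h(i-1), S_n = sum a(i) X_(i), S_1 = sum a(i) X_(n+1-i)."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)
    h = (np.arange(n + 2) / (n + 1)) ** (n + 1)   # h(0), ..., h(n+1)
    a = h[2:] - 2 * h[1:-1] + h[:-2]              # a(1), ..., a(n)
    S_n = a @ xs                                  # sum_i a(i) X_(i)
    S_1 = a @ xs[::-1]                            # sum_i a(i) X_(n+1-i)
    det = h[n] ** 2 - h[1] ** 2
    alpha_hat = (h[n] * (xs[0] - S_1) - h[1] * (xs[-1] - S_n)) / det
    theta_hat = (h[n] * (xs[-1] - S_n) - h[1] * (xs[0] - S_1)) / det
    return alpha_hat, theta_hat

# for a Uniform sample the estimates should sit close to the true endpoints,
# consistent with the unbiasedness remark above
print(endpoint_estimates(np.random.default_rng(1).uniform(2.0, 5.0, size=20)))
```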

Bibliography

[1] Andrews, D.F. et al. (1972). "Robust Estimates of Location: Survey and Advances". Princeton University Press, New Jersey.

[2] Arnold, B.C. (1970). "Inadmissibility of the Usual Scale Estimate for a Shifted Exponential Distribution". Journal of the American Statistical Association, Volume 65, pp. 1260-1264.

[3] Brewster, J.F. (1974). "Alternative estimators for the scale parameter of the Exponential distribution with unknown location". The Annals of Statistics, Volume 2, pp. 553-557.

[4] Cooke, Peter (1979). "Statistical inference for bounds of random variables". Biometrika, Volume 66, pp. 367-374.

[5] Cox, D.R. and Hinkley, D.V. (1974). "Theoretical Statistics". Chapman and Hall, London, pp. 265-267.

[6] Drastik, V.C. (1982). "Minimum Mean Squared Error Estimation". Paper to the 6th Australian Statistical Conference, Melbourne University, August 1982.

[7] Efron, B. (1979). "Bootstrap Methods: Another Look at the Jackknife". The Annals of Statistics, Volume 7, pp. 1-26.

[8] Gerald, C.F. (1970). "Applied Numerical Analysis". Addison-Wesley, USA, pp. 94-110.

[9] Hogg, R.V. (1967). "Some observations on robust estimation". Journal of the American Statistical Association, Volume 62, pp. 1179-1186.

[10] Hogg, R.V. (1974). "Adaptive Robust Procedures: A Partial Review and Some Suggestions for Future Applications and Theory". Journal of the American Statistical Association, Volume 69, pp. 909-927.

[11] Johns, M.V., Jr. (1974). "Nonparametric estimation of location". Journal of the American Statistical Association, Volume 69, pp. 453-460.

[12] Kendall, M.G. and Stuart, A. (1961). "The Advanced Theory of Statistics". Volume 2, Third Edition, Griffin, London, p. 33, Exercise 17.16.

[13] Kendall, M.G. and Stuart, A. (1973). "The Advanced Theory of Statistics". Volume 2, Third Edition (3 volume edition), Griffin, London, pp. 21-22.

[14] Lloyd, E.H. (1952). "Least-squares estimation of location and scale parameters using order statistics". Biometrika, Volume 39, pp. 88-95.

[15] Markowitz, E. (1968). "Minimum mean-square-error estimation of the standard deviation of the Normal distribution". American Statistician, Volume 22, Number 3, p. 26.

[16] Pitman, E.J.G. (1938). "The estimation of the location and scale parameters of a continuous population of any given form". Biometrika, Volume 30, pp. 391-421.

[17] Read, P.R. (1972). "The asymptotic inadmissibility of the sample distribution function". The Annals of Mathematical Statistics, Volume 43, pp. 89-95.

[18] Robson, D.S. and Whitlock, J.H. (1964). "Estimation of a truncation point". Biometrika, Volume 51, pp. 33-39.

[19] Sarhan, A.E. (1954). "Estimation of the mean and standard deviation by order statistics". Annals of Mathematical Statistics, Volume 25, pp. 317-328.

[20] Switzer, P. (1972). "Efficiency Robustness of Estimators". Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, June-July 1970, Volume 1, pp. 283-291.

[21] Takeuchi, Kei (1971). "A Uniformly Asymptotically Efficient Estimator of a Location Parameter". Journal of the American Statistical Association, Volume 66, pp. 292-301.

[22] Thompson, J.R. (1968). "Some shrinkage techniques for estimating the mean". Journal of the American Statistical Association, Volume 63, pp. 113-122.