
A Stochastic Approximation Method

Herbert Robbins; Sutton Monro

The Annals of Mathematical Statistics, Vol. 22, No. 3 (Sep., 1951), pp. 400-407.

Stable URL: http://links.jstor.org/sici?sici=0003-4851%28195109%2922%3A3%3C400%3AASAM%3E2.0.CO%3B2-W

The Annals of Mathematical Statistics is currently published by the Institute of Mathematical Statistics.


A STOCHASTIC APPROXIMATION METHOD¹

University of North Carolina

1. Summary. Let M(x) denote the expected value at level x of the response to a certain experiment. M(x) is assumed to be a monotone function of x but is unknown to the experimenter, and it is desired to find the solution x = θ of the equation M(x) = α, where α is a given constant. We give a method for making successive observations at levels x_1, x_2, ... in such a way that x_n will tend to θ in probability.

2. Introduction. Let M(x) be a given function and α a given constant such that the equation

(1) M(x) = α

has a unique root x = θ. There are many methods for determining the value of θ by successive approximation. With any such method we begin by choosing one or more values x_1, ..., x_r more or less arbitrarily, and then successively obtain new values x_n as certain functions of the previously obtained x_1, ..., x_{n-1}, the values M(x_1), ..., M(x_{n-1}), and possibly those of the derivatives M'(x_1), ..., M'(x_{n-1}), etc. If

(2) lim_{n→∞} x_n = θ,

irrespective of the arbitrary initial values x_1, ..., x_r, then the method is effective for the particular function M(x) and value α. The speed of the convergence in (2) and the ease with which the x_n can be computed determine the practical utility of the method.

We consider a stochastic generalization of the above problem in which the nature of the function M(x) is unknown to the experimenter. Instead, we suppose that to each value x corresponds a random variable Y = Y(x) with distribution function Pr[Y(x) ≤ y] = H(y | x), such that

(3) M(x) = ∫ y dH(y | x)

is the expected value of Y for the given x. Neither the exact nature of H(y | x) nor that of M(x) is known to the experimenter, but it is assumed that equation (1) has a unique root θ, and it is desired to estimate θ by making successive observations on Y at levels x_1, x_2, ... determined sequentially in accordance with some definite experimental procedure. If (2) holds in probability irrespective of any arbitrary initial values x_1, ..., x_r, we shall, in conformity with usual statistical terminology, call the procedure consistent for the given H(y | x) and value α.

¹ This work was supported in part by the Office of Naval Research.

In what follows we shall give a particular procedure for estimating θ which is consistent under certain restrictions on the nature of H(y | x). These restrictions are severe, and could no doubt be lightened considerably, but they are often satisfied in practice, as will be seen in Section 4. No claim is made that the procedure to be described has any optimum properties (i.e. that it is "efficient"), but the results indicate at least that the subject of stochastic approximation is likely to be useful and is worthy of further study.

3. Convergence theorems. We suppose henceforth that H(y | x) is, for every x, a distribution function in y, and that there exists a positive constant C such that

(4) Pr[|Y(x)| ≤ C] = ∫_{-C}^{C} dH(y | x) = 1 for all x.

It follows in particular that for every x the expected value M(x) defined by (3) exists and is finite. We suppose, moreover, that there exist finite constants α, θ such that

(5) M(x) ≤ α for x < θ, M(x) ≥ α for x > θ.

Whether M(θ) = α is, for the moment, immaterial. Let {a_n} be a fixed sequence of positive constants such that

(6) Σ_{n=1}^∞ a_n² < ∞.

We define a (nonstationary) Markov chain {x_n} by taking x_1 to be an arbitrary constant and defining

(7) x_{n+1} = x_n + a_n(α − y_n),

where y_n is a random variable such that

(8) Pr[y_n ≤ y | x_n] = H(y | x_n).
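The procedure defined by (7) and (8) is easy to simulate. The sketch below is illustrative rather than part of the paper: the function names, the gain sequence a_n = 1/n, and the toy response model (M(x) = x plus bounded uniform noise, so that θ = α = 0.7) are all assumptions made for the demonstration; note that this toy model satisfies the boundedness condition (4) only near the root, not for all x.

```python
import random

def stochastic_approximation(draw_y, alpha, x1, a, n_steps, seed=0):
    """Iterate x_{n+1} = x_n + a_n * (alpha - y_n), the recursion (7).

    draw_y(x, rng) returns one observation y_n distributed as H(y | x_n),
    as in (8); a(n) gives the n-th member of the gain sequence {a_n}.
    """
    rng = random.Random(seed)
    x = x1
    for n in range(1, n_steps + 1):
        y = draw_y(x, rng)          # observe the response at the current level
        x = x + a(n) * (alpha - y)  # step up if the response was low, down if high
    return x

# Toy response model: M(x) = x plus bounded uniform noise, so theta = alpha.
est = stochastic_approximation(draw_y=lambda x, rng: x + rng.uniform(-1.0, 1.0),
                               alpha=0.7, x1=5.0, a=lambda n: 1.0 / n,
                               n_steps=20000)
```

With {a_n} of type 1/n in the sense defined below, the iterates settle near the root θ = 0.7 of M(x) = α despite never evaluating M itself.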

Let

(9) b_n = E(x_n − θ)².

We shall find conditions under which

(10) lim_{n→∞} b_n = 0

no matter what the initial value x_1 may be. As is well known, (10) implies the convergence in probability of x_n to θ. From (7) we have

(11) b_{n+1} = E(x_{n+1} − θ)² = E[(x_n − θ) − a_n(y_n − α)]².

Setting

(12) e_n = E(y_n − α)²,

(13) d_n = E[(x_n − θ)(M(x_n) − α)] = E[(x_n − θ)(y_n − α)],

the last equality holding because E[y_n | x_n] = M(x_n),

we can write

(14) b_{n+1} − b_n = a_n² e_n − 2a_n d_n.

Note that from (5), d_n ≥ 0, while from (4)

0 ≤ e_n ≤ (C + |α|)² < ∞.

Together with (6) this implies that the positive-term series Σ_{n=1}^∞ a_n² e_n converges. Summing (14) we obtain

(15) b_{n+1} = b_1 + Σ_{j=1}^n a_j² e_j − 2 Σ_{j=1}^n a_j d_j.

Since b_{n+1} ≥ 0, it follows that

(16) Σ_{j=1}^n a_j d_j ≤ ½ [b_1 + Σ_{j=1}^∞ a_j² e_j].

Hence the positive-term series

(17) Σ_{n=1}^∞ a_n d_n

converges. It follows from (15) that

(18) lim_{n→∞} b_n = b_1 + Σ_{n=1}^∞ a_n² e_n − 2 Σ_{n=1}^∞ a_n d_n = b

exists, with b ≥ 0. Now suppose that there exists a sequence {k_n} of nonnegative constants such that

(19) d_n ≥ k_n b_n, Σ_{n=1}^∞ a_n k_n = ∞.

From the first part of (19) and the convergence of (17) it follows that

(20) Σ_{n=1}^∞ a_n k_n b_n < ∞.

From (20) and the second part of (19) it follows that for any ε > 0 there must exist infinitely many values of n such that b_n < ε. Since we already know that b = lim_{n→∞} b_n exists, it follows that b = 0. Thus we have proved

LEMMA 1. If a sequence {k_n} of nonnegative constants exists satisfying (19), then b = 0.

Let

(21) A_n = |x_1 − θ| + (C + |α|)(a_1 + ... + a_{n−1});

then from (4) and (7) it follows that

(22) Pr[|x_n − θ| ≤ A_n] = 1.

Now let

(23) k̄_n = inf [(M(x) − α)/(x − θ)] for 0 < |x − θ| ≤ A_n.

From (5) it follows that k̄_n ≥ 0. Moreover, denoting by P_n(x) the probability distribution of x_n, we have

(24) d_n = ∫ (x − θ)(M(x) − α) dP_n(x) ≥ k̄_n ∫ (x − θ)² dP_n(x) = k̄_n b_n,

the integrals being taken over |x − θ| ≤ A_n, by (22).

It follows that the particular sequence {k̄_n} defined by (23) satisfies the first part of (19). In order to establish the second part of (19) we shall make the following assumptions:

(25) k̄_n ≥ K/A_n for some constant K > 0 and sufficiently large n, and

(26) Σ_{n=1}^∞ a_n/(a_1 + ... + a_n) = ∞.

It follows from (26) that

(27) Σ_{n=1}^∞ a_n = ∞,

and hence for sufficiently large n

(28) 2(C + |α|)(a_1 + ... + a_{n−1}) ≥ A_n.

This implies by (25) that for sufficiently large n

(29) a_n k̄_n ≥ K a_n/[2(C + |α|)(a_1 + ... + a_{n−1})],

and the second part of (19) follows from (29) and (26). This proves

LEMMA 2. If (25) and (26) hold, then b = 0.

The hypotheses (6) and (26) concerning {a_n} are satisfied by the sequence a_n = 1/n, since

Σ_{n=1}^∞ 1/n² < ∞, Σ_{n=1}^∞ (1/n)/(1 + 1/2 + ... + 1/n) = ∞.

More generally, any sequence {a_n} for which there exist two positive constants c′, c″ such that

(30) c′/n ≤ a_n ≤ c″/n

will satisfy (6) and (26). We shall call any sequence {a_n} which satisfies (6) and (26), whether or not it is of the form (30), a sequence of type 1/n.

If {a_n} is a sequence of type 1/n it is easy to find functions M(x) which satisfy (5) and (25). Suppose, for example, that M(x) satisfies the following strengthened form of (5): for some δ > 0,

(5′) M(x) ≤ α − δ for x < θ, M(x) ≥ α + δ for x > θ.

Then for 0 < |x − θ| ≤ A_n we have

(31) (M(x) − α)/(x − θ) ≥ δ/A_n,

so that

(32) k̄_n ≥ δ/A_n,

which is (25) with K = δ. From Lemma 2 we conclude

THEOREM 1. If {a_n} is of type 1/n, if (4) holds, and if M(x) satisfies (5′), then b = 0.

A more interesting case occurs when M(x) satisfies the following conditions:

(33) M(x) is nondecreasing,

(34) M(θ) = α,

(35) M′(θ) exists and is positive.

We shall prove that (25) holds in this case also. From (34) and (35) it follows that

(36) M(x) − α = (x − θ)[M′(θ) + ε(x − θ)],

where ε(t) is a function such that

(37) lim_{t→0} ε(t) = 0.

Hence there exists a constant δ > 0 such that

(38) ε(t) > −½M′(θ) for |t| ≤ δ,

so that

(39) (M(x) − α)/(x − θ) ≥ ½M′(θ) for 0 < |x − θ| ≤ δ.

Hence, for θ + δ ≤ x ≤ θ + A_n, since M(x) is nondecreasing,

(40) (M(x) − α)/(x − θ) ≥ (M(θ + δ) − α)/(x − θ) ≥ δM′(θ)/(2A_n),

and likewise for θ − A_n ≤ x ≤ θ − δ.

Thus, since we may assume without loss of generality that δ/A_n ≤ 1,

(41) k̄_n ≥ δM′(θ)/(2A_n),

so that (25) holds with K = δM′(θ)/2 > 0. This proves

THEOREM 2. If {a_n} is of type 1/n, if (4) holds, and if M(x) satisfies (33), (34), and (35), then b = 0.

It is fairly obvious that condition (4) could be considerably weakened without affecting the validity of Theorems 1 and 2. A reasonable substitute for (4) would be the condition

(4′) |M(x)| < C, ∫_{−∞}^{∞} (y − M(x))² dH(y | x) ≤ σ² < ∞ for all x.

We do not know whether Theorems 1 and 2 hold with (4) replaced by (4′). Likewise, the hypotheses (33), (34), and (35) of Theorem 2 could be weakened somewhat, perhaps being replaced by

(5″) M(x) < α for x < θ, M(x) > α for x > θ.
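Theorem 2 lends itself to a quick numerical check. The sketch below is not from the paper: the response model M(x) = tanh(x − θ) with α = 0, θ = 2, and uniform noise on (−1, 1) is an assumed example, chosen so that (4) holds with C = 2 and (33), (34), (35) are satisfied. It estimates b_n = E(x_n − θ)² of (9) by straightforward Monte Carlo:

```python
import math
import random

def estimate_bn(n_steps, reps=200, seed=1):
    """Monte Carlo estimate of b_n = E(x_n - theta)^2 for the assumed model
    M(x) = tanh(x - theta), alpha = 0, theta = 2, a_n = 1/n, x_1 = 0."""
    theta = 2.0
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        x = 0.0                            # the fixed, arbitrary starting level x_1
        for n in range(1, n_steps + 1):
            # Y(x) = M(x) + noise; |Y| <= 2, so condition (4) holds with C = 2
            y = math.tanh(x - theta) + rng.uniform(-1.0, 1.0)
            x += (1.0 / n) * (0.0 - y)     # rule (7) with alpha = 0
        total += (x - theta) ** 2
    return total / reps

b_50, b_2000 = estimate_bn(50), estimate_bn(2000)
```

Comparing the two estimates shows b_n shrinking as n grows, in line with the conclusion b = 0.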

4. Estimation of a quantile using response, nonresponse data. Let F(x) be an unknown distribution function such that

(43) F(θ) = α (0 < α < 1), F′(θ) > 0,

and let {z_n} be a sequence of independent random variables each with the distribution function Pr[z_n ≤ x] = F(x). On the basis of {z_n} we wish to estimate θ. However, as sometimes happens in practice (bioassay, sensitivity data), we are not allowed to know the values of the z_n themselves. Instead, we are free to prescribe for each n a value x_n and are then given only the values {y_n}, where

(44) y_n = 1 if z_n ≤ x_n, y_n = 0 if z_n > x_n.

Let us proceed as follows. Choose x_1 as our best guess of the value θ, and let {a_n} be any sequence of constants of type 1/n. Then choose values x_2, x_3, ... sequentially according to the rule

(45) x_{n+1} = x_n + a_n(α − y_n).

Since

(46) Pr[y_n = 1 | x_n] = F(x_n), Pr[y_n = 0 | x_n] = 1 − F(x_n),

it follows that (4) holds and that

(47) M(x) = F(x).

All the hypotheses of Theorem 2 are satisfied, so that

(48) lim_{n→∞} x_n = θ

in quadratic mean, and hence in probability. In other words, {x_n} is a consistent estimator of θ. The efficiency of {x_n} will depend on x_1 and on the choice of the sequence {a_n}, as well as on the nature of F(x). For any given F(x) there doubtless exist more efficient estimators of θ than any of the type {x_n} defined by (45), but {x_n} has the advantage of being distribution-free. In some applications it is more convenient to make a group of r observations at the same level before proceeding to the next level. The nth group of observations will then be

(49) y_{n1}, y_{n2}, ..., y_{nr},

all made at the level x_n, using the notation (44). Let ȳ_n be the arithmetic mean of the r values (49). Then, setting

(50) x_{n+1} = x_n + a_n(α − ȳ_n),

we have M(x) = F(x) as before, and hence (48) continues to hold. The possibility of using a convergent sequential process in this problem was first mentioned by T. W. Anderson, P. J. McCarthy, and J. W. Tukey in Naval Ordnance Report No. 65-46 (1946), p. 99.
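The quantile procedure of this section can be sketched as follows. This fragment is illustrative rather than part of the paper: the hidden distribution F (standard logistic, whose α-quantile is log(α/(1 − α))) and the gain sequence a_n = 5/n (still of type 1/n, by (30)) are assumptions made for the demonstration.

```python
import math
import random

def quantile_from_responses(alpha, draw_z, x1=0.0, n_steps=50000, seed=2):
    """Estimate the alpha-quantile theta of an unknown F when, at each chosen
    level x_n, we learn only y_n = 1 if z_n <= x_n, else 0, as in (44)."""
    rng = random.Random(seed)
    x = x1
    for n in range(1, n_steps + 1):
        z = draw_z(rng)               # the hidden observation z_n from F
        y = 1.0 if z <= x else 0.0    # the response/nonresponse datum we are given
        x += (5.0 / n) * (alpha - y)  # rule (45) with a_n = 5/n
    return x

# Hidden F: standard logistic, sampled by inverting its distribution function.
def draw_logistic(rng):
    u = rng.random()
    return math.log(u / (1.0 - u))

est = quantile_from_responses(alpha=0.8, draw_z=draw_logistic)
# True 0.8-quantile of the standard logistic: log(0.8/0.2) = log(4).
```

Nothing about F enters the procedure itself, which is the distribution-free property noted above; the constant in a_n = 5/n affects efficiency but not consistency.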

5. A more general regression problem. It is clear that the problem of Section 4 is a special case of a more general regression problem. In fact, using the notation of Section 2, consider any random variable Y which is associated with an observable value x in such a way that the conditional distribution function of Y for fixed x is H(y | x); the function M(x) is then the regression of Y on x. The usual regression analysis assumes that M(x) is of known form with unknown parameters β_i, for example

(51) M(x) = β_0 + β_1 x,

and deals with the estimation of one or both of the parameters β_i on the basis of observations y_1, y_2, ..., y_n corresponding to observed values x_1, x_2, ..., x_n. The method of least squares, for example, yields the estimators b_i which minimize the expression

(52) Σ_{j=1}^n [y_j − (b_0 + b_1 x_j)]².

Instead of trying to estimate the parameters β_i of M(x) under the assumption that M(x) is a linear function of x, we may try to estimate the value θ such that M(θ) = α, where α is given, without any assumption about the form of M(x). If we assume only that H(y | x) satisfies the hypotheses of Theorem 2, then the sequence of estimators {x_n} of θ defined by (7) will at least be consistent. This indicates that a distribution-free sequential system of making observations, such as that given by (7), is worth investigating from the practical point of view in regression problems. One of us is investigating the properties of this and other sequential designs as a graduate student; the senior author is responsible for the convergence proof in Section 3.
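Finally, a sketch of the distribution-free use of (7) in a regression setting. Everything here is an assumed example rather than the paper's own: the nonlinear regression function M(x) = x + 0.5 sin(x), the target α = 2, and the bounded uniform noise were chosen only so that M is nondecreasing with a positive derivative at the root (as before, the boundedness condition (4) holds only locally in this toy model). The rule locates the level θ with M(θ) = α without fitting any parametric form to M:

```python
import math
import random

def M(x):
    """Assumed regression function of Y on x; the procedure never uses its form."""
    return x + 0.5 * math.sin(x)

def find_root_of_regression(alpha, draw_y, x1=0.0, n_steps=20000, seed=3):
    """Apply rule (7) with a_n = 1/n to locate theta such that M(theta) = alpha."""
    rng = random.Random(seed)
    x = x1
    for n in range(1, n_steps + 1):
        x += (alpha - draw_y(x, rng)) / n   # only observations of Y are used
    return x

est = find_root_of_regression(alpha=2.0,
                              draw_y=lambda x, rng: M(x) + rng.uniform(-1.0, 1.0))
```

A straight-line least-squares fit would be misspecified for this M; the sequential rule asks only that M be nondecreasing through the root with a positive derivative there.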