J. R. Statist. Soc. B (1979), 41, No. 2, pp. 148-177

Bandit Processes and Dynamic Allocation Indices

By J. C. GITTINS
Keble College, Oxford

[Read before the ROYAL STATISTICAL SOCIETY at a meeting organized by the RESEARCH SECTION on Wednesday, February 14th, 1979, the Chairman Professor J. F. C. KINGMAN in the Chair]

SUMMARY
The paper aims to give a unified account of the central concepts in recent work on bandit processes and dynamic allocation indices; to show how these reduce some previously intractable problems to the problem of calculating such indices; and to describe how these calculations may be carried out. Applications to stochastic scheduling, sequential clinical trials and a class of search problems are discussed.

Keywords: BANDIT PROCESSES; DYNAMIC ALLOCATION INDICES; TWO-ARMED BANDIT PROBLEM; MARKOV DECISION PROCESSES; OPTIMAL RESOURCE ALLOCATION; SEQUENTIAL RANDOM SAMPLING; CHEMICAL RESEARCH; CLINICAL TRIALS; SEARCH

1. INTRODUCTION
A scheduling problem
There are n jobs to be carried out by a single machine. The times taken to process the jobs are independent integer-valued random variables. The jobs must be processed one at a time. At the beginning of each time unit any job may be selected for processing, whether or not the job processed during the preceding time unit has been completed, and there is no penalty or delay involved in switching from one job to another. The probability that t+1 time units are required to complete the processing of job i, conditional on more than t time units being needed, is p_i(t) (i = 1, 2, ..., n; t ∈ Z). The reward for finishing job i at time s is a^s V_i (0 < a < 1; V_i > 0, i = 1, 2, ..., n), and there are no other rewards or costs. The problem is to decide which job to process next at each stage so as to maximize the total expected reward.
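To make the reward structure concrete, the following sketch (in Python; the hazard function, reward and discount factor supplied in the example are illustrative choices, not values from the paper) simulates a single job processed without interruption and the discounted reward a^s V it earns.

```python
import random

def simulate_job(p, V, a, rng=random.Random(0)):
    """Process one job from scratch and return its discounted reward a**s * V.

    p(t) is the probability that t+1 time units suffice, given that more than
    t are needed; V is the job's reward and a the discount factor."""
    s = 0
    while True:
        s += 1                       # one more unit of processing
        if rng.random() < p(s - 1):  # the job completes during this unit, at time s
            return a ** s * V

# Illustrative use: a job whose completion probability rises with effort.
example_reward = simulate_job(p=lambda t: min(0.1 * (t + 1), 1.0), V=10.0, a=0.95)
```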

A multi-armed bandit problem
There are n arms which may be pulled repeatedly in any order. Each pull takes one time unit and only one arm may be pulled at a time. A pull may result in either a success or a failure. The sequence of successes and failures which results from pulling arm i forms a Bernoulli process with an unknown success probability θ_i (i = 1, 2, ..., n). A successful pull on any arm at time t yields a reward a^t (0 < a < 1), whilst an unsuccessful pull yields a zero reward. At time zero θ_i has the probability density

{α_i(0) + β_i(0) + 1}! {α_i(0)! β_i(0)!}^{-1} θ^{α_i(0)} (1−θ)^{β_i(0)},

i.e. a beta distribution with parameters (α_i(0), β_i(0)), and these distributions are independent for the different arms. The problem is to decide which arm to pull next at each stage so as to maximize the total expected reward from an infinite sequence of pulls.

From Bayes' theorem it follows that at every stage θ_i has a beta distribution, but with parameters which change at each pull on arm i. If in the first t pulls there are r successes, the new values of the parameters, which we denote by (α_i(t), β_i(t)), are (α_i(0)+r, β_i(0)+t−r). If the (t+1)st pull on arm i takes place at time s, the expected reward, conditional on the record of successes and failures up to then, is a^s times the expected value of a beta variate with parameters (α_i(t), β_i(t)), which is {α_i(t)+1}/{α_i(t)+β_i(t)+2}.

Both the problems described above involve a sequence of decisions, each of which is based on more information than its predecessors, and thus both problems may be tackled by dynamic programming (see Bellman, 1957). This is a computational algorithm based on the principle that "an optimal policy has the property that whatever the initial state and initial decision, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision". This observation means that if the optimal policy from a certain stage (or time) onwards is known, then it is relatively easy to extend this policy so as to give an optimal policy starting one stage earlier. Repetition of this procedure is the basis of an algorithm for solving such problems, which is often described as a process of backwards induction.

A simpler procedure than backwards induction is at each stage to make that decision which maximizes the expected reward before the next decision time. This procedure will be termed a one-step look-ahead policy, following the terminology used by Ross (1970) for stopping problems. The idea is that each decision is based on what may happen in just one further time unit or step. The notion of a one-step look-ahead policy may be extended in the obvious way to form s-step look-ahead policies. In general such policies perform better as s increases and approach optimality as s tends to infinity, whilst the algorithms to which they lead become progressively more complex as s increases.

As a further extension of an s-step look-ahead policy we may allow the number of steps τ which we look ahead at each stage to depend in an arbitrary manner on what happens whilst those steps are taking place, so that τ is a random variable. Given any rule for taking our sequence of decisions, τ may be chosen so as in some sense to maximize the expected rate of reward per step for the next τ steps. A second maximization with respect to decision rules selects a decision rule. Our extended look-ahead policy starts by following the decision rule just described for the random number of steps τ. The process of finding a decision rule, and a corresponding random number of further steps τ', is then repeated with respect to the state reached after the first τ steps. The new rule is followed for the next τ' steps, and the process may be repeated indefinitely. In this way a rule is defined which specifies the decision to be made at every stage. Such a rule will be termed a forwards induction policy, in contrast with the backwards induction of dynamic programming. A formal definition is given in Section 3.
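As a concrete rendering of the Bayesian updating used in the multi-armed bandit example above, the sketch below (function names are ours, not the paper's) performs the parameter update after one pull and evaluates the quantity that a one-step look-ahead policy would maximize.

```python
def update(alpha, beta, success):
    """Parameter update after one pull: (alpha, beta) becomes (alpha + 1, beta)
    on a success and (alpha, beta + 1) on a failure."""
    return (alpha + 1, beta) if success else (alpha, beta + 1)

def one_step_value(alpha, beta):
    """Expected reward from the next pull, E(theta) = (alpha + 1)/(alpha + beta + 2);
    a one-step look-ahead policy pulls an arm maximizing this quantity."""
    return (alpha + 1) / (alpha + beta + 2)
```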
Forwards induction policies are optimal for a class of problems, which includes the two problems described above, in which effort is allocated in a sequential manner between a number of competing candidates for that effort, a result which will be described as the forwards induction theorem. These candidates will be described as alternative bandit processes. From the optimality of forwards induction policies it follows that a dynamic allocation index (DAI) may be defined on the state space of each bandit process, with the property that an optimal policy must at each stage allocate effort to one of those bandit processes with the largest DAI value. This result will be described as the DAI theorem and the policy as a DAI policy. The proofs of these results will be published separately (Gittins, 1979).

The existence of a function with this property, and the fact that it may be written in the form used here, were proved in earlier papers (Gittins and Jones, 1974a; Gittins and Glazebrook, 1977) without using the concept of a forwards induction policy, and the particular cases discussed in the present paper depend only on these results. The approach via the forwards induction theorem has the advantage that it is intuitively plausible that such a result should hold, and it leads naturally, as we shall see, to the general functional form of the dynamic allocation index. Moreover, the forwards induction theorem continues to hold under appropriate conditions, and essentially the same proof works, if bandit processes arrive in a random manner, or are subject to precedence constraints. This leads to results analogous to the DAI theorem in the theories of priority queues and of more complex stochastic scheduling situations. Some of these applications have been described by Nash (1973) and Glazebrook (1976a, b), respectively. A more complete account, using the simplifying concept of a forwards induction policy, will be published in due course. Sometimes, too, as shown by Glazebrook (1978a), a decision problem may be simplified by expressing just part of the problem in terms of bandit processes.

In the present paper these extensions are mentioned only in passing. The aims are: (i) to give a unified account, in the context of Markov decision processes and without detailed proofs, of the central concepts in recent work on bandit processes and DAIs; (ii) to show how these concepts reduce some previously intractable problems to the problem of calculating DAIs; and (iii) to describe how these calculations may be carried out.

A bandit process is defined in Section 2, and the main theorems are formally stated and discussed in Section 3. In Section 4 the general functional form of the DAI is examined more closely, and Section 5 shows how this simplifies under certain conditions. Formulae for the DAI function for the scheduling problem are derived in Section 6. Possible applications include the scheduling of jobs on a computer and the allocation of effort between competing research projects. A method of calculating, and the general form of, the DAI function for the multi-armed bandit are described in Section 7. Section 8 describes a method of calculating the DAI function for any bandit process. The main possibility of applying the results of Section 7 is in clinical trials. DAI functions for similar, and sometimes more realistic, problems for which the result of each trial is a normally distributed random variable are discussed in Section 9.
In Section 10 a variant of the multi-armed bandit problem is considered in which the object is to minimize the expected number of trials up to the first success, rather than to maximize the expected value of an infinite stream of successes and failures. Once again a version of the problem for which the distribution of scores on each trial is normally distributed is of interest, as well as the Bernoulli trials version. This problem has possibilities of application to the screening of chemicals in pharmaceutical research.

For the sake of simplicity attention is restricted to discrete-time bandit processes. Every result mentioned here also has a continuous-time counterpart, which may be obtained by letting the discrete-time quantum tend to zero in an appropriate fashion. For example, Nash and Gittins (1977) establish the continuous-time version of the optimal policy for the scheduling problem, though using a different method.

2. BANDIT PROCESSES
All the processes considered are indexed by a time variable whose value set is the non-negative integers, which we denote by Z. They are also stationary, i.e. their properties involve no explicit time-dependence, and are particular types of Markov decision process. It may be noted that the assumption of stationarity rules out versions of the allocation problems considered with finite time horizons. The reason for the restriction (see Gittins, 1975, and Gittins and Nash, 1977) is that DAI policies are not in general optimal in such cases.

A Markov decision process is defined on a state-space Θ, together with a σ-algebra ℰ of subsets of Θ which includes every subset consisting of just one element of Θ. When the process is in state x the set of controls which may be applied is Ω(x). P(A | x, u) is the probability that the state y of the process at time t+1 belongs to A (∈ ℰ), given that at time t the process is in state x and control u (∈ Ω(x)) is applied. Application of control u at time t with the process in state x yields a reward a^t R(x, u) (0 < a < 1). The functions P(A | ·, u) and R(·, u) are ℰ-measurable.

A policy for a Markov decision process is any rule, including randomized rules, which for all t specifies the control to be applied at time t as a function of t, the states at times 0, 1, 2, ..., t, and the controls applied at times 0, 1, 2, ..., t−1; we shall describe this by saying that the control at time t is sequentially determined. Deterministic policies are those which involve no randomization. Stationary policies are those which involve no explicit time-dependence. Markov policies are those for which the control chosen at time t is independent of the states and the controls applied at times 0, 1, 2, ..., t−1.

Blackwell (1965) has shown that if the control set Ω(x) is finite and the same for all x then there is a deterministic stationary Markov policy for which, for any initial state, the total expected reward is the supremum of the total expected rewards for the class of all policies. We shall refer to this result as Blackwell's theorem, and to a policy which achieves the supremum just mentioned as an optimal policy. It is assumed throughout the paper that Ω(x) is finite for all x and that the supremum of the total expected reward is finite. To a large extent, therefore, attention may be restricted to deterministic stationary Markov policies. Such a policy is defined by an ℰ-measurable function g on Θ such that g(x) ∈ Ω(x), ∀ x.

A bandit process is a Markov decision process for which Ω(x) = {0, 1}, ∀ x. The control 0 freezes the process in the sense that P({x} | x, 0) = 1 and R(x, 0) = 0, ∀ x. Control 1 is termed the continuation control. No restriction is placed on the transition probabilities and rewards if control 1 is applied. The number of times control 1 has been applied to a bandit process is termed the process time. The state at process time t is denoted by x(t). The reward between times t and t+1 if control 1 is applied at each stage, so that process time coincides with real time, is a^t R(x(t), 1), which we abbreviate to a^t R(t). A standard bandit process is a bandit process for which, for some λ, R(x, 1) = λ, ∀ x.

An arbitrary policy for a bandit process is termed a freezing rule. Given any freezing rule f the random variables f(t), t ∈ Z, are sequentially determined, where f(t) (≥ f(t−1)) is the number of times control 0 is applied before the (t+1)st application of control 1. Deterministic stationary Markov policies divide the state space Θ into a stopping set, on which control 0 is applied, and a continuation set, on which control 1 is applied.
They are clearly such that f(t) = 0, ∀ t < τ, and f(τ) = ∞, for some sequentially determined random variable τ, which may take the value infinity with positive probability. These properties define a stopping rule, and τ is the associated stopping time. Stopping rules have been extensively studied, for the most part in the context of stopping problems (e.g. see Chow et al., 1971), which may be regarded as being defined by bandit processes for which R(x, 0) ≠ 0. Frequent reference will be made to stopping times. It should be noted that the definition is as above, and there is no implication that the process concerned actually does stop at such a time.

The following notation will be used in conjunction with an arbitrary bandit process D. R_f(D) denotes the expected total reward under the freezing rule f. Thus

R_f(D) = E Σ_{t=0}^{∞} a^{t+f(t)} R(t),    R'(D) = sup_f R_f(D).

W_f(D) = E Σ_{t=0}^{∞} a^{t+f(t)},    ν_f(D) = R_f(D)/W_f(D),    ν'(D) = sup_{f: f(0)=0} ν_f(D).

Similarly, for stopping rules,

R_τ(D) = E Σ_{t=0}^{τ−1} a^t R(t),    R(D) = sup_τ R_τ(D),    W_τ(D) = E Σ_{t=0}^{τ−1} a^t,

ν_τ(D) = R_τ(D)/W_τ(D)    and    ν(D) = sup_{τ>0} ν_τ(D).

From Blackwell's theorem it follows that R'(D) = R(D), that ν'(D) = ν(D) (though this is less obvious) and is an ℰ-measurable function of x, and that stopping times exist for which the respective suprema are attained. All these quantities naturally depend on the initial state x(0) of the bandit process D. When necessary R_f(D, x) and W_f(D, x), for example, will be used to indicate the values of R_f(D) and W_f(D) when x(0) = x.

The quantities ν_f(D) and ν_τ(D) are thus expected rewards per unit of discounted time under f and τ respectively. The conditions f(0) = 0 and τ > 0 in the definitions of ν'(D) and ν(D) mean that the policies considered are all such that at time zero control 1 is applied. This restriction is required to rule out zero denominators W_f(D) or W_τ(D). In the case of ν'(D) it also has the effect of removing a common factor from the numerator and denominator of ν_f(D) for those f for which f(0) ≠ 0, and otherwise implies no loss of generality. The class of stopping times {τ > 0} is stationary from time 1 onwards, rather than from time 0.

For reasons which will become apparent in the next section, in which the forwards induction theorem and the DAI theorem are formally stated, ν(D, x) is defined to be the dynamic allocation index for the bandit process D when it is in state x.
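For a bandit process with only a handful of states, ν(D, x) can be computed directly from these definitions by enumerating stopping sets. The sketch below is our own construction (not an algorithm given in the paper): it solves the linear equations for R_τ and W_τ determined by each candidate stopping set and returns the largest ratio.

```python
import numpy as np
from itertools import combinations

def dai_brute_force(P, R, x0, a=0.9):
    """v(D, x0) = sup over stopping times of R_tau/W_tau for a finite-state
    bandit with transition matrix P and reward vector R under control 1.

    Each subset Theta_0 of states defines tau = min{t >= 1 : x(t) in Theta_0};
    on the continuation set N = R + a P N and W = 1 + a P W, with N = W = 0 on
    Theta_0.  Only feasible for very small state spaces; purely illustrative."""
    P, R = np.asarray(P, dtype=float), np.asarray(R, dtype=float)
    n = len(R)
    best = -np.inf
    for k in range(n + 1):
        for stop in combinations(range(n), k):
            cont = [s for s in range(n) if s not in stop]
            N, W = np.zeros(n), np.zeros(n)
            if cont:
                A = np.eye(len(cont)) - a * P[np.ix_(cont, cont)]
                N[cont] = np.linalg.solve(A, R[cont])
                W[cont] = np.linalg.solve(A, np.ones(len(cont)))
            num = R[x0] + a * P[x0] @ N      # tau > 0: the first step is always taken
            den = 1.0 + a * P[x0] @ W
            best = max(best, num / den)
    return best
```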

3. THE MAIN THEOREMS
We begin with some further terminology and notation. Given any Markov decision process ℳ, together with a deterministic stationary Markov policy g, a bandit process may be defined by introducing the freeze control 0 with the usual properties, and requiring that at each time t either the control 0 or the control given by g be applied. This bandit process is termed the superprocess (ℳ, g). Thus application of the continuation control 1 to (ℳ, g), when ℳ is in state x, is equivalent to applying control g(x) to ℳ. The idea of a superprocess is due to Nash (1973), who used it to show that the DAI theorem may be extended to cover the case when new bandit processes arrive in a Poisson process. The following notation extends that already set up for a bandit process:

R_{g,τ}(ℳ) = R_τ((ℳ, g)),    W_{g,τ}(ℳ) = W_τ((ℳ, g)),    ν_{g,τ}(ℳ) = R_{g,τ}(ℳ)/W_{g,τ}(ℳ),

ν_g(ℳ) = sup_{τ>0} ν_{g,τ}(ℳ),    ν(ℳ) = sup_g ν_g(ℳ).

Since for an arbitrary bandit process D there is a stopping time τ for which the supremum is attained in the definition of ν(D) it follows that the same is true of ν_g(ℳ). Also if the control set Ω(x) is finite for all x then Blackwell's theorem may be extended to show that, for some g, ν_g(ℳ) = ν(ℳ), and ν(ℳ) is unaltered if g is allowed to range over the entire set of policies for ℳ. As for bandit processes, ν(ℳ, x), for example, denotes the value of ν(ℳ) when ℳ is initially in state x.

With this notation we are now in a position to give a formal definition of a forwards induction policy for the Markov decision process ℳ, whose state at time zero we denote by x_0. The first step is to find a policy γ_1 and a stopping time σ_1 such that the discounted average reward per unit time of the superprocess (ℳ, g) up to the stopping time τ (> 0) is maximized over all g and τ by setting (g, τ) = (γ_1, σ_1). Thus ν_{γ_1, σ_1}(ℳ) = ν(ℳ) (= ν(ℳ, x_0)). Let x_1 be the (random) state of the superprocess (ℳ, γ_1) at time σ_1. We now define the policy γ_2 and the stopping time σ_2 to be such that ν_{γ_2, σ_2}(ℳ, x_1) = ν(ℳ, x_1). In general γ_2 and σ_2 depend on x_1, and are such that the discounted average reward per unit time of (ℳ, g) up to τ (> 0) is maximized when (g, τ) = (γ_2, σ_2) if ℳ is initially in state x_1.

A forwards induction policy for ℳ starts by applying policy γ_1 up to time σ_1, and then applies policy γ_2 up to time σ_1 + σ_2. Let x_2 be the state of ℳ at this stage, and define γ_3 and σ_3 to be such that ν_{γ_3, σ_3}(ℳ, x_2) = ν(ℳ, x_2). A forwards induction policy continues by applying policy γ_3 between times σ_1 + σ_2 and σ_1 + σ_2 + σ_3. Let x_3 be the state of ℳ at this stage, and define γ_4 and σ_4 to be such that ν_{γ_4, σ_4}(ℳ, x_3) = ν(ℳ, x_3). A forwards induction policy applies γ_4 between times σ_1 + σ_2 + σ_3 and σ_1 + σ_2 + σ_3 + σ_4. This process may obviously be continued indefinitely, thus defining the class of forwards induction policies for the Markov decision process ℳ. There may be more than one such policy for the same x_0 since there may, for example, be more than one γ_1 and σ_1 such that ν_{γ_1, σ_1}(ℳ) = ν(ℳ).

The term forwards induction policy is in contrast to a backwards induction policy derived from the dynamic programming optimality principle quoted in Section 1. This principle leads to a recurrence relation which goes backwards in time (equations (13), (14) and (16) are examples), from which an optimal policy may be determined by backwards induction. With a forwards induction policy, at each successive stopping time the expected reward per unit of discounted time up to the next stopping time is maximized, so the policy is defined by a sequence of steps proceeding forwards in time. The step length is the sequentially determined stopping time σ_r, r = 1, 2, ....

Forwards induction policies are often easier to determine than backwards induction policies. However, unless suitable restrictions are put on ℳ they are not optimal. Fortunately there is one quite large class of Markov decision processes for which forwards induction policies are optimal, as well as being relatively simple to determine. These are the processes which may be regarded as simple families of alternative bandit processes.

A family of alternative bandit processes is formed by bringing together a set of n bandit processes, with the constraint that control 1 must be applied to just one bandit process at a time, so that control 0 is applied to the other n−1 bandit processes.
The reward at time t is the reward yielded by the bandit process to which control 1 is applied at time t. Thus at each stage the bandit processes are alternative candidates for continuation. We shall suppose that there are no constraints restricting the set of bandit processes which may be chosen for continuation at any time. In the absence of such constraints a family of alternative bandit processes will be described as simple. We may now state the following theorem.

The Forwards Induction Theorem. For a simple family of alternative bandit processes a policy is optimal if and only if it coincides almost always with a forwards induction policy.

In order to gain some feeling for why it is that a forwards induction policy is optimal for simple families of alternative bandit processes, but not for all Markov decision processes, consider the problem of choosing a route for a journey by car. Suppose there are several different possible routes all of the same length which intersect at various points, and the object is to choose that route which minimizes the time taken. The problem may be modelled as a Markov decision process by interpreting the distance so far covered as the "time" variable, the time taken to cover each successive mile as minus the reward, position as the state, and choosing a value just less than one for the discount factor a. The control set Ω(x) has more than one element when the state x corresponds to a cross-roads, the different controls representing the various possible exits. For this problem the first stage in a forwards induction policy is to find a route γ_1, and a distance σ_1 along γ_1 from the starting point, such that the average speed in travelling the distance σ_1 along γ_1 is maximized. Thus a forwards induction policy might very well start with a short stretch of motorway, which then must be followed by a very slow section, in preference to a trunk road which permits a good steady average speed. The trouble is that irrevocable decisions have to be taken at each cross-roads in the sense that those exits which are not chosen are not available later on.

The distinctive property of a simple family of alternative bandit processes is that decisions are not in this sense irrevocable, since any bandit process which is available for continuation at some stage, and which is not then chosen, may be continued at any later stage, and with exactly the same resulting sequence of rewards, apart from the discount factor. This means there is no later advantage to compensate for the initial disadvantage of not choosing a forwards induction policy.

The first stage of a forwards induction policy is such that the expected reward per unit of discounted time up to an arbitrary stopping time is maximized. For a simple family of alternative bandit processes it is intuitively plausible, and it can be rigorously shown, that this maximum is attainable by a policy under which just one of the alternative bandit processes is continued up to the stopping time in question. The reason is that if more than one bandit process were to be continued during the first stage, then the expected reward per unit of discounted time during the first stage would be a weighted average of the expected rewards per unit of discounted time for each of the bandit processes to be continued. Since a weighted average is never larger than the largest of the quantities averaged it follows that there is no point in averaging over more than one quantity, i.e. no point in continuing more than one bandit process during the first stage. This observation may be developed as a formal proof.

Now for any single bandit process D in state x the maximum expected reward per unit of discounted time up to an arbitrary stopping time is by definition the DAI, ν(D, x). In the light of the previous paragraph it thus follows that at time zero one of the bandit processes whose DAI is then maximal should be continued. This leads to

The DAI Theorem. For a simple family of alternative bandit processes a policy is optimal if and only if at each stage the bandit process selected for continuation is almost always one of those whose dynamic allocation index is then maximal.
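Operationally, the theorem reduces an optimal policy to a comparison of current index values. A minimal sketch (the names are ours; `index(i, state)` is assumed to return the DAI of process i in the given state):

```python
def dai_choice(states, index):
    """Select the bandit process to continue: any one whose DAI is maximal.
    states[i] is the current state of process i and index(i, state) its DAI."""
    return max(range(len(states)), key=lambda i: index(i, states[i]))
```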

4. A MORE PRECISE CHARACTERIZATION OF THE DAI
The final result of this section leads to the algorithm described in Section 7 for calculating the DAI for the multi-armed bandit problem. The proofs indicate the kind of argument required in proving the two main theorems. As mentioned in Section 2, the DAI for a bandit process D in state x may be written as

ν(D, x) = sup_{f: f(0)=0} [E{Σ_{t=0}^{∞} a^{t+f(t)} R(x(t), 1) | x(0) = x} / E{Σ_{t=0}^{∞} a^{t+f(t)} | x(0) = x}]   (1)

= sup_{τ>0} ν_τ(D, x) = sup_{τ>0} [E{Σ_{t=0}^{τ−1} a^t R(x(t), 1) | x(0) = x} / E{Σ_{t=0}^{τ−1} a^t | x(0) = x}].   (2)

The expression (2) uses the fact that the set of freezing rules over which the supremum is taken in (1) may be restricted to those which, from process time 1 onwards, are determined by a stopping set Θ_0 and a complementary continuation set Θ_1. We now proceed to prove the following lemma.

Lemma. The supremum in (2) is attained by setting Θ_0 = {y ∈ Θ: ν(D, y) < ν(D, x)}.

Proof. Dropping the condition x(0) = x from the notation, we have, for any non-random s ∈ Z+ and for any stopping time τ,

ν_τ(D, x) = [E Σ_{t=0}^{σ−1} a^t R(x(t), 1) + E{E(Σ_{t=σ}^{τ−1} a^t R(x(t), 1) | x(s))}] / [E Σ_{t=0}^{σ−1} a^t + E{E(Σ_{t=σ}^{τ−1} a^t | x(s))}],   (3)

where σ = min(s, τ), and the inner expectations in both numerator and denominator are conditional on the value taken by the random variable x(s). Now if τ > s then

E{Σ_{t=s}^{τ−1} a^t R(x(t), 1) | x(s)} / E{Σ_{t=s}^{τ−1} a^t | x(s)} = ν_{τ−s}(D, x(s)) ≤ ν(D, x(s)).   (4)

From (3) and (4) it follows that if the probability of the event E_s = {τ > s} ∩ {ν(D, x(s)) < ν_τ(D, x)} is positive, and the random integer ρ is defined to take the value s when E_s occurs and otherwise to equal τ, then ν_ρ(D, x) > ν_τ(D, x). Thus if τ is such that the supremum is attained in (2) we must have P(∪_{s=1}^{∞} E_s) = 0. This is equivalent to saying that the probability that, starting in state x, the bandit process D passes through a state which belongs to the set defined in the statement of the lemma before process time τ is zero. Thus the stopping set Θ_0 which defines τ must include the given set, except perhaps for a subset which is reached before τ with probability zero.

A similar argument shows that P{ν(D, x(τ)) > ν(D, x) | x(0) = x} = 0, since otherwise ν_τ(D, x) could be increased by increasing τ in an appropriate fashion for those realizations of D for which ν(D, x(τ)) > ν(D, x). A further similar argument shows that ν_τ(D, x) is unaffected by the inclusion or exclusion from Θ_0 of states belonging to the set {y ∈ Θ: ν(D, y) = ν(D, x)}.

From the three preceding observations and since (i) for some τ, ν_τ(D, x) = ν(D, x), (ii) ν_τ(D, x) is unchanged by changes in Θ_0 on sets which are reached by time τ with probability zero, and (iii) ν(D, ·) is an ℰ-measurable function, it follows that Θ_0 may be chosen as the lemma states. This completes the proof.

A point to be noted in the proof is that, unlike τ, the random time ρ is not necessarily defined by a freezing rule which is stationary or Markov from time 1 onwards. However, this does not invalidate the proof, since ρ is defined by some freezing rule, and the freezing rules in (1) are not restricted to be stationary or Markov.

For the purposes of the algorithm described in Section 7 we need to consider what happens when the set of stopping times {τ > 0} is modified by allowing the stopping set Θ_0 to depend on the process time t, and by imposing the restriction τ ≤ M, where M is a non-random integer. This new set of stopping times will be denoted by {0 < τ ≤ M} and we define

ν^M(D, x) = sup_{0<τ≤M} ν_τ(D, x).   (5)

By an argument parallel to that of the lemma, the supremum in (5) is attained by setting

Θ_0(t) = {y ∈ Θ: ν^{M−t}(D, y) < ν^M(D, x)},   0 < t < M.

5. IMPROVING AND DETERIORATING DAIS
In this section we describe two cases for which the definition of the DAI leads directly to an expression from which particular values may be determined in a straightforward manner. Consider first any bandit process D, an arbitrary state of which is denoted by x. Dropping D from the notation, we have

ν(x) = sup_{τ>0} ν_τ(x) = sup_{τ>0} R_τ(x)/W_τ(x) = sup_{σ≥0} [R(x, 1) + a E{R_σ(x(1)) | x(0) = x}] / [1 + a E{W_σ(x(1)) | x(0) = x}],   (6)

where τ and σ are stopping times, and τ is restricted to be positive.

Case 1 (the deteriorating case): P{ν(x(1)) ≤ ν(x(0)) | x(0) = x} = 1.
Since ν(x(1)) = sup_{σ>0} {R_σ(x(1))/W_σ(x(1))} it follows immediately from (6) that ν(x) = R(x, 1).

For Case 1 our conclusion, then, is particularly simple. The process for which R(x, 1) is largest at any particular time is the process which yields the largest immediate reward if it is continued, and the DAI theorem tells us that this is the process which should be continued. Thus the one-step look-ahead policy is optimal. Since Case 1 covers a situation in which the future prospects of gain from a process are bound to deteriorate when it is continued, such a conclusion is not unexpected. The deteriorating case may be compared with the monotone case in the study of stopping problems, which is discussed by Chow et al. (1971). Here too, and for similar reasons, the solution is particularly simple.

It is easy to see that a sufficient condition for P{ν(x(1)) ≤ ν(x(0)) | x(0) = x} = 1 is that P{R(x(t+1), 1) ≤ R(x(t), 1) | x(t) = y} = 1, for all states y which may be reached from x in any number of steps.

Case 2 (the improving case): P{ν(x(s)) ≥ ν(x(s−1)) ≥ ... ≥ ν(x(1)) ≥ ν(x(0)) | x(0) = x} = 1, for some non-random integer s.
From (6) it follows that for this case

ν(x) = sup_{σ} [E{Σ_{t=0}^{s−1} a^t R(x(t), 1) | x(0) = x} + a^s E{R_σ(x(s)) | x(0) = x}] × [1 + a + ... + a^{s−1} + a^s E{W_σ(x(s)) | x(0) = x}]^{-1},

which simplifies if the defining condition holds for all s and if we set s = ∞. This will, for example, be so if the defining condition holds for s = 1 and for all x.

Cases 1 and 2 are illustrated by the scheduling problem.

6. THE SCHEDULING PROBLEM
Let D be a bandit process such that Θ = {C} ∪ Z, P({C} | C, 1) = 1, P({C} | t, 1) = p(t), P({t+1} | t, 1) = 1 − p(t), R(C, 1) = 0 and R(t, 1) = p(t) V, ∀ t ≥ 0. A bandit process with these properties corresponds to one of the jobs in the scheduling problem described in Section 1. If the bandit process is in state C this signifies that the job has been completed. Thus unless the job has reached state C its state coincides with the process time if x(0) = 0. Also, it is true generally that a^t R(x, 1) may be taken to be the expected reward if control 1 is applied at time t with the process in state x, and this device has been used here. It may be noted that, unlike the multi-armed bandit problem, the scheduling problem does not involve probability distributions with unknown parameters. However, it is a simple matter (see Gittins and Glazebrook, 1977) to extend the discussion which follows to include this possibility.

If p(t) is a non-increasing function of t then D is a deteriorating bandit process, since the sufficient condition for Case 1 holds for all x. Thus with jobs of the above type the job to be continued at any time is one of those for which p_i(t_i) V_i is largest, where i runs over the set of uncompleted jobs.

A job for which p(t) is a non-decreasing function of t provides an example of a modification of Case 2. We now have

P{ν(x(s)) ≥ ν(x(s−1)) ≥ ... ≥ ν(x(1)) ≥ ν(x(0)) > 0 | x(0) = x, x(s) ≠ C} = 1,   s ∈ Z,

and ν(C) = 0. It thus follows from (6) that

ν(x) = E{Σ_{t=0}^{τ−1} a^t R(x(t), 1) | x(0) = x} / E{1 + a + ... + a^{τ−1} | x(0) = x},   (7)

where τ = min{s: x(s) = C}. Equation (7) may be rewritten in the form

ν(x) = V(1−a) E(a^{τ−1} | x(0) = x) / {1 − E(a^τ | x(0) = x)}.

For an arbitrary job, with no restriction on the function p(t), it is easy to see that the τ for which the supremum in (6) is attained is no greater than the time taken to complete the job. For uncompleted jobs the state coincides with process time, so that the stopping set which defines τ must be reached at some non-random (and possibly infinite) time r. Thus τ is of the form min{r, min[s: x(s) = C]}. It follows that ν(C) = 0, and

ν(x) = sup_{r>0} ν_τ(x) = sup_{r>0} [V(1−a){E(a^{ρ−1}) − P(ρ > r) E(a^{ρ−1} | ρ > r)}] / [1 − E(a^ρ) + P(ρ > r){E(a^ρ | ρ > r) − a^r}],   (8)

where ρ = min{s: x(s) = C} and the expectations and probability are all conditioned by the event x(0) = x ≠ C.

It is interesting to see what happens to a problem involving jobs of this type as a tends to one. The reward V a^t on completion of the job at time t may be expressed as V{1 − t(1−a)} + o(1−a). Thus, if 1−a is small, the largest contribution to the reward which depends on t, and thereby on the policy for allocating effort to the job, is given by the term −Vt(1−a). Not surprisingly, therefore, given a set of jobs for which the penalties for delays in completion are proportional to the extent of the delays, the DAI policies defined by letting a tend to one in (8), and setting V equal to the cost c of unit delay for the particular job, are optimal. The limit of (8) as a tends to one may be written

ν(x) = sup_{r>0} [c Σ_{i=0}^{r−1} p(x+i) P{ρ > i | x(0) = x}] / [Σ_{i=0}^{r−1} P{ρ > i | x(0) = x}].   (9)

The optimality of the DAI policy based on this expression for the scheduling problem with penalties proportional to the delays was first demonstrated, using a different method, by Sevcik (1972), whose primary interest was the scheduling of jobs on a computer. Models of this type are also applicable to the planning of industrial chemical research (e.g. see Gittins, 1973). Nash (1973) has shown that the DAI policy remains optimal for the case with random arrivals. This result is an important contribution to the theory of priority queues, as is made clear by Simonovits (1973), and is perhaps the most striking consequence of the DAI theorem which has so far been obtained.
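The limiting index (9) is easy to evaluate numerically for a single job. The sketch below is our own (the truncation level r_max is an illustrative choice, not part of the paper): it accumulates the completion probability and the expected processing time over the next r units and keeps the largest ratio.

```python
def scheduling_index(p, x, c, r_max=500):
    """Equation (9): the a -> 1 index of a job still uncompleted at process time x,
    with completion probabilities p(t) and delay cost c per unit time.

    For non-increasing p the maximum is attained at r = 1, so the index reduces
    to c * p(x) and the rule processes the job with the largest c_i * p_i(t_i)."""
    best, survive, num, den = 0.0, 1.0, 0.0, 0.0
    for i in range(r_max):
        num += c * p(x + i) * survive    # completes at exactly the (i+1)st further unit
        den += survive                   # still being processed at the (i+1)st further unit
        best = max(best, num / den)
        survive *= 1.0 - p(x + i)
    return best
```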

7. THE MULTI-ARMED BANDIT PROBLEM
Let D be a bandit process whose states are a class of probability distributions for a random variable θ defined on [0, 1]. Continuation of D at process time t is defined as observing the (t+1)st member of a sequence of independently distributed random variables X_1, X_2, ..., each of which is equal to 1 with probability θ and equal to 0 with probability 1−θ. If x(t) has a continuous density π(θ), then x(t+1) has a density proportional to π(θ) θ^{X_{t+1}} (1−θ)^{1−X_{t+1}}, as follows from Bayes' theorem. If X_{t+1} = 1 a reward a^s accrues, where s is the time at which X_{t+1} is observed, and a zero reward if X_{t+1} = 0. As in Case 1, a^s R(x, 1) is taken to be the expected reward which accrues at time s if D is then in state x and is continued. Thus

R(x, 1) = E{E(X_s | θ)} = E(θ) = ∫_0^1 θ π(θ) dθ.   (10)

Clearly the multi-armed bandit problem described in Section 1 amounts to finding an optimal policy for a simple family of alternative bandit processes of this type.

This problem owes its picturesque name to its resemblance to the situation facing a gambler with a choice between several one-armed bandits (or just one multi-armed bandit). It is an intriguing problem, on which a considerable number of papers have been written, recent examples being those by Wahrenberger et al. (1977) and Rodman (1978). This is probably because it is the simplest worthwhile problem in the sequential design of experiments. Its chief practical significance is in the context of clinical trials. Bellman (1956) gave the first Bayesian formulation and obtained some properties of the optimal policy and maximum expected reward for the case when there are two "arms" (i.e. bandit processes), one of them a standard process.

As in Section 1 we shall suppose that x(0), and therefore x(t), t ∈ Z+, is a beta distribution. As pointed out by Raiffa and Schlaifer (1961), this greatly simplifies any calculations, whilst the two parameters α(0) and β(0) allow an arbitrary specification of the mean and variance of the prior distribution, which for many purposes is quite adequate. Thus an arbitrary state of D may be represented by the corresponding parameter values (α, β).

Applying Corollary 2 of Section 4 (and using the notation of that section), we have, if α, β and N are non-negative integers with N > α+β,

ν^{N−α−β}(α, β) = sup_μ [R(α, β; 1) + Σ_{m=1}^{N−α−β−1} a^m Σ_{r=0}^{m} Q(r, α, β, m, μ) R(α+r, β+m−r; 1)] / [1 + Σ_{m=1}^{N−α−β−1} a^m Σ_{r=0}^{m} Q(r, α, β, m, μ)].   (11)

Here Q(r, α, β, m, μ) = P{(α(m), β(m)) = (α+r, β+m−r) ∩ ν^{N−α−β−t}(α(t), β(t)) > μ, 1 ≤ t ≤ m | (α(0), β(0)) = (α, β)}.

The expression (11) leads to the following algorithm for calculating the function ν^{N−α−β}(α, β) for a given value of N.

(1) If α+β = N−1, the stopping time τ in the definition of ν^{N−α−β}(α, β) must be equal to one. Thus, using equation (10),

ν^{N−α−β}(α, β) = R(α, β; 1) = (α+1)/(α+β+2).   (12)

(2) Equation (12) enables us to calculate the function Q(r, α, β, m, μ) for α+β = N−2, m = 1 and r = 0, 1. We have Q(0, α, β, 1, μ) = P{X_1 = 0 | (α(0), β(0)) = (α, β)} if ν^{N−α−β−1}(α, β+1) > μ, and Q(0, α, β, 1, μ) = 0 otherwise; similarly Q(1, α, β, 1, μ) = P{X_1 = 1 | (α(0), β(0)) = (α, β)} if ν^{N−α−β−1}(α+1, β) > μ, and Q(1, α, β, 1, μ) = 0 otherwise. Substitution in (11) then gives ν^{N−α−β}(α, β) for α+β = N−2, and repetition of this procedure for successively smaller values of α+β gives the function ν^{N−α−β}(α, β) for all non-negative integers α and β with α+β < N.

Note that ν(α, β) > R(α, β; 1), as is obvious from the definition of a DAI. This corresponds to the possibility that θ may be greater than R(α, β; 1).
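The recursion (11) can be coded directly, but an equivalent and somewhat shorter route to the same finite-horizon values (our own construction, not the algorithm of the paper) uses the fact that ν^{N−α−β}(α, β) is the value μ at which the optimal stopping problem with per-step reward a^t{R(t) − μ}, forced to continue at the first step and truncated after N−α−β pulls, has value zero. The truncation and tolerance below are illustrative choices.

```python
def bernoulli_dai(alpha, beta, a=0.75, horizon=40, tol=1e-6):
    """Finite-horizon approximation to the DAI of a Bernoulli arm in state (alpha, beta)."""
    def r(al, be):                           # expected immediate reward, as in (10) and (12)
        return (al + 1.0) / (al + be + 2.0)

    def forced_value(mu):
        f = {}                               # value after `horizon` pulls: stop, worth 0
        for t in range(horizon - 1, 0, -1):  # t pulls already made
            g = {}
            for i in range(t + 1):           # i successes among those t pulls
                al, be = alpha + i, beta + t - i
                p = r(al, be)
                cont = p - mu + a * (p * f.get((al + 1, be), 0.0)
                                     + (1 - p) * f.get((al, be + 1), 0.0))
                g[(al, be)] = max(0.0, cont)
            f = g
        p = r(alpha, beta)                   # the first pull is forced (tau > 0)
        return p - mu + a * (p * f.get((alpha + 1, beta), 0.0)
                             + (1 - p) * f.get((alpha, beta + 1), 0.0))

    lo, hi = 0.0, 1.0                        # the index lies in (0, 1)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if forced_value(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

# With horizon = 1 this returns (alpha + 1)/(alpha + beta + 2), in agreement with (12).
```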

[Figure: iso-DAI curves plotted in the (α, β) plane.]

FIG. 1. The dynamic allocation index for the multi-armed bandit problem.

The extent to which the iso-DAIs curve away from their asymptotes for small values of α and β increases with the discounting parameter a. This is another way of saying that ν(α, β) increases with a for any values of α and β. It reflects the fact that we may expect to find that optimal policies differ most from one-step look-ahead rules when what happens in the more distant future is comparable in importance with what happens in the immediate future, in other words when a is close to one.

8. A GENERAL METHOD FOR CALCULATING DAIS
The determination of DAIs for the scheduling problem and for the multi-armed bandit problem using the methods described in the previous two sections depends on certain special features of the bandit processes involved. A good general method when the problem does not simplify in some such fashion is to use the standard bandit processes as a calibration device.

Consider the simple family of alternative bandit processes {D, λ} formed by an arbitrary bandit process D together with a standard bandit process with the parameter λ. Optimal policies for {D, λ} are DAI policies, and therefore start by continuing D if ν(D) > λ, and by continuing the standard process if ν(D) < λ. If ν(D) = λ, and only if this is so, an optimal policy may start in either of these ways. Our calibration procedure consists of finding a value of λ such that an optimal policy for {D, λ} can start either by continuing D or by continuing the standard process. It then follows that ν(D) = λ.

As shown by Blackwell (1965), the maximum total expected reward for any Markov decision process satisfies a dynamic programming functional equation. For the family {D, λ} this equation may be written as

R({D, λ}, x) = max[λ/(1−a), R(x, 1) + a E_x R({D, λ}, y)].   (13)

Here R(x, 1) is the reward resulting from continuing D when it is in state x at time zero. E_x denotes the expectation with respect to the state y of D at time one, given that D is in state x at time zero, when control 1 is applied. It may be noted that the standard bandit process has just one state, so that the state of D also defines the state of {D, λ}. Also for this reason the state of {D, λ} does not change if the standard process is continued, and it follows that a deterministic stationary Markov policy must continue the standard process for all time after it has done so for one time unit. If this happens at time zero, the total expected reward is λ/(1−a), the first term on the right-hand side of equation (13).

Blackwell also shows that, provided the maximum total expected reward is bounded over all initial states, equations of the form (13) may be solved either by one of the policy improvement algorithms available for the purpose, or by starting with an approximate function, substituting in the right-hand side and thus obtaining a second approximation, and so on. From the DAI theorem it follows that, for any x, ν(D, x) is the unique value of λ for which the maximum on the right-hand side of equation (13) occurs both for the first and second terms in square brackets. Thus ν(D, x) may be determined by solving (13) for a succession of values of λ in the neighbourhood of ν(D, x).

At this point the reader may wonder what is the point of calculating ν(D, x) in this way, since for any family ℱ of alternative bandit processes the optimal policy and maximum total expected reward may always be calculated directly from the equation

R(ℱ, x) = max_{u ∈ Ω(x)} {R(ℱ, x, u) + a E_{x,u} R(ℱ, y)},   (14)

which is rather simpler than (13). Here R(ℱ, x, u) is the reward resulting from applying u to the family ℱ in state x at time zero. E_{x,u} denotes the expectation with respect to the state y of ℱ at time one, given that ℱ is in state x at time zero, when control u is applied. The answer is that the state-space for ℱ is the product of the state-spaces for its constituent bandit processes. In general this means that the states x for which (13) is solved are of lower dimensionality than those involved in (14), and this frequently brings an otherwise intractable problem within the bounds of computational feasibility, as illustrated by the example described in the next Section.
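The calibration procedure is easy to mechanize for a bandit process with finitely many states. The sketch below is our own code (the bracketing interval, iteration limits and tolerance are illustrative): it solves equation (13) by successive approximation for trial values of λ and locates ν(D, x) by bisection.

```python
import numpy as np

def dai_by_calibration(P, R, x, a=0.9, tol=1e-6):
    """Calibrate a finite-state bandit D against standard bandit processes.

    P is the transition matrix and R the reward vector under control 1.  For a
    given lambda, equation (13) is solved by repeated substitution; v(D, x) is
    the lambda at which continuing D and retiring to the standard process are
    equally good in state x."""
    P, R = np.asarray(P, dtype=float), np.asarray(R, dtype=float)

    def continue_value(lam):
        V = np.full(len(R), lam / (1 - a))       # start from "retire at once"
        for _ in range(5000):                    # successive approximation of (13)
            V_new = np.maximum(lam / (1 - a), R + a * P @ V)
            if np.max(np.abs(V_new - V)) < 1e-12:
                break
            V = V_new
        return R[x] + a * P[x] @ V               # value of continuing D once in state x

    lo, hi = float(R.min()), float(R.max())      # the index lies between these rewards
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        if continue_value(lam) > lam / (1 - a):  # D still preferred: index exceeds lam
            lo = lam
        else:
            hi = lam
    return 0.5 * (lo + hi)
```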

9. A MULTI-ARMED BANDIT WITH NORMALLY DISTRIBUTED REWARDS
Let D be a bandit process whose states are the set of N(ξ, m^{-1}) (i.e. normal with mean ξ and variance m^{-1}) distributions for a random variable θ. Continuation of D at process time t is defined as observing the (t+1)st member of a sequence of independently distributed N(θ, σ²) random variables X_1, X_2, ..., where σ² is known. Changes of state occur according to Bayes' theorem, so that (see Raiffa and Schlaifer, 1961)

ξ(t) = {m(0) ξ(0) + t σ^{-2} x̄_t} / {m(0) + t σ^{-2}}   and   m(t) = m(0) + t σ^{-2},

where ξ(t) and m(t) are the parameters which define the state of the bandit process at process time t and x̄_t = t^{-1}(X_1 + X_2 + ... + X_t). Thus, as for the ordinary multi-armed bandit, we have chosen a family of distributions for θ which is closed under sampling, a restriction which is virtually essential in the ensuing calculations. The reward at the (t+1)st observation if this occurs at time s is a^s X_{t+1}. As before, a^s R(x, 1) is the expected reward if D is continued in state x at time s. Thus if x = (ξ, m) then R(x, 1) = ξ.

A simple family of alternative bandit processes of this type forms a natural extension of the multi-armed bandit problem. A model of this type might well be appropriate in clinical trials if a number of treatments are to be compared whose object is to control some variable which is measured on a continuous scale.

It is convenient to include in the notation the dependence on σ of the various quantities which arise. Thus, for example, ν(ξ, m, σ) denotes the DAI for D in the state (ξ, m). It may be shown that

ν(ξ, m, σ) = ξ + σ ν(0, m, 1).   (15)

The proof is in two stages, proceeding roughly as follows. Firstly, if a constant is added to any set of numbers then the effect is to add the same constant to any weighted average of those numbers. It follows that ν(ξ, m, σ) = ξ + ν(0, m, σ). Secondly, if any set of numbers is multiplied by a constant then the effect is to multiply any weighted average by the same constant, so that ν(0, m, σ) = σ ν(0, m, 1).
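In code the change of state after one observation is a one-line conjugate update (a sketch; the names are ours):

```python
def update_normal(xi, m, y, sigma):
    """One observation y ~ N(theta, sigma^2) with theta ~ N(xi, 1/m):
    returns the updated parameters (xi(1), m(1)) given by the formulas above."""
    m_new = m + sigma ** -2
    xi_new = (m * xi + sigma ** -2 * y) / m_new
    return xi_new, m_new
```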

The most convenient method of calculating the DAI function in this case is to combine equation (15) with the procedure described in Section 8. Equation (13) becomes

R(λ, ξ, m, σ) = max[λ/(1−a), ∫ {y + a R(λ, (m ξ + σ^{-2} y)/(m + σ^{-2}), m + σ^{-2}, σ)} dG(y)].   (16)

Here y denotes the value of the next observation when D is in the state (ξ, m), and G denotes its distribution function, which may be shown to be N(ξ, m^{-1} + σ²).

Now in view of equation (15) we need only solve equation (16) for ξ = 0 and σ = 1. Also, arguing along similar lines to the first part of the proof of (15), we have R(λ, ξ, m, σ) = ξ/(1−a) + R(λ−ξ, 0, m, σ). Thus to determine the function ν(ξ, m, σ) we need to solve the equation

R(λ, 0, m, 1) = max{λ/(1−a), a ∫ R(λ − y/(m+1), 0, m+1, 1) dG(y)},   (17)

where G is N(0, m^{-1} + 1). This may be done by substituting a reasonable approximation to the function R(·, 0, M, 1), for a moderately large value of M, into the right-hand side of (17), setting m+1 = M, and hence finding an approximation to R(·, 0, M−1, 1), then substituting this in the right-hand side of (17), and so on.

It should be noted that these iterations involve functions of a single real variable. Any calculations based on equation (14) involve iterations with functions of 2n real variables, and are quite impracticable for n greater than 2. By choosing M to be sufficiently large, arbitrarily close approximations to the DAI function may be obtained. Moreover, a large value of M corresponds to a high probability that if D is in the state (ξ, M) then θ is close to ξ. This means that D is hardly distinguishable from a standard bandit process with the parameter ξ, leading to an obvious close approximation to the function R(·, 0, M, 1). Calculations along these lines have been carried out and will be reported separately.

The function ν(0, m, 1) turns out to have the general form shown in Fig. 2. This is because a bandit process in the state (0, m) with m large is very similar to a standard bandit process with the parameter zero, whilst the probability that θ is substantially greater than zero increases as m decreases.
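The iteration just described can be sketched as follows (our own code: the grid of λ values, the truncation level M and the use of Gauss-Hermite quadrature for the integral in (17) are all illustrative choices, not details given in the paper).

```python
import numpy as np

def normal_dai_values(m0, a=0.9, M=60, n_quad=40):
    """Approximate v(0, m, 1) for m = m0, ..., M-1 by iterating equation (17)."""
    lam = np.linspace(-4.0, 4.0, 801)               # grid of lambda values
    R = np.maximum(lam, 0.0) / (1.0 - a)            # R(., 0, M, 1): D is nearly a standard process 0
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_quad)
    weights = weights / np.sqrt(2 * np.pi)          # quadrature weights for a N(0, 1) variable
    v = {}
    for m in range(M - 1, m0 - 1, -1):
        sd = np.sqrt(1.0 / m + 1.0)                 # G is N(0, 1/m + 1)
        cont = np.zeros_like(lam)
        for yk, wk in zip(sd * nodes, weights):
            # R(lambda - y/(m+1), 0, m+1, 1), interpolated on the lambda grid
            cont += wk * np.interp(lam - yk / (m + 1), lam, R)
        R = np.maximum(lam / (1.0 - a), a * cont)   # equation (17) for this m
        # v(0, m, 1): the lambda at which the two branches of (17) first meet
        v[m] = lam[np.argmax(lam / (1.0 - a) >= a * cont)]
    return v
```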

[Figure: ν(0, m, 1) plotted against m.]

FIG. 2. The dynamic allocation index for the multi-armed bandit with normally distributed rewards.

Robbins and Siegmund (1974) have proposed a heuristic allocation rule for sequential probability ratio tests between two treatments, which is designed to cut down the number of tests with the inferior treatment. It would be interesting to compare the characteristics of their rule, which is designed for the case of normal distributions with known variance, with a DAI policy using the function ν(ξ, m, σ).

10. A CLASS OF SEARCH PROBLEMS
As a first example, consider the modification of the multi-armed bandit problem for which the total reward arises entirely from the first successful pull, and is equal to a^s if this is the sth pull to be made. The Markov decision process formed in this way might be a suitable model for a situation in which a number of different populations are being searched with the aim of finding as soon as possible an individual with some rare characteristic, at which point the search stops. However, at first sight the problem is not one which can be modelled by a simple family of alternative bandit processes, since once a success has been obtained on a pull of one arm no further non-zero rewards may be obtained from any of the arms. This difficulty may be overcome as follows.

Consider the bandit process described in Section 7 with the following modifications. If X_{t+1} = 1 a zero reward accrues. If X_{t+1} = 0 a zero reward accrues if at least one of X_1, X_2, ..., X_t is equal to one; otherwise a reward equal to −a^s accrues, where s is the time at which X_{t+1} is observed. The state-space may be defined by adding to the state-space for an arm of a multi-armed bandit a state C, indicating that a success has occurred.

For a bandit process of this type the DAI is negative until the first success occurs, and thereafter equal to zero. Consequently an optimal policy for a simple family ℱ of alternative bandit processes of this type will always select for continuation a bandit process in state C if there is one available. If none of the bandit processes in ℱ is initially in state C and the first success occurs at the sth trial, then all subsequent rewards are equal to zero and the total reward is (a^s − 1)/(1−a). An optimal policy for ℱ is therefore one which maximizes the expectation of a^s, and is an optimal policy for our modified multi-armed bandit problem. Thus the optimal policies for our search problem are those given by the DAI theorem for the corresponding ℱ.

It may be shown, and indeed it is fairly obvious, that ν(x(1)) < ν(x(0)) unless x(1) = C. It follows, using an argument similar to those used in Section 5 and assuming that x(0) is a beta distribution as in Section 7, that ν_τ(α, β) = ν(α, β) for the stopping time τ which equals 1 if X_1 = 0 and ∞ if X_1 = 1. Thus

ν(α, β) = −P{X_1 = 0 | x(0) = (α, β)} / [1 + a(1−a)^{-1} P{X_1 = 1 | x(0) = (α, β)}].

This is a strictly increasing function of P{X_1 = 1 | x(0) = (α, β)}. It is therefore optimal to use this probability, which is equal to (α+1)/(α+β+2), as a DAI. This means that a one-step look-ahead policy is optimal for our search problem.

The bandit process described in Section 9 may also be modified so as to model a search problem. Suppose that if X_{t+1} belongs to some measurable subset B of the real line then a zero reward accrues; and if X_{t+1} ∉ B then a zero reward accrues if at least one of X_1, X_2, ..., X_t belongs to B, and otherwise a reward of −a^s accrues, where s is the time at which X_{t+1} is observed. Again we add a state C, indicating that an observation belonging to B has been made, to the state space for the multi-armed bandit with normally distributed rewards. This time it turns out that a one-step look-ahead policy is not in general optimal.

A simple family of alternative bandit processes of this type might be a suitable model if the aim is to find as soon as possible an individual belonging to B from any one of a number of populations. The DAI function may be calculated as described in Section 8. Some results for the case B = [0, ∞) are described by Jones (1970).

Clearly a range of different search problems (and corresponding multi-armed bandit problems) may be modelled by considering distributions other than 0-1 and normal for the observations X_1, X_2, .... For example, Gittins and Jones (1974b) have prepared a set of tables based on negative exponential distributions with an added probability atom at zero. These are designed for use in the screening of chemicals in new-product chemical research. Glazebrook (1978b) considers a multi-armed bandit problem in which several different outcomes, rather than just two, are possible at each trial.
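For the Bernoulli search problem the index just derived is easily evaluated (a sketch using the formula above; the function name is ours):

```python
def search_dai(alpha, beta, a):
    """nu(alpha, beta) = -P(X1 = 0) / {1 + a (1 - a)**-1 P(X1 = 1)}, which is
    increasing in P(X1 = 1) = (alpha + 1)/(alpha + beta + 2); ranking arms by
    that probability therefore gives the same (one-step look-ahead) policy."""
    p = (alpha + 1) / (alpha + beta + 2)
    return -(1 - p) / (1 + a * p / (1 - a))
```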

11. POSSIBLE FURTHER DEVELOPMENTS
The examples which have been described show that there is considerable scope for applying the notions of forwards induction policies and dynamic allocation indices, using the theorems of Section 3. However, at this stage the story is incomplete. Later instalments may touch on the following points.

(i) There may well be types of Markov decision process other than families of alternative bandit processes for which forwards induction policies are optimal. A simple characterization of the class of Markov decision processes with this property would be useful, since in many cases forwards induction policies are relatively easy to determine.

One example of a Markov decision process for which forwards induction policies are known to be optimal, and which is not a family of alternative bandit processes, is described by Black (1965). This is a search problem for which an object is hidden in one of a number of boxes. For each box there is a detection probability on searching, if it contains the object, and a cost. The aim is to find the object at minimum cost.

(ii) At present one is much more aware of the above-mentioned scope for practical applications than of such applications actually being made. We hope this situation will change.

ACKNOWLEDGEMENTS
I am very grateful to Mr A. G. Baker of Unilever Research, Port Sunlight, for his encouragement over several years, and for naming the dynamic allocation index. I should also like to thank Drs K. D. Glazebrook, D. M. Jones and P. Nash for many enjoyable and stimulating discussions, and the referees, whose comments on earlier drafts have led to a much improved paper.

REFERENCES
BELLMAN, R. E. (1956). A problem in the sequential design of experiments. Sankhya A, 16, 221-229.
-- (1957). Dynamic Programming. Princeton: Princeton University Press.
BLACK, W. L. (1965). Discrete sequential search. Information and Control, 8, 159-162.
BLACKWELL, D. (1965). Discounted dynamic programming. Ann. Math. Statist., 36, 226-235.
CHOW, Y. S., ROBBINS, H. and SIEGMUND, D. (1971). Great Expectations, the Theory of Optimal Stopping. New York: Houghton Mifflin.
DAVIES, D. G. S. (1970). Research planning diagrams. R and D Management, 1, 22-29.
GITTINS, J. C. (1973). How many eggs in a basket? R and D Management, 3, 73-81.
-- (1975). The two-armed bandit problem: variations on a conjecture by H. Chernoff. Sankhya A, 37, 287-291.
-- (1979). Two theorems on bandit processes. Submitted for publication.
GITTINS, J. C. and GLAZEBROOK, K. D. (1977). On Bayesian models in stochastic scheduling. J. Appl. Prob., 14, 556-565.
GITTINS, J. C. and JONES, D. M. (1974a). A dynamic allocation index for the sequential design of experiments. In Progress in Statistics (J. Gani, ed.), pp. 241-266. Amsterdam: North-Holland.
-- (1974b). A Dynamic Allocation Index for New-product Chemical Research. Cambridge University Engineering Dept CUED/A-Mgt Stud/TR13.
-- (1979). A dynamic allocation index for the discounted multi-armed bandit problem. Biometrika (to appear).
GITTINS, J. C. and NASH, P. (1977). Scheduling, queues, and dynamic allocation indices. Proc. EMS, Prague 1974, pp. 191-202. Prague: Czechoslovak Academy of Sciences.
GLAZEBROOK, K. D. (1976a). A profitability index for alternative research projects. Omega, 4, 79-83.
-- (1976b). Stochastic scheduling with order constraints. Int. J. Sys. Sci., 7, 657-666.
-- (1978a). On a class of non-Markov decision processes. J. Appl. Prob., 15, 689-698.
-- (1978b). On the optimal allocation of two or more treatments in a controlled clinical trial. Biometrika, 65, 335-340.
JONES, D. M. (1970). A sequential method for industrial chemical research. M.Sc. Thesis, University of Wales, Aberystwyth.
NASH, P. (1973). Optimal allocation of resources between research projects. Ph.D. Thesis, Cambridge University.
NASH, P. and GITTINS, J. C. (1977). A Hamiltonian approach to optimal stochastic resource allocation. Adv. Appl. Prob., 9, 55-68.
RAIFFA, H. and SCHLAIFER, R. (1961). Applied Statistical Decision Theory. Boston: Harvard Business School.
ROBBINS, H. and SIEGMUND, D. O. (1974). Sequential tests involving two populations. J. Amer. Statist. Ass., 69, 132-139.
RODMAN, L. (1978). On the many-armed bandit problem. Ann. Prob., 6, 491-498.
ROSS, S. M. (1970). Applied Probability Models with Optimisation Applications. San Francisco: Holden-Day.
SEVCIK, K. C. (1972). The use of service-time distributions in scheduling. Technical Report CSRG-14, University of Toronto.
SIMONOVITS, A. (1973). Direct comparison of different priority queueing disciplines. Studia Scientiarum Mathematicarum Hungarica, 8, 225-243.
WAHRENBERGER, D. L., ANTLE, C. E. and KLIMKO, L. A. (1977). Bayesian rules for the two-armed bandit problem. Biometrika, 64, 172-174.

DISCUSSION OF DR GITTINS' PAPER
Professor J. A. BATHER (University of Sussex): I shall restrict my comments to the multi-armed bandit problem described in Sections 1 and 7 of Dr Gittins' paper. He remarks that "its chief practical significance is in the context of clinical trials". This is true, but I would like to spend a few minutes considering why, after many years of study, there has been so little effect on the conduct of sequential medical trials.

In the notation of Section 7, θ_1, θ_2, ..., θ_n are the unknown probabilities of success in n different sequences of Bernoulli trials or, alternatively, we can think of a single sequence of patients and n possible treatments for any one of them. The problem is to find a rule for allocating a treatment to each patient so that the number of successful treatments is maximized, in some sense. Suppose that, after a total of t trials, we have observed r_i successes in m_i trials with treatment i. The proportion of successes achieved so far is r/t, where r = Σ r_i and t = Σ m_i, summing over i from 1 to n. We need a rule which tells us which treatment should be given to the next patient in the sequence.

The optimization problem is not well defined without further assumptions, which Dr Gittins expresses in the choice of a prior distribution and a discount factor a < 1. Even then, there are genuine difficulties: his result that the optimal policy can always be expressed in terms of dynamic allocation indices is a very impressive reduction of the problem, but the procedure described in Section 7 is still very complicated (see also Fabius and Van Zwet, 1970). I would like to ask Dr Gittins about the sensitivity of the optimal policy to changes in the prior distribution and in the discount factor, particularly as a tends to 1, which is the most important special case. It seems to me that we might do well to consider something less than exact optimality; I think the best may be the enemy of the good.

I will conclude with a suggestion which I hope is constructive. Consider a family of sequential decision procedures depending on a randomized allocation index. The randomization is useful even though it is not a direct consequence of any particular optimality criterion. Let {λ_m} be a sequence of positive numbers such that λ_m → 0 as m → ∞ and let X_{it}, i = 1, 2, ..., n, t = 1, 2, ..., be i.i.d. non-negative random variables with a distribution which is unbounded. Given the record of successes and failures in the first t trials, the next treatment is chosen according to

max_i {r_i/m_i + λ_{m_i} X_{it}}.

In other words, the next treatment must be one of the current "favourites" according to an index made up of the observed proportion of successes and a positive bias. The idea is that the random terms will tend to favour those treatments which have so far had relatively few trials. Any such decision procedure is asymptotically optimal in the following sense. Suppose that θ_1 > θ_2 > θ_3 > ... > θ_n. Then the random variables r_1(t) and m_1(t) have the property that, with probability 1, m_1(t)/t → 1 and Σ r_i(t)/t → θ_1 as t → ∞, so the observed proportion of successes in all the trials converges to max(θ_1, θ_2, ..., θ_n). This result is a consequence of the strong law of large numbers. As Robbins pointed out (1952), it is easy to construct decision procedures which are asymptotically optimal, but not all of them are "good". I claim that some of the randomized allocation procedures obtained by defining λ_m = 1/√m perform well for any values of the unknown probabilities and over short as well as long sequences of trials. However, the evidence for this is not by any means complete.

Dr Gittins has certainly provided us with plenty of food for thought and I hope my introduction of a rival index will not confuse matters nor delay still further the time when the theory of sequential decisions is translated into practice. I have much pleasure in proposing a vote of thanks.
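Professor Bather's randomized allocation index is simple enough to simulate directly. The following is a minimal sketch, not part of the discussion itself: the exponential perturbation X and the choice λ(m) = 1/√m are illustrative assumptions, and untried treatments are given priority so that the ratio r_i/m_i is always defined.

```python
import random

def bather_index_rule(r, m, lam=lambda m_i: 1.0 / (m_i ** 0.5)):
    """One step of the randomized allocation index: pick the treatment
    maximizing r_i/m_i + lam(m_i) * X, with X i.i.d., non-negative, unbounded.
    r[i], m[i] are successes and trials so far on treatment i."""
    best, best_index = None, -float("inf")
    for i in range(len(m)):
        if m[i] == 0:                      # untried treatments first (assumption)
            return i
        x = random.expovariate(1.0)        # unbounded non-negative perturbation
        index = r[i] / m[i] + lam(m[i]) * x
        if index > best_index:
            best, best_index = i, index
    return best

# toy simulation with three hypothetical success probabilities
theta = [0.6, 0.5, 0.3]
r, m = [0, 0, 0], [0, 0, 0]
for t in range(1000):
    i = bather_index_rule(r, m)
    m[i] += 1
    r[i] += int(random.random() < theta[i])
print(m, [ri / mi for ri, mi in zip(r, m)])
```

On a typical run most trials accumulate on the best treatment, illustrating the asymptotic behaviour described above.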

Professor P. WHITTLE (Statistical Laboratory, Cambridge): We should recognize the magnitude of Dr Gittins' achievement. He has taken a classic and difficult problem, that of the multi-armed bandit, and essentially solved it by reducing it to the case of comparison of a single arm with a standard arm. In this paper he brings a number of further insights. Giving words their everyday rather than their technical usage, I would say that my admiration for this piece of work is unbounded, meaning, of course, very great.

Despite the fact that Dr Gittins proved his basic results some seven years ago, the magnitude of his advance has not been generally recognized, and I hope that one result of tonight's meeting will be that the strength of his contribution, its nature and its significance will be apparent to all.

As I said, the problem is a classic one; it was formulated during the war, and efforts to solve it so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage. In the event, it seems to have landed on Cardiff Arms Park. And there is justice now, for if a Welsh Rugby pack scrumming down is not a multi-armed bandit, then what is? And the name of DAI seems then also well chosen.

But what is surprising is the hedonistic origin of the DAI concept, and of the forward induction principle. To someone brought up on the conventional backwards induction principle, like myself, the notion of a terminal reward or a terminal cost is an ingrained one, expressing as it does the consequences in the hereafter of one's actions in the present. But DAI has no consciousness of the hereafter; he behaves literally like there was no tomorrow, grabs what he can while it lasts, and then opts out. It is still somewhat unclear to me how it is that an optimal strategy can ignore the future to this degree; it must be, as Dr Gittins says, because the bandit formulation allows one to postpone certain courses of action without prejudice.

Dr Gittins has given the interpretation of Section 8 in other papers (i.e. the calculation of DAI by calibration against a standard arm) but the interpretation of Sections 2 and 3 is new to me. This is the characterization of DAI as the maximal reward rate up to some stopping time. This is reminiscent of the characterization of average cost optimality by the maximization of reward rate up to a stopping time defined by recurrence to the initial state. However, again there is a contrast: this latter criterion shows the awareness of moral principles, of which DAI is so lamentably negligent, in that it observes the precept "leave things as you found them".

I really have no contribution of substance to make. Obviously there are many questions one could ask, and generalizations one could suggest, but it seems most appropriate at the moment to congratulate Dr Gittins warmly on having developed a powerful optimization technique of great practical and conceptual significance.

[A further comment added in writing after the meeting]: An index result which I might mention concerns sequential choice of experiment (types of experiment being indexed by u) for optimal discrimination between two simple hypotheses. The criterion for choice of u given in Theorem 4 of Whittle (1965) can be more simply expressed: choose the u for which γ_u P_1 + δ_u P_2 is minimal. Here P_1 and P_2 are the probabilities of the two hypotheses conditional on current information, and γ_u, δ_u are the quantities defined in the paper quoted; essentially ratios of cost of experiment to Kullback-Leibler number for experiment u.
The rule is optimal to within a no-overshoot approximation; I should be interested to know if it could also be derived by the methods of Dr Gittins' paper.

The vote of thanks was passed by acclamation.

Mr D. G. S. DAVIES: I should like to speak from the standpoint of a research planning man rather than a statistician. I should also like to congratulate Dr Gittins and to draw attention to two features of his work which I think are important.

First, the idea of a forwards induction policy is important. I know that many decision problems can be solved - perhaps all of them - by a backwards induction policy but, as Dr Gittins has pointed out, this is often prohibitively difficult to calculate. In the world of research it is extremely difficult to get research workers to come up with data, and particularly to place any credence in long and involved computer calculations based upon the data which they have produced. If we can develop figures of merit and indices which are soundly based, and which can be used for allocation of effort in a forwards sequential manner, and if it is simply a matter of looking these things up in the tables, provided that the model is appropriate, I am sure that this is something which the research worker at the bench would be prepared to contemplate. However, if it is a matter of doing a large-scale modelling exercise on his project, then sending it away for computer analysis, he is much more reluctant about it - I speak from bitter experience.

Secondly, Dr Gittins has emphasized the distinction between the DAI and the probability of success for the different routes. If we take the very simple model of the bandit, basically all we do is to carry out trials. If they succeed, that is fine; if they do not succeed, we do another trial. Dr Gittins has emphasized that we are gaining information as we do the trials, which gives us a potential way of re-evaluating which route to take, based on the way the trials are done. In the rather restricted range of applications in research where the DAI can be applied as it stands, information is gained simply by gaining an enhanced view of the frequency of occurrence of successes in any chosen route. However, this is only an example of a more general phenomenon, that in research generally there is always a conflict between going for immediate exploitation and going for information. Very often either we can do a trial straightaway, in which case we may succeed immediately, or we can do some background work instead which we hope will give a greater chance of success when the trial finally is made. This is the conflict - also mentioned by Professor Whittle - and there is a contribution here in the DAI in which some of these considerations are incorporated into the index itself.

One caveat is that this is a very limited model, with limited application in research and development. In research and development we like to projectize our work - by "project", I mean a piece of work such that we can tell when it is finished. This particular method is applicable to a lifetime's work where we are continually doing trials - in the expectation, it is true, that they will come to fruition. But, as Dr Gittins said, it is a method with an unlimited time horizon. We like to be able to set finite time horizons in research and development. We hope that there is a learning curve superimposed on the work that is going on, so that we are not simply pulling the arm of the bandit all the time but also modifying that bandit as it goes along. I feel sure that this concept can be incorporated, but at present I am not absolutely certain how to do it.
I should like to hope that we can go further and obtain more indices of this kind that are applicable in a forward induction sense - let us not worry too much about them being optimal, because that does not matter as long as they are useful. There are many precedents for this. For example, if we are scheduling a critical path network under resource constraints, this cannot be done optimally, because if any non-trivial plan is attempted we are up against completely prohibitive combinatorial problems. We can, however, still develop useful heuristic rules which will take us forward in a powerful way.

Professor B. FRISTEDT (University of Liverpool): A big assumption is that the discount sequence, denoted by (a_t: t = 0, 1, ...) by Dr Gittins, is geometric. That one wants there to be stationary policies that are optimal is not the only reason for this assumption. As Dr Gittins (1975) has indicated, without some such assumption the principle is not valid that multi-armed bandit problems may be solved by comparing each bandit to a standard bandit.

It is not clear that arbitrarily good approximations of R(λ, 0, 1, 1) can be obtained via equation (17). Conceivably, if M is chosen so large that R(·, 0, ∞, 1) is a good approximation of R(·, 0, M, 1), then the small initial error may grow through M - 1 iterations into a substantial error.

Suppose, in Section 2, one defines Σ a^t R(x(t), u(t)) to equal -∞ when, according to the usual conventions, it does not exist, even as +∞ or -∞. Does Blackwell's Theorem then hold with no assumptions, other than measurability, on R? I believe it does.

Equation (11) does not depend on θ having a beta distribution, since an arbitrary state that may occur can, for any initial distribution with or without a density, be expressed in terms of the numbers α (successes) and β (failures).

In many situations I think that the only good alternative to a Bayesian approach is a minimax approach involving a risk function. See Fabius and van Zwet (1970). In case one feels compelled to avoid a Bayesian outlook I think it is unrealistic to do so by regarding a first certain number of trials as merely experimental and the remaining trials as having no aspect of the experimental in them. Real-world problems do arise in which experiment, decisions, and acts based on those decisions are inherently interwoven.
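Professor Fristedt's observation that, whatever the prior on the success probability θ, the state of a Bernoulli arm is summarized by the counts α (successes) and β (failures) is easy to check numerically. A minimal sketch follows, assuming a hypothetical discretised prior; it is an illustration, not part of the discussion.

```python
def posterior(prior_weights, thetas, alpha, beta):
    """Posterior over a discretised prior for a Bernoulli success probability.

    The data enter only through alpha (successes) and beta (failures):
    the grid `thetas` and weights are an arbitrary, non-beta prior."""
    unnorm = [w * th ** alpha * (1 - th) ** beta
              for w, th in zip(prior_weights, thetas)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# any two success/failure sequences with the same counts give the same state
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = [0.2] * 5                      # hypothetical uniform grid prior
print(posterior(prior, thetas, alpha=3, beta=2))
```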

Mr A. G. BAKER (Unilever Research Laboratory, Wirral): Arising from the discussions in 1966 at the OR Conference in Edinburgh, may I add my thanks and congratulations to Dr Gittins for the progress made by him and his colleagues since then.

I should like, though, to bring out the implications of this work, as I see them, to a practising statistician - which is slightly related to Mr Davies' comments. There are two ways in which this work may be used: first, in the formal mathematical sense. For that, we would always be dependent on the theory being developed. Secondly, there are other aspects of this work which a practising statistician can already use. He can use the arguments, and the mental approach suggested by Dr Gittins' work, in his debate with research colleagues on how to tackle a programme of work. This is important; the fact that there are theoretical justifications for looking at how to proceed from the approach of the theorem on DAI, in particular the concept that it sometimes pays to buy information. Mr Davies referred to this as "background work", which is not a term I would use because it really is buying information, whereas background work is more a matter of basic research.

Those two points are the ones I should like to stress. Dr Gittins' work has given the practising statistician a basis for arguing on buying information, and the importance of doing so and, secondly, the importance of proceeding by using the DAI theorem.

Dr F. P. KELLY (University of Cambridge): Today's paper reviews an extremely important advance in the theory of Markov decision processes whose ramifications are widespread and still not fully explored. To illustrate this I shall discuss two relatively old problems in the field where the DAI theorem can be used to extend the best known results, recently obtained by Kadane and Simon (1977). The first is the search problem referred to by Dr Gittins in the final section of his paper, which can be described as follows. An object is hidden in one of n boxes. Initially the probability that the object is in box i is P(i). The jth look in box i costs c(i,j) and detects the object, given that it is in the box, with probability d(i,j). A policy is an infinite sequence b_1 b_2 ..., where b_t is the box to be looked in at time t if the object has not been found before then, and the aim is to minimize the expected cost incurred until the object is found. I shall deal first with the case c(i,j) = 1, where the aim is to minimize the expected time till the object is found. Consider the related discounted decision process in which no costs are incurred, a reward a^t is obtained if the object is found at time t, and the searcher is not told whether or not he has yet found the object. A policy is again an infinite sequence b_1 b_2 .... If this policy requires that at time t box i be looked in for the jth time, then the expected reward at time t is a^t R(i,j), where

R(i,j) = P(i) {∏_{k=1}^{j-1} (1 - d(i,k))} d(i,j),

the unconditional probability that the object is found on the jth look in box i. The discounted decision process is thus a family of alternative bandit processes. Let T be the time at which the object is found. Provided ET is finite,

E(a^T) = 1 - (1 - a) ET + o(1 - a),

and the policy minimizing ET can be deduced from the optimal policy for the discounted decision process. The original problem in which the c(i,j) are not all equal can also be recast as a family of alternative bandit processes provided Σ_j c(i,j) diverges for each i (summing over j from 1 to ∞); we just let c(i,j) be the time it takes to look in box i for the jth time. The conclusion is that if

v(i) = max_{t>0} {Σ R(i,j) / Σ c(i,j)}   (where the summations are over j from 1 to t)

then the optimal policy for the original problem begins by looking in that box i for which v(i) is a maximum.

The second problem I shall discuss is the gold-mining problem first formulated by Bellman (1957). A man owns n gold mines and a delicate gold-mining machine. Each day the man must assign the machine to one of his mines. When the machine is assigned to mine i for the jth time there is a probability p(i,j) that it extracts an amount of gold r(i,j) and remains in working order, and a probability 1 - p(i,j) that it extracts no gold and breaks down irreparably. The man's aim is to maximize the expected amount of gold extracted before the machine breaks down. Let s(i,j) = -log p(i,j), and interpret s(i,j) as the "time" it takes to work mine i on the jth occasion the machine is assigned to it. With respect to this standard time scale the machine remains in working order for an exponentially distributed period independent of the policy adopted, provided ∏_{j=1}^{∞} p(i,j) = 0 for each i. The man's problem thus corresponds to the related decision process in which the machine works for ever, but an amount of gold r(i,j) extracted at standard time s is worth e^{-s} r(i,j). This decision process is a family of alternative bandit processes, and so the optimal policy begins by working that mine i for which

V(i) = sup_{t>0} [Σ_{j=1}^{t} r(i,j) ∏_{k=1}^{j-1} p(i,k)] / [1 - ∏_{j=1}^{t} p(i,j)]

is a maximum.

The results just described have been established using a different method by Kadane and Simon (1977), who also consider the problems under arbitrary precedence constraints. Observe though that both problems are essentially deterministic: the optimal policy does not have to adapt to information becoming available with time. The advantage of formulating the problems as families of alternative bandit processes is that this allows the results to be generalized to the case where the characteristics of box or mine i are not certain but have probability distributions which alter as box or mine i is investigated. As a simple example suppose that in the search problem the jth look in box i is informative with probability D(i,j) and uninformative otherwise. An informative look determines whether or not the box contains the object, and an uninformative look yields no indication either way. Put more precisely, this is equivalent to the assumption that the detection probabilities are independent Bernoulli random variables with E{d(i,j)} = D(i,j), and that d(i,j) becomes known after the jth look in box i. If

v(i) = P(i) sup_{t>0} [Σ_{j=1}^{t} {∏_{k=1}^{j-1} (1 - D(i,k))} D(i,j)] / [Σ_{j=1}^{t} {∏_{k=1}^{j-1} (1 - D(i,k))} c(i,j)]

then the optimal policy begins by looking in that box i for which v(i) is a maximum.

Dr D. M. ROBERTS (Ministry of Defence): My first comment on Dr Gittins' paper concerns the practical significance of the concept of a DAI. One area in which I have recently been looking at this is new product chemical research. Specifically, one is confronted with a number of research projects all competing for a limited amount of effort. The way in which each project is characterized tends to be complex. For in order to be realistic, account must be taken of such factors as the way in which the effectiveness of research effort varies with time, the chances of success as a function of useful work done, as well as various financial parameters. Thus a casual look at the possibilities gives little indication of where effort should be applied and at what levels.

However, it is possible to write a computer program which does two things. First, for any planned allocation, it shows the expected profitability of such an allocation. And, second, for each project, it calculates the DAI. A comparison of indices suggests ways in which effort might profitably be reallocated between projects, either as a modification of the initial allocation or, since the indices are functions of time, at an appropriate time within the forecast period. I have just completed the development of such a program and runs carried out so far tend to indicate that, in spite of the complexity of detail surrounding each project (which means that the DAI Theorem is not directly applicable), the DAI provides us with an effective single measure for comparing projects.

My second observation on Dr Gittins' paper concerns his reference to the search problem where an object is hidden with known occupation probability distribution in one of a number of boxes. It has been shown that, to minimize the expected cost of the search, one should look in the box where the product of two terms - the probability of the object being there and the detection probability - divided by the cost is greatest. Although this principle is generally demonstrated using a dynamic programming approach, the optimal strategy is actually a forwards induction policy, and it is interesting to note that Ross (1970) is able to derive its form solely by considering two-step look-ahead policies.
Inevitably therefore, one is left wondering whether the Forwards Induction Theorem can be extended to cover this situation.

Dr K. D. GLAZEBROOK (Newcastle University): I should like to put on record my thanks to Dr Gittins, not only for his interesting paper but also for being an immensely helpful and stimulating supervisor and colleague. I feel, too, that after some years of familiarity with these results one is inclined to be blasé about them and forget how demanding these problems have been to solve. Perhaps I could make just three points:

(i) As Dr Gittins indicated, the policy which maximizes the total expected reward earned by a family of N alternative bandit processes during [0, M], M fixed, is not in general a forwards induction policy. Suppose, though, that we consider the problem of maximizing the expected reward earned during [0, τ], τ an integer-valued stopping time, and ask for what τ is there a forwards induction policy which is optimal? Two important examples where this is the case are:

τ = inf_{t>0} {t: x_i(t) ∈ C_i, i = 1, ..., N}   (1)

and

τ = inf_{t>0} {t: x_i(t) ∈ C_i for some i},   (2)

where x_i(t) is the state of bandit process i at time t and C_i is some subset of the state space of bandit process i. The scheduling problem discussed by Dr Gittins in Section 6 is an example of (1) and the search problem in Section 10 an example of (2).

(ii) We might want to make stopping part of our decision structure; this could well be so in problems relating to research planning and clinical trials. We could model this by having a choice of 2N actions at each decision-epoch instead of N as previously. These actions would be "continue bandit process i", i = 1, ..., N, and "stop and decide in favour of bandit process i", i = 1, ..., N. I have obtained some optimal policies for such problems as these (Glazebrook, 1979).

(iii) Many of the continuous-time analogues of the discrete-time decision processes discussed here will be controlled jump processes with the discounted cost criterion. Suppose that such a process is in state i at time 0, is subject to control u until its first transition, and is subject to an optimal control (if any such exists) thereafter. Let R[i, u] be the expected return from such a policy and let V_α be the optimal return function under discount rate α > 0. Under appropriate conditions we have that

V_α(i) = inf_u {R[i, u]},   (3)

the infimum being over all admissible controls u. For a wide range of decision problems in research planning, stochastic scheduling and queueing (and indeed many continuous-time analogues of the problems discussed today), the optimal control problem stated in (3) looks very similar to a problem solved by Nash and Gittins (1977). Indeed so much so that I feel it may well be worthwhile defining a class of controlled jump processes which reflect the rather strange property that they may be solved by the techniques discussed there.

Dr M. A. H. DEMPSTER (Balliol College, University of Oxford): I should like to make a few brief remarks concerning an important area of practical application - scheduling problems in a stochastic environment. As pointed out elsewhere by Dr Gittins and his associates, stochastic scheduling problems arise in computer scheduling, reliability and R and D management as well as in factory scheduling. However, it is in the latter area where my own interest and these remarks are centred. (I am currently involved in a collaborative effort in this field with Fisher, Lageweg, J. K. Lenstra and Rinnooy Kan, cf. Dempster, 1979.) In manufacturing job shops, a three-level hierarchy of planning decisions may be outlined in terms of increasingly finer time units.
The first two levels can currently be handled by known deterministic linear programming and combinatorial permutation procedures, but the third - concerning the sequencing of jobs through a single machine centre - is directly related to Dr Gittins' paper. At this level practical production scheduling involves a stochastic m-machine problem whose natural setting is in continuous time. Very recently Dr Gittins and his co-workers have obtained results for discrete time problems which show that DAI policies are optimal for the m-machine scheduling problem with a fixed queue of jobs j whose processing times t_j are independent random variables. The discrete distributions F_j involved are either exponential, i.e. constant completion rate (cf. failure rate in reliability theory); monotone completion rate - either increasing or decreasing - and identical in the sense that they are all conditional distributions of the same distribution after arbitrary amounts of processing (Weber and Nash, 1978; Weber, 1979); or non-overlapping completion rate in the sense that the original monotone ordering of job processing time completion rates f_j(0)/{1 - F_j(0)} is not changed by subsequent processing (Gittins, 1979). It does not appear to be an entirely trivial technical matter to extend these results to continuous time. Although the optimal m-machine sequencing policies to minimize respectively expected makespan and expected flowtime - essentially longest and shortest expected processing time first (LEPT and SEPT) - are DAI policies, the methods used appear particular. It would be interesting to investigate how the general approach of Dr Gittins' paper could be utilized to obtain continuous time results.

In this regard I should like to call attention to the work of Dr Weiss, who is currently visiting Birmingham University. Following recent work of Bruno and Downey and Fredrickson, he has shown with Pinedo (1978) that suitable variants of the LEPT and SEPT DAI policies above are optimal for the problem of sequencing jobs with exponential processing times on m machines of differing speeds. From the point of view of practical operations research this is an extremely important result which we might hope to obtain more generally for continuous time stochastic scheduling problems using bandit process theory.

There has recently been a considerable, deep and detailed combinatorial study of deterministic scheduling problems (in continuous time) as to their computational complexity (see Graham et al., 1977). In layman's terms the simple question addressed is whether or not it is possible to find a computational algorithm for a deterministic scheduling problem that is polynomial in the problem parameters (easy) or whether the parameter dependency must be effectively exponential (NP-hard). For even the two-machine problem of minimizing makespan with no pre-emption of running jobs, the problem is known to be NP-hard in the deterministic case. On the other hand, a LEPT (DAI) policy is often used to sequence jobs in a practical m-machine problem - such as for a bank of lathes in a machine shop. The current theoretical operations research view, based on deterministic analysis, would say that such a policy is a suboptimal heuristic (cf. Graham et al.). The interesting property of the Weiss-Pinedo result is that this policy is indeed optimal as soon as specific random processing times are allowed.
If extensions of these results could be found for different distributions (as in the discrete time case) and in more complex scheduling problems involving release and due dates (which are closer to those in the real world), we would have the extremely important result that heuristics which have been derived from practical experience can be proved optimal when we have the right model - namely one involving random variables.

Finally, going considerably further, a problem arising in the understanding of real job shops involves the analysis of a network of m-machine problems. There the work of Dr Kelly and his associates at Cambridge on networks of queues, and related work in the U.S. and Europe, will hopefully soon be relevant to stochastic production scheduling. Each node of the appropriate network would be not simply a single server but rather a scheduled m-machine system, so that input and output processes would be considerably more complicated than we have so far seen. Nevertheless, there is some hope that the elegant theory of Walrand and Varaiya (1978), developed for queueing networks, could be applied more generally. This is a big programme, but I must emphasize that there is much of practical importance in it for operations research - both regarding computer networks and for factory scheduling.
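The LEPT and SEPT policies to which Dr Dempster refers are themselves simple index rules: start jobs in decreasing (LEPT) or increasing (SEPT) order of expected processing time. A minimal simulation sketch follows, assuming exponential processing times with hypothetical rates and identical machines; it is an illustration only, not the Weiss-Pinedo setting of machines with differing speeds.

```python
import heapq
import random

def schedule(rates, m, rule="SEPT"):
    """List-schedule jobs with exponential processing times on m machines.

    SEPT starts jobs in increasing order of expected processing time 1/rate,
    LEPT in decreasing order; completion times below are one random draw."""
    order = sorted(range(len(rates)), key=lambda j: 1.0 / rates[j],
                   reverse=(rule == "LEPT"))
    free_at = [0.0] * m                        # when each machine next falls idle
    heapq.heapify(free_at)
    completions = []
    for j in order:
        start = heapq.heappop(free_at)         # earliest available machine
        finish = start + random.expovariate(rates[j])
        completions.append(finish)
        heapq.heappush(free_at, finish)
    return max(completions), sum(completions)  # makespan, flowtime

random.seed(1)
rates = [0.5, 1.0, 2.0, 4.0, 4.0]              # hypothetical completion rates
print("LEPT:", schedule(rates, m=2, rule="LEPT"))
print("SEPT:", schedule(rates, m=2, rule="SEPT"))
```

Averaged over many replications, LEPT tends to give the smaller makespan and SEPT the smaller flowtime, in line with the optimality results quoted above.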

Dr J. POLONIECKI: Dr Gittins' proposed solution to the infinite horizon multi-armed bandit problem has a very surprising feature. The method consists of looking at a function of the data (r successes, n trials) on each of the arms at a time, and then deciding for the next step to use that arm for which this function has the largest value. One-step ahead horizon optimal solutions can clearly be expressed in this way. The two-step ahead horizon optimal solution cannot. In view of this surprising feature of the solution, the name "DAI" does not do justice to its appeal.

A statistician knows not to look to the observed average rate of success of the arm (r/n) for an optimal decision, nor to the expected rate of success {(r+1)/(n+2)}, nor to the expected waiting time to the next success (cf. r/(n+1)). The DAI tells us to look at the "maximum expected rate of return", and choose the arm for which this is the largest.

For practical application, we need a set of tables (one table per discount factor). These tables are not yet available, although Glazebrook (1978b) shows how they would be used. It is not clear, however, what happens as the working boundary for their calculation is extended. For clinical trial work there is needed, in addition, some reappraisal of the decision-making role of clinical trials.

Is the "maximum expected rate of return" policy as optimal as Dr Gittins suggests? It is based on comparing an unknown process with a standard process, and we are told that the optimal procedure here must have the property that once the known process has been used it will be used thereafter. Clearly such a policy is not asymptotically optimal, in the sense that there is a positive probability that the known process will be used an overwhelming proportion of the time, when the probability that it is superior is not equal to one. Having been told that the "optimal" procedure is not asymptotically optimal, it is disturbing that there exist procedures which are. The existence of asymptotically optimal or "convergent" procedures has been shown under fairly general conditions (Poloniecki, 1978).

Dr G. WEISS (Birmingham University): I want to congratulate the author for pinpointing two important theorems, the forward induction and the DAI theorem, and for showing how they underlie the scheduling and the multi-armed bandit problems. It seems likely that the theorems continue to hold when the definition of a bandit process is extended to be a semi-Markov decision process, where the continuation control is associated with a transition to another state, a reward and, in addition, a random time that passes until the next decision. In the scheduling context this formulation includes the scheduling problem when no pre-emptions are allowed. Harrison (1975) has calculated DAI's for that case. A further generalization of the bandit process is to allow the random emergence of new bandit processes when the continuation control is applied. This allows the treatment of arrivals as well as more complex feedback situations (see Meilijson and Weiss, 1977).

On Professor Whittle's question concerning the validity of DAI's when other arms can change state when one arm is pulled, Meilijson (1975, private communication) worked out a counter-example.

The following contributions were received in writing, after the meeting.
Professor E. M. L. BEALE (Scicon): Dr Gittins is to be congratulated on a clear exposition of a unifying approach to a narrow but significant class of problems. This approach is presented as an alternative to Dynamic Programming, but the algorithm for computing the DAI can equally be regarded as an application of Dynamic Programming.

This can be seen most clearly when there is only a finite number of possible states. The DAI v is defined as the maximum value of the expected discounted net reward per unit of expected discounted time, when we have the option of giving up at any time after the first stage. It is natural to compute this by iteration in policy space, i.e. by iterative improvement in the set C of states from which we continue. Let R_i be the reward for continuing when in state i, p_ij the transition probability from state i to state j, and i_0 the initial state. Let C_k denote the set of states from which we continue under the kth trial policy, and let x_i^(k) and w_i^(k) denote the discounted expected further reward and further duration respectively, when in state i. Then x_i^(k) = w_i^(k) = 0 if i ∉ C_k, and otherwise

x_i^(k) = R_i + a Σ_j p_ij x_j^(k),   (1)

w_i^(k) = 1 + a Σ_j p_ij w_j^(k).   (2)

These equations can be solved for x_i^(k) and w_i^(k), and v_k can then be computed as

v_k = (R_{i_0} + a Σ_j p_{i_0 j} x_j^(k)) / (1 + a Σ_j p_{i_0 j} w_j^(k)).   (3)

A new continuation set C_{k+1} can then be defined by the condition that i ∈ C_{k+1} if and only if

R_i + a Σ_j p_ij x_j^(k) > v_k (1 + a Σ_j p_ij w_j^(k)).   (4)

With this algorithm v_{k+1} ≥ v_k, and v_k = v if C_{k+1} = C_k. The algorithm can be streamlined by writing y_i^(k) = x_i^(k) - v_k w_i^(k). Then from (1) and (2) we deduce that

y_i^(k) = R_i - v_k + a Σ_j p_ij y_j^(k)   if i ∈ C_k,   (5)

while from (3) we deduce that

R_{i_0} - v_k + a Σ_j p_{i_0 j} y_j^(k) = 0.   (6)

Here (5) and (6) are a set of linear simultaneous equations from which the y_i^(k) and v_k can be computed. Condition (4) can then be written: i ∈ C_{k+1} if and only if

R_i - v_k + a Σ_j p_ij y_j^(k) > 0.

Whether or not v is computed this way, we can write the equations defining the final values of v and the y_i in the form

y_i = max(0, R_i - v + a Σ_j p_ij y_j),   (7)

R_{i_0} - v + a Σ_j p_{i_0 j} y_j = 0.   (8)

These equations can be justified from first principles. They apply whether the number of states is finite or infinite, and can be solved by other means that nevertheless fall within the scope of Dynamic Programming.

Professor R. BELLMAN (University of Southern California): These are important problems. It is interesting to note that the two-arm bandit problem can be used as an example of learning. See Bellman (1971, 1978); Dreyfus and Law (1977). There is much work to be done in the area of adaptive processes.

Miss J. M. CAULDWELL: One of our chemists recently calculated that there was a total of 8 x 10^14 chemical structures in a series which he was screening. At the present rate of progress it would take 1.3 x 10^12 years to test them here. If he could persuade the total population of the world to help, this time could be reduced to about 6000 years. Decision making in the early stages of the screening process is therefore vital.

At some stage in the screening process someone has to make a decision about which compound type is showing no response and is not worthy of further investigation, as opposed to one or more compound types which are showing the potential of reaching the test target. Establishing the best line of follow-up when a series of compounds is being investigated is clearly a recurrent problem which has proved difficult to solve.

Dr Gittins has recently applied the DAI theory to a set of data which had been collected by some of our chemists working on a particular research project. The results of the statistical analysis were of considerable interest to the chemists concerned because, although the project in question had been completed, the DAI theory picked out those groups which the chemists themselves had felt to be the most promising. The theory gave statistical support to what they felt were perhaps slightly woolly reasons for following up certain groups and abandoning other groups. Furthermore, our chemists recognized the potential of the theory in assisting with decision-making at an earlier stage in the screening process, with the advantage of having some indication of the number of compounds that would have to be tested before finding one that reached the test target.

Dr P. W. JONES (University of Keele): I have two comments to make. If the optimal policy may be obtained by using DAI's for the situation where bandit processes arrive randomly in time, then presumably this approach may now be used for the optimal solution of the problem of varietal selection, where varieties may be introduced at any stage in the selection procedure.

In a note, Jones (1975), the Bernoulli two-armed bandit with finite horizon, no discounting and independent beta priors was considered. The numerical work presented concerned the performance of two suboptimal policies. The one-step look ahead policy was found to be in excess of 99 per cent efficient compared with the optimal design. Using DAI's in this case would give an efficiency which is at least as large as this. This seems to suggest that the considerable computational effort required using Dynamic Programming to obtain the optimal policy is not worthwhile. The play the winner rule was also used and this had an efficiency of over 90 per cent for all the cases; this is rather surprising since this rule depends only on the previous observation. In Freeman (1970) the Bayesian estimation, under quadratic loss, of the median effective dose for up to three dose levels was considered. This is, of course, a multi-armed bandit problem.
It was found that the up-and-down method of allocation, which is closely related to the play the winner rule, was in excess of 90 per cent efficient in most cases.

In practice one would accept a slightly suboptimal rule which was easy to use. Therefore it would be interesting to investigate the efficiency of simple rules analogous to play the winner or up-and-down rules for the Bernoulli multi-armed bandit problem with finite horizon or infinite horizon with discounting. The play the winner rule could be used to switch from the current arm to a randomly chosen arm, or alternatively the play the winner rule could be used to switch from the current arm to that arm with the largest expected return for the next trial. To reduce the computational complexity, perhaps a mechanism for the early rejection of arms could be incorporated. Is there any evidence to suggest that part of the optimal policy for the multi-armed bandit behaves in a relatively simple way?

Dr P. NASH (Churchill College, Cambridge): The approach to the DAI theorem via forwards induction can be characterized as a branch-and-bound calculation. In deterministic dynamic programming, branch-and-bound methods attempt to overcome the curse of dimensionality by replacing the optimal remaining reward function by an estimate of it which is an upper bound (in a maximizing problem). At each stage, the total remaining reward given any particular initial decision is estimated as the sum of the one-step cost given this decision and the upper bound on the remaining reward in the state reached. The decision tree is evaluated by starting at the initial point and taking at each stage the decision for which the estimated total reward (including all the one-step costs for branches already traversed) is greatest. At any stage, attention centres on that node of the tree for which the sum of the estimated further reward and the one-step rewards obtained in reaching that node from the initial point is greatest. Eventually, this node is a final decision point, and then the upper bound calculations imply that the path leading to this node has higher total reward than any other. For a good enough upper bound, this occurs long before all paths have been evaluated. In contrast, the backwards induction of DP always evaluates all paths.

For a family F of alternative bandit processes, an upper bound is (extending the notation of the paper)

B(0) = sup_{D ∈ F} {v(D, x(0))}/(1 - a).

Consider the sequence of decisions which fixes (D_1, τ_1), (D_2, τ_2), .... One can show from the definition of v(D) that if the first decision is (D, τ) and r(D, τ) is the maximum expected reward given this initial decision, then

r(D, τ) ≤ R_τ(D) + B(0) E{a^τ}.

The first step of a branch-and-bound calculation fixes a particular choice of D_1 and τ_1, the particular choice being that which maximizes

R_{τ_1}(D_1) + B(0) E{a^{τ_1}}.

This means choosing the process whose DAI is equal to (1 - a) B(0), and τ_1 as the stopping time which yields the supremum in the definition of the DAI. The force of the forwards induction theorem is then that no decision path whose first branch does not coincide with this one need ever be investigated as we continue to branch and bound. This would seem to reinforce the hope that forwards induction policies can be proved optimal in more general circumstances, since for that particular initial decision to be optimal, we only require that paths which do not start with it will ultimately be abandoned in the branch-and-bound procedure. This is a weaker property than that by which the DAI theorem is proved.
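Professor Beale's policy-space iteration above (equations (1)-(4)) can be sketched directly for a finite-state bandit process. The sketch below is an illustration only: the rewards, transition matrix, discount factor and initial state are hypothetical, the linear equations (1)-(2) are solved by repeated substitution rather than exactly, and the choice of all states as the first continuation set is an assumption not specified in the discussion.

```python
def beale_dai(R, P, a, i0, iters=200):
    """Iterative improvement of the continuation set C for the DAI of a
    finite bandit process, following Professor Beale's equations (1)-(4).

    R[i]    - reward for continuing in state i
    P[i][j] - transition probability from state i to state j
    a       - discount factor, i0 - initial state"""
    n = len(R)
    C = set(range(n))                        # first trial continuation set (assumption)
    while True:
        # equations (1)-(2): x, w = 0 outside C, solved by repeated substitution
        x, w = [0.0] * n, [0.0] * n
        for _ in range(iters):
            for i in C:
                x[i] = R[i] + a * sum(P[i][j] * x[j] for j in range(n))
                w[i] = 1.0 + a * sum(P[i][j] * w[j] for j in range(n))
        num = R[i0] + a * sum(P[i0][j] * x[j] for j in range(n))
        den = 1.0 + a * sum(P[i0][j] * w[j] for j in range(n))
        v = num / den                        # equation (3)
        # equation (4): improved continuation set
        C_new = {i for i in range(n)
                 if R[i] + a * sum(P[i][j] * x[j] for j in range(n))
                 > v * (1.0 + a * sum(P[i][j] * w[j] for j in range(n)))}
        if C_new == C:
            return v, C
        C = C_new

# two-state toy example (all numbers hypothetical)
R = [1.0, 0.2]
P = [[0.5, 0.5], [0.0, 1.0]]
print(beale_dai(R, P, a=0.9, i0=0))   # DAI of state 0 is 1.0 here
```

In the toy example it never pays to continue once state 1 is reached, so the index of state 0 is simply its one-step reward rate, which the iteration recovers.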

Professor D. O. SIEGMUND (Stanford University): Typically dynamic programming problems are well understood qualitatively but difficult to implement computationally. In this paper Dr Gittins has described an interesting class of problems in which a simple but ingenious trick reduces these computational difficulties to manageable proportions. A given problem is replaced by a family of optimal stopping problems, which are much easier to solve. This produces a "splitting" of the given problem into independent components, the individual solutions to which may be glued together to solve the original.

The key technical idea is that of the DAI. Given a bandit process, the DAI is intuitively that value λ which makes one indifferent between accepting an immediate reward of λ and optimally stopping the bandit process with a residual reward of λ discounted by a^t if stopping occurs at time t. The following example seems instructive. Let arm one of a MAB return 1 or 0 independently on each trial with known probability p. Let arm two return only ones with probability π_1 and only zeros with probability π_0. Then the DAI for arm two satisfies λ = π_1/(1 - a) + π_0 a λ, and if λ ≤ p/(1 - a) then, for every N,

π_1/(1 - a) + π_0 a π_1/(1 - a) + ... + π_0^{N-1} a^{N-1} π_1/(1 - a) ≤ π_1/{(1 - a)(1 - π_0 a)} = λ,

so arm one remains optimal.

The class of problems for which the methods of this paper are applicable is special, albeit important. It would be interesting to know how well this class might serve to approximate other problems. For example, the results do not appear to apply directly to multi-armed bandits with correlated prior distributions. However, for discount factors close to one and a relatively small number of arms, perhaps not too much is lost. Are there analogous results for an average return criterion, which by the relation of Cesàro to Abelian summability is related to the discounted return criterion?

The potential applications to clinical trials are very thought-provoking, but their acceptance in practice may hinge on considerations apparently not amenable to systematic decision theoretic treatment (e.g. the desirability for randomization as an (the?) important aspect of experimental design).

Finally, the reader stimulated to study the proof of the DAI theorem (Gittins and Jones, 1974a) should be warned that Lemma 2 of that paper appears to have a crucial inequality reversed.

Professor B. W. TURNBULL (Cornell University): I have been acquainted with the author's work on DAI's for some time and this paper gives a very readable account of what is an interesting and significant contribution to the theory of sequential decision processes and sequential design of experiments. I wonder whether the theory can be adapted, as in Section 10 perhaps, to handle the problem where, at each stage, one option is to freeze eternally all the rival bandit processes and take a terminal reward which depends on a terminal decision to be taken then. If so, it would be of interest to compare the DAI rules with the asymptotically optimal procedures of Bessler (1960), who took a sequential game theoretic approach. Unlike the DAI procedure, Bessler's rules have the property of being randomized, which is an advantage in clinical trials because of the problem of selection bias. Of course, in other applications, non-randomized rules may be preferable.

In referring to Robbins and Siegmund (1974), the author alludes to the selection formulation of the n-armed bandit problem where it is desired to find a procedure that maximizes cumulative one-stage rewards from among that class of rules that eventually stop and select the best treatment with prescribed error probabilities. Also of interest here are the asymptotically optimal rules of Louis (1975, 1977). These papers all deal only with the case n = 2; for n ≥ 3, similar methods can be used but there are some difficulties (Turnbull et al., 1978).

The proposed use of adaptive sampling in medical trials in practice has been much criticized recently (Bailar, 1976; Simon, 1977). Two objections given are:
(A) Although adaptive sampling can lead to a smaller expected number of patients on inferior treatments (ITN), it increases the total expected sample size (ASN) compared to a non-adaptive method. This delays conclusion of the trial and perhaps adversely affects patients not part of the trial.
(B) Adaptive sampling rules are too complicated.
In response, it should be noted that (A) is only true for n = 2; for n ≥ 3 substantial savings in both ASN and ITN can be achieved simultaneously by use of adaptive sampling. This is demonstrated in Turnbull et al. (1978) and is intuitively clear because non-contending treatments can now be dropped from consideration early.
In response to (B), it might be noted that adaptive allocation of patients to treatments based on previous responses need not be much more complicated than the adaptive allocation rules, based on prognostic variables, designed to maintain balance in a stratified study. Yet the latter type of adaptive procedure is gaining acceptance in practice, e.g. in multi-clinic trials. Finally, since ASN as well as ITN can be reduced, adaptive sampling might be applicable in animal experiments where statistical considerations can play a greater role in the design and conduct of the study.

The AUTHOR replied later, in writing, as follows.

For me at any rate the discussion has been most interesting, and I should like to begin by thanking the proposer and seconder of the vote of thanks, and indeed all the participants, for their contributions and for their kind words.

Professor Bather raises the question of the sensitivity of the solution to the multi-armed bandit problem to changes in the prior distribution and the discount factor. The simplest generalization (Gittins and Jones, 1979) is that the gap between any iso-DAI, and the line through the origin to which the iso-DAI is asymptotically parallel, is always less than 4.0 for discount factors which are not greater than 0.99. This means that, except for small values of α and β, the optimal policy is well approximated by one which always selects the arm for which the posterior expectation (α + 1)/(α + β + 2) of the unknown success probability is largest. Call this policy A. To this approximation, then, the solution is robust to changes in the discount factor. The effects of changes in the prior distribution on policy A are measurable as constant changes in α and β for the arm concerned. As for most Bayesian procedures, the precise choice of prior distribution is not crucial, but priors which differ by assigning high probabilities to different regions of the parameter space (these correspond to high initial values of α and β) lead to substantially different procedures. The calculations reported by Jones (1975) and Wahrenberger et al. (1977) show that for the finite horizon undiscounted problem policy A again does well, and is not unduly sensitive to changes in the prior distribution.

The randomized allocation indices proposed by Professor Bather are variations of policy A, and their good performance is thus not altogether surprising. The device of randomization leads to the asymptotic optimality property which he describes, and which, as Dr Poloniecki points out, the Bayes policy based on DAI's does not have. A thoroughgoing Bayesian would not, of course, regard this as a particularly strong objection. However, it would be interesting to examine the performance of policies obtained by adding a random component to the DAI, rather than to the proportion of successes, for each arm. In this way it might be possible to have the best of both worlds.

Extensive calculations of the DAI function have been carried out for various values of the discount factor, and are described by Gittins and Jones (1979). It can actually be shown that for any values of α and β the DAI tends to one as the discount factor tends to one. This means that the gap between an iso-DAI and the asymptotically parallel line through the origin must tend to infinity, despite the above-mentioned unremarkable behaviour for discount factors up to 0.99.
The behaviour of the iso-DAI's in the limit is an intriguing open question, as is (Berry, 1972) the nature of the optimal policy for the undiscounted case as the horizon tends to infinity, though the practical significance may not be particularly great in either case.

As Professor Bather says, there is a noticeable lack of enthusiasm among medical statisticians for allocation rules designed to reduce the number of patients given inferior treatments in clinical trials. My impression, like that of Professor Siegmund, is that this is largely attributable to the importance attached to randomization as a means of removing bias. However, pressure from medical practitioners and from governments may lead to a change of attitude. Professor Turnbull also makes some interesting comments on this point.

The result mentioned by Professor Whittle is intuitively appealing. The quantities γ_u and δ_u are natural measures of the cost of progress towards a terminal decision, under H_1 and H_2 respectively, when experiment u is used. Thus one might hope to find an elementary derivation. However, I have been unable to find an interpretation of this rule as a forwards induction policy, and would be inclined to look for an appropriate generalization of Wald's equation.

The remarks of Mr Davies, Mr Baker and Miss Cauldwell refer primarily to the set of DAI tables prepared by Gittins and Jones (1974b) as an aid in new-product chemical research. It is encouraging to hear from them of scope for practical application. I am in the process of analysing several sets of compound screening data provided by pharmaceutical companies with the help of these tables. The findings of this exercise will be reported in due course.

Mr Davies also raises the question of what to do if a bandit process improves as a result of the research team's increased skill in selecting new compounds. My feeling is that such changes can best be taken into account by calculating the DAI on the basis of recent results only, rather than by modelling the learning process itself. Of course, as Professor Bellman remarks, the models do incorporate an aspect of learning, but this is not the one to which Mr Davies refers.

The computer-based procedure mentioned by Dr Roberts is also designed as a learning model, this time for the purpose of dividing resources between different new-product chemical research projects. I have been following its development with interest.

I should like to congratulate Dr Kelly on finding two ingenious new applications of the DAI theorem in the form which allows the time between successive decision points to depend on the state of the bandit process which is currently being continued. As Dr Weiss surmises, the theorem still holds if this time is also allowed to be random, a result which is given in a slightly different form by Gittins and Nash (1977). The striking feature of Dr Kelly's two examples is his use of this variable time to represent first a cost, and then a probability, thereby establishing results for situations where the things which look most like bandit processes do not function independently.

For his first example of a hidden object, with no costs, but a reward of a^t if it is found at time t, each box taking one unit of time to search, the expected reward under an arbitrary deterministic policy is

Σ_{i=1}^{n} Σ_{j=1}^{∞} P(i) {∏_{k=1}^{j-1} (1 - d(i,k))} d(i,j) a^{t(i,j)}.   (*)

Here the time at which the jth search of box i takes place, if the object has not been found before then, is denoted as t(i,j).
We note that the expression (*) is also the expected total reward for a family of n alternative bandit processes for each of which the state coincides with the process time, under a policy which, for all i and j, continues bandit process i for the jth time at time t(i,j). To make this interpretation we must let the undiscounted reward from continuing bandit process i when it is in state j be

P(i) {∏_{k=1}^{j-1} (1 - d(i,k))} d(i,j).

The optimal policy for both problems is therefore expressible in terms of DAI's. For the case when the time taken by the jth search of box i is c(i,j) we simply replace t(i,j) by Σ_{k=1}^{j} c(i,k) in (*), and make the corresponding change in the expression for the DAI. Letting a tend to one in this expression leads to the index v(i) given by Dr Kelly. Thus a policy based on this index must be such as to minimize the term of order 1 - a as a tends to 1 in (*), and this is precisely what is required for the original undiscounted search problem for which c(i,j) is the cost of the jth search of box i. Indeed a neat piece of work, and I hope Dr Kelly will not mind my filling in these details.

Professor Fristedt and Dr Poloniecki draw attention to the possibility that the iterative methods of calculation which I have described may lead to unacceptable accumulation of errors. This is an important consideration, and checks must be incorporated in any set of calculations to ensure that this does not happen.

Dr Glazebrook indicates three areas of current and prospective research interest. His paper on stoppable families of alternative bandit processes provides a partial answer to a question raised by Professor Turnbull. It extends the DAI theorem to stoppable families under a certain condition, which includes monotonicity conditions as special cases.

Dr Glazebrook suggests using a Hamiltonian approach to solve continuous-time sequential allocation problems, along the lines of Nash and Gittins (1977). This is certainly a line worth pursuing, and the account given by Nash (1973) is still worth reading, not least for its discussion (Section 2.4) of the multi-server problems to which Dr Dempster refers. It seems to me, however, that we could also do with a general theorem for translating discrete-time results into their obvious continuous-time analogues. The entire theory of Markov decision processes has a gap at this point.

Dr Dempster gives a useful outline of current work on multi-processor scheduling problems. As he says, this is an exciting area in which much remains to be done. I conjecture, for example, that conditions which ensure that the policy which minimizes expected average flow-time is expressible in terms of a DAI, when no new jobs arrive, will also ensure this when the arrivals of new jobs form a Poisson process. For the single processor case this has already been established by Nash (1973), as a consequence of the DAI theorem, and independently by Meilijson and Weiss (1977), who used a neat, and entirely different, inductive argument. As Dr Jones says, there is a possible application in varietal selection, though here the calculation of DAI's may present serious problems. It is interesting to note that for this result the criterion is the average return per unit time, and was established by Nash from the corresponding discounted return problem by letting the discount factor tend to one. I share Professor Siegmund's view that there must be more general results of this type waiting to be proved.

Professor Beale presents an attractive algorithm for carrying out the calculations outlined in Section 8, which I am sure will prove useful.
His equation (7) is equivalent to equation (13) of the paper, and his equation (8) is implicit in the text of the following paragraph. As he says, this is all dynamic programming, but I stand by my assertion that "forwards induction policies are often easier to determine than backwards induction policies", and that it is therefore worth knowing when forwards induction policies are optimal. Dr Nash's characterization of forwards induction as a branch-and-bound calculation supports this view. This is a connection which itself warrants further investigation.

Finally, I should like to thank Professors Fristedt and Siegmund for clarifying several points.
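Dr Kelly's search index v(i), as completed in the reply above, is also straightforward to evaluate numerically. The following minimal sketch uses hypothetical prior probabilities P(i), detection probabilities d(i,j) and costs c(i,j), and truncates the maximization over t at a finite horizon, which is an approximation.

```python
def search_index(P_i, d_i, c_i):
    """Dr Kelly's index v(i) = max_t  sum_{j<=t} R(i,j) / sum_{j<=t} c(i,j),
    with R(i,j) = P(i) * prod_{k<j} (1 - d(i,k)) * d(i,j).

    d_i and c_i are finite lists, so t is truncated at len(d_i)."""
    best = 0.0
    survive = 1.0                      # prod_{k<j} (1 - d(i,k))
    reward_sum = cost_sum = 0.0
    for d, c in zip(d_i, c_i):
        reward_sum += P_i * survive * d
        cost_sum += c
        survive *= 1.0 - d
        best = max(best, reward_sum / cost_sum)
    return best

# three boxes: hypothetical prior probabilities, detection probabilities, costs
P = [0.5, 0.3, 0.2]
d = [[0.3] * 20, [0.6] * 20, [0.9] * 20]
c = [[1.0] * 20, [2.0] * 20, [4.0] * 20]
indices = [search_index(P[i], d[i], c[i]) for i in range(3)]
print(indices, "-> look first in box", indices.index(max(indices)))
```

The optimal policy described by Dr Kelly begins by looking in the box with the largest such index; after each unsuccessful look the index of that box is recomputed and the comparison repeated.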

REFERENCES IN THE DISCUSSION
BAILAR, J. (1976). Patient assignment algorithms: an overview. Proc. 9th Int. Biometric Conf., I, 189-203.
BELLMAN, R. (1971). Introduction to the Mathematical Theory of Control Processes, Vol. II. New York: Academic Press.
-- (1978). An Introduction to Artificial Intelligence: Can Computers Think? San Francisco: Boyd and Fraser.
BERRY, D. A. (1972). A Bernoulli two armed bandit. Ann. Math. Statist., 43, 871-897.
BESSLER, S. (1960). Theory and applications of the sequential design of experiments, k actions and infinitely many experiments. Technical Reports Nos. 55, 56, Dept. of Statistics, Stanford University.
DEMPSTER, M. A. H. (1979). Some programs in dynamic stochastic scheduling. Presented at 2nd Int. Conf. on Stochastic Programming, Oberwolfach, Germany, February 2nd, 1979.
DREYFUS, S. and LAW, A. M. (1977). The Art and Theory of Dynamic Programming. New York: Academic Press.
FABIUS, J. and VAN ZWET, W. R. (1970). Some remarks on the two-armed bandit. Ann. Math. Statist., 41, 1906-1916.
FREEMAN, P. R. (1970). Optimal Bayesian sequential estimation of the median effective dose. Biometrika, 57, 79-89.
GITTINS, J. C. (1979). Sequential stochastic scheduling with more than one server. Math. Operationsforschung (in press).
GLAZEBROOK, K. D. (1979). Stoppable families of alternative bandit processes. J. Appl. Prob. (in press).
GRAHAM, R. L., LAWLER, E. L., LENSTRA, J. K. and RINNOOY KAN, A. H. G. (1977). Optimization and approximation in deterministic sequencing and scheduling: a survey. Technical Report BW 82/77, Mathematisch Centrum, Amsterdam. To appear in Proc. of Discrete Optimization 1977, August 8th-12th, 1977.
HARRISON, J. M. (1975). Dynamic scheduling of a multiclass queue: discount optimality. Oper. Res., 23, 270-282.
JONES, P. W. (1975). The two armed bandit. Biometrika, 62, 523-524.
KADANE, J. B. and SIMON, H. A. (1977). Optimal strategies for a class of constrained sequential problems. Ann. Statist., 5, 237-255.
LOUIS, T. A. (1975). Optimal allocation in sequential tests comparing the means of two Gaussian populations.
-- (1977). Sequential allocation in clinical trials comparing two exponential survival curves. Biometrics, 33, 627-634.
MEILIJSON, I. and WEISS, G. (1977). Multiple feedback at a single server station. Stoch. Proc. Applic's, 5, 195-205.
POLONIECKI, J. D. (1978). The two armed bandit and the controlled clinical trial. Statistician, 2, 97-102.
ROBBINS, H. (1952). Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc., 68, 527-535.
SIMON, R. (1977). Adaptive treatment assignment methods and clinical trials. Biometrics, 33, 743-749.
TURNBULL, B. W., KASPI, H. and SMITH, R. L. (1978). Adaptive sequential procedures for selecting the best of several normal populations. J. Statist. Comput. Simul., 7, 133-150.
WALRAND, J. and VARAIYA, P. (1978). The output of Jacksonian networks are Poissonian. Memo ERL-M78/60, Electronics Research Laboratory, University of California at Berkeley.
WEBER, R. R. (1979). Scheduling stochastic jobs on parallel machines. Cambridge University.
WEBER, R. R. and NASH, P. (1978). An optimal strategy in multi-server stochastic scheduling. J. R. Statist. Soc. B, 40, 322-327.
WEISS, G. and PINEDO, M. (1978). Scheduling tasks with exponential service times on non-identical processors to minimize various cost functions. Technical Report ORC 78-16, Operations Research Centre, University of California at Berkeley. (To appear in J. Appl. Prob.)
WHITTLE, P. (1965). Some general results in sequential design. J. R. Statist. Soc. B, 27, 371-394.