IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 49, NO. 12, DECEMBER 2001

Hidden Markov Model Multiarm Bandits: A Methodology for Beam Scheduling in Multitarget Tracking

Vikram Krishnamurthy, Senior Member, IEEE, and Robin J. Evans

Abstract—In this paper, we derive optimal and suboptimal beam scheduling algorithms for electronically scanned array tracking systems. We formulate the scheduling problem as a multiarm bandit problem involving hidden Markov models (HMMs). A finite-dimensional optimal solution to this multiarm bandit problem is presented. The key to solving any multiarm bandit problem is to compute the Gittins index. We present a finite-dimensional algorithm that computes the Gittins index. Suboptimal algorithms for computing the Gittins index are also presented. Numerical examples are presented to illustrate the algorithms.

Index Terms—Dynamic programming, hidden Markov models, optimal beam steering, scheduling, sequential decision procedures.

Manuscript received February 1, 2000; revised September 11, 2001. This work was supported by an ARC large grant and the Centre of Expertise in Networked Decision Systems. The associate editor coordinating the review of this paper and approving it for publication was Prof. Douglas Cochran. The authors are with the Department of Electrical Engineering, University of Melbourne, Parkville, Victoria, Australia (e-mail: [email protected]; [email protected]).

[Fig. 1. Multitarget tracking with one intelligent sensor. The scheduling problem involves which single target to track with the steerable electronically scanned array (ESA).]

I. INTRODUCTION AND PROBLEM FORMULATION

Consider the following hidden Markov model (HMM) target tracking problem: $P$ targets (e.g., aircraft) are being tracked by an electronically scanned array (ESA) with only one steerable beam. The coordinates of each target evolve according to a finite state Markov chain (see [27]), and the conditional mean estimate of each target's state is computed by an HMM tracker. We assume that the target coordinates evolve independently of each other—that is, the $P$ Markov chains are independent of each other. Since we have assumed that there is only one steerable beam, we can obtain noisy measurements of only one target at any given time instant. Our aim is to answer the following question: Which single target should the tracker choose to observe at each time instant in order to optimize some specified cost function? Fig. 1 shows a schematic representation of the problem.

In this paper, we give a complete solution to this problem. We formulate the problem as an HMM multiarm bandit problem. Then, a finite-dimensional optimal algorithm and suboptimal algorithms are derived for computing the solution.

The problem stated above is an integral part of any agile beam tracking system [1], [6], [7]. Practical implementation of such systems is quite complex, and beam scheduling must handle a variety of search, identification, and tracking functions. Our work considers one important part of a complete agile beam tracking system. We show that the multiarm bandit formulation can provide a rigorous and potentially useful approach to the problem of scheduling an agile beam among a finite number of established target tracks.

The multiarm bandit problem is an example of a dynamic stochastic scheduling problem (a resource allocation problem). The "standard" multiarm bandit problem involves a fully observed finite state Markov chain and is merely a finite state Markov decision process (MDP) problem with a rich structure. We will refer to this standard problem as the Markov chain multiarm bandit problem. Numerous applications of the Markov chain multiarm bandit problem appear in the operations research and stochastic control literature; see [15] and [30] for examples in job scheduling and resource allocation for manufacturing systems and [5] for applications in scheduling of traffic in broadband networks. Several fast algorithms have recently been proposed to solve this Markov chain multiarm bandit problem (see [16] and [29]).

However, due to measurement noise at the sensor, the above multitarget tracking example cannot be formulated as a standard Markov chain multiarm bandit problem. Instead, we will formulate the problem as a multiarm bandit problem involving HMMs, which is considerably more difficult to solve. To develop suitable algorithms for the tracking problem, in this paper, we first give a complete solution to the multiarm bandit problem for an HMM. In the operations research literature, such problems are also known as partially observed Markov decision processes (POMDPs) [20], [21].

Our paper focuses on designing optimal beam scheduling algorithms by formulating the problem as an HMM multiarm bandit problem. Other applications of the HMM multiarm bandit problem include the above Markov chain multiarm bandit applications when only noisy observations of the state are available, robot navigation systems [13], learning behavior [24], and fault-tolerant systems [14].

A. HMM Multiarm Bandit Problem

1) Markov Chain Multiarm Bandit Problem: To motivate the HMM multiarm bandit problem, we first outline the well-known finite state Markov chain multiarm bandit problem [4], [18]. Consider $P$ independent projects (or targets) $p = 1, \ldots, P$. At each discrete-time instant $k = 0, 1, \ldots$, only one of these projects can be worked on. We assume that each project $p$ has a finite number of states $N_p$. Let $x_k^p$ denote the state of project $p$ at time $k$. If project $p$ is worked on at time $k$, one incurs an instantaneous cost of $\beta^k c(x_k^p, p)$, where $0 \le \beta < 1$ denotes the discount factor; the state $x_k^p$ evolves according to an $N_p$-state homogeneous Markov chain with transition probability

$$P\big(x_{k+1}^p = j \mid x_k^p = i\big) = a_{ij}^{(p)} \quad \text{if project } p \text{ is worked on at time } k. \tag{1}$$

Let $A^{(p)}$ denote the transition probability matrix $\big[a_{ij}^{(p)}\big]$. The states of all the other $P - 1$ idle projects are unaffected, i.e.,

$$x_{k+1}^q = x_k^q \quad \text{if project } q \text{ is idle at time } k.$$

Let the control variable $u_k \in \{1, \ldots, P\}$ denote which project is worked on at time $k$. Consequently, $x_k^{u_k}$ denotes the state of the active project at time $k$. Let the policy $\mu$ denote the sequence of controls $\mu = (u_0, u_1, \ldots)$. The total expected discounted cost over an infinite time horizon is given by

$$J_\mu = \mathbf{E}\left\{\sum_{k=0}^{\infty} \beta^k\, c\big(x_k^{u_k}, u_k\big)\right\} \tag{2}$$

where $\mathbf{E}$ denotes mathematical expectation. For the above cost function to be well defined, we assume that $c(x, p)$ is uniformly bounded from above and below (see [4, p. 54]). Define $\mathcal{U}$ as the class of all admissible policies $\mu$. Solving the Markov chain multiarm bandit problem [29] involves computing the optimal policy $\mu^* = \arg\min_{\mu \in \mathcal{U}} J_\mu$, where $\mathcal{U}$ denotes the set of sequences that map the states $(x_k^1, \ldots, x_k^P)$ to $u_k$ at each time instant.¹

¹Throughout this paper, we will deal with minimizing the cost function (2). Clearly, this is equivalent to maximizing a corresponding reward function.
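Before turning to the HMM case, a minimal Python sketch may help fix ideas. This is our illustration, not code from the paper; all identifiers (`n_states`, `trans`, `cost`, `myopic_policy`) are hypothetical, and the myopic policy is merely a placeholder rather than the Gittins-index optimal policy discussed later. It evaluates the discounted cost (2) by Monte Carlo for a bandit in which, per (1), only the active project's state evolves:

```python
# Minimal sketch (ours, not the authors' code): Monte Carlo evaluation of the
# discounted cost (2) for the fully observed Markov chain multiarm bandit.
# All identifiers (n_states, trans, cost, myopic_policy) are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

P = 3                                # number of projects (targets)
n_states = [4, 4, 4]                 # N_p for each project p
beta = 0.9                           # discount factor, 0 <= beta < 1

# Row-stochastic transition matrices A^(p) and bounded costs c(x, p)
trans = [rng.dirichlet(np.ones(n), size=n) for n in n_states]
cost = [rng.uniform(0.0, 1.0, size=n) for n in n_states]

def myopic_policy(x):
    # Placeholder policy: work on the project with the smallest current cost.
    # The optimal policy is instead characterized by the Gittins index.
    return int(np.argmin([cost[p][x[p]] for p in range(P)]))

def discounted_cost(policy, horizon=200):
    """One sample path of the cost (2), truncated at a finite horizon."""
    x = [int(rng.integers(n)) for n in n_states]    # initial project states
    total = 0.0
    for k in range(horizon):
        p = policy(x)
        total += beta**k * cost[p][x[p]]
        # Per (1), only the active project's state evolves; idle ones freeze.
        x[p] = int(rng.choice(n_states[p], p=trans[p][x[p]]))
    return total

print(np.mean([discounted_cost(myopic_policy) for _ in range(100)]))
```

Truncating the horizon is harmless here: because of the discounting, the neglected tail of the sum in (2) is geometrically small.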
2) Preliminary Formulation of HMM Multiarm Bandit Problem: In this paper, motivated by the agile beam scheduling application, we consider the conceptually more difficult HMM multiarm bandit problem, where we assume that the state of the active project is not directly observed. Instead, if $u_k = p$, noisy measurements (observations) $y_{k+1}^p$ of the active project state $x_{k+1}^p$ are available at time $k + 1$. Assume that these observations belong to a finite set indexed by $m = 1, \ldots, M_p$. Denote the observation and scheduling history² at time $k$ as $Y_k = \big(u_0, y_1^{u_0}, \ldots, u_{k-1}, y_k^{u_{k-1}}\big)$.

²Our timing assumption that $u_k$ determines $y_{k+1}$ is quite standard in stochastic optimization [4]. It is straightforward to rederive the results in this paper if $u_k$ determines $y_k$.

Let $B^{(p)} = \big[b_{im}^{(p)}\big]$, $i \in \{1, \ldots, N_p\}$, $m \in \{1, \ldots, M_p\}$, denote the observation probability (symbol probability) matrix of the HMM, where each element $b_{im}^{(p)} = P\big(y_{k+1}^p = m \mid x_{k+1}^p = i,\ u_k = p\big)$. Our aim is to solve the HMM multiarm bandit problem, i.e., to determine the optimal policy $\mu^*$ that yields the minimum cost in (2)

$$\mu^* = \arg\min_{\mu \in \mathcal{U}} J_\mu. \tag{3}$$

Note that unlike in the standard Markov chain bandit problem, in the HMM case, the class of admissible policies $\mathcal{U}$ comprises the set of sequences that map $Y_k$ to $u_k$; see [4] for details. Let $J_{\mu^*}$ denote the optimal cost.

Remark: Because the state and observation sets $\{1, \ldots, N_p\}$ and $\{1, \ldots, M_p\}$ are finite and the costs $c(x, p)$ are uniformly bounded from above and below, the optimization (3) is well defined [20].

B. Information State Formulation of HMM Multiarm Bandit Problem

The above HMM multiarm bandit problem is an infinite horizon POMDP with a rich structure that considerably simplifies the solution, as will be shown later. First, however, as is standard with partially observed stochastic control problems, we convert the partially observed multiarm bandit problem to a fully observed multiarm bandit problem defined in terms of the information state; see [4] or [18] for the general methodology.

For each target $p$, the information state at time $k$, which we will denote by the $N_p$-dimensional column vector $\hat{x}_k^p$, is defined as the conditional filtered density of the Markov chain state given the observation history and scheduling history

$$\hat{x}_k^p(i) = P\big(x_k^p = i \mid Y_k\big), \quad i = 1, \ldots, N_p. \tag{4}$$

The information state can be computed recursively by the HMM state filter (which is also known as the "forward algorithm" or "Baum's algorithm" [23]) according to (5) below.
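As a concrete, deliberately naive illustration of definition (4), the following sketch (ours; all names such as `info_state_bruteforce` are hypothetical) computes the information state of a single continuously observed target by direct Bayes' rule over all state sequences, with the timing of footnote 2 in which $u_k$ determines $y_{k+1}$. The recursion (5) in the next paragraph yields the same density with $O(N_p^2)$ work per step instead of exponential work:

```python
# Illustrative sketch (ours, not from the paper): the information state (4)
# by brute-force enumeration of state paths x_0, ..., x_k of one target.
import itertools
import numpy as np

def info_state_bruteforce(pi0, A, B, ys):
    """P(x_k = i | y_1, ..., y_k) for a target observed at every step."""
    N = A.shape[0]
    post = np.zeros(N)
    for path in itertools.product(range(N), repeat=len(ys) + 1):
        # Joint probability of this state path and the observed symbols
        prob = pi0[path[0]]
        for k in range(len(ys)):
            prob *= A[path[k], path[k + 1]] * B[path[k + 1], ys[k]]
        post[path[-1]] += prob
    return post / post.sum()      # normalize to the conditional density (4)

A = np.array([[0.9, 0.1], [0.2, 0.8]])    # transition matrix A^(p)
B = np.array([[0.7, 0.3], [0.2, 0.8]])    # observation matrix B^(p)
pi0 = np.array([0.5, 0.5])                # initial density of x_0^p
print(info_state_bruteforce(pi0, A, B, ys=[1, 1, 0]))
```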
In terms of the information state formulation, the above HMM multiarm bandit problem can be viewed as the following scheduling problem. Consider $P$ parallel HMM state filters: one for each target. The $p$th HMM filter computes the state estimate (filtered density) $\hat{x}_k^p$ of the $p$th target. At each time instant, the beam is directed toward only one of the targets, say, target $p$, resulting in an observation $y_{k+1}^p$. This is processed by the $p$th HMM state filter, which updates its estimate of the target's state as

$$\hat{x}_{k+1}^p = \frac{B^{(p)}\big(y_{k+1}^p\big)\, A^{(p)\prime}\, \hat{x}_k^p}{\mathbf{1}^{\prime}\, B^{(p)}\big(y_{k+1}^p\big)\, A^{(p)\prime}\, \hat{x}_k^p} \quad \text{if the beam is directed toward target } p\ (u_k = p) \tag{5}$$

where, if $y_{k+1}^p = m$, then $B^{(p)}(m) = \mathrm{diag}\big(b_{1m}^{(p)}, \ldots, b_{N_p m}^{(p)}\big)$ is the diagonal matrix formed by the $m$th column of the observation matrix $B^{(p)}$, and $\mathbf{1}$ is the $N_p$-dimensional vector of ones. For the other targets $q \neq p$, because no measurement is obtained, we assume that the estimates of the states of these unobserved targets remain frozen at the value computed when each target was last observed, i.e., $\hat{x}_{k+1}^q = \hat{x}_k^q$. Note that this is far less stringent than assuming that the actual state of an unobserved target does not evolve, which is required for the standard fully observed multiarm bandit problem. In other words, the information state formulation reduces to assumptions on the tracking sensor behavior (which we can control) rather than on the actual underlying Markov chain or target (which we cannot control).
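The update (5) and the frozen-belief convention translate directly into code. The sketch below is again our own illustration (the names `hmm_filter_update` and `beam_step` are hypothetical); for a target observed at every step, iterating `hmm_filter_update` from the initial density reproduces the brute-force enumeration shown earlier:

```python
# Illustrative sketch (ours): one beam-scheduling step. The observed target's
# information state is updated via the HMM filter (5); all others stay frozen.
import numpy as np

def hmm_filter_update(x_hat, A, B, y):
    """One step of (5): B^(p)(y) A^(p)' x_hat, normalized to sum to one."""
    unnorm = B[:, y] * (A.T @ x_hat)   # B^(p)(y) = diag of the y-th column of B^(p)
    return unnorm / unnorm.sum()       # denominator 1' B^(p)(y) A^(p)' x_hat

def beam_step(beliefs, models, p, y):
    """Direct the beam at target p; freeze the beliefs of targets q != p."""
    A, B = models[p]
    beliefs = list(beliefs)
    beliefs[p] = hmm_filter_update(beliefs[p], A, B, y)
    return beliefs

# Tiny usage: two identical targets, uniform initial beliefs, beam on target 0.
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.2, 0.8]])
models = [(A, B), (A, B)]
beliefs = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]
print(beam_step(beliefs, models, p=0, y=1))
```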