A Hardware Efficient Implementation of a Boxes Reinforcement Learning System
Yendo Hu and Ronald D. Fellman
Department of Electrical and Computer Engineering, 0407
University of California at San Diego
December 21, 1993

This work was funded by the Microelectronic Information Processing Systems Division of the National Science Foundation, RIA Award #MIP-9008839.

Abstract

This paper presents two modifications to the Boxes-ASE/ACE reinforcement learning algorithm to improve implementation efficiency and performance. A state history queue (SHQ) replaces the decay computations associated with each control state, decoupling computational demand from the number of control states. A dynamic link table implements CMAC state association to decrease training time, yet minimize the number of control states. Simulations of the link table demonstrated its potential for minimizing control states for unoptimized state-space quantization. Simulations coupling the link table to CMAC state association show a 3-fold reduction in learning time. A hardware implementation of the pole-cart balancer shows the SHQ modification to reduce computation time 12-fold.

Introduction

The Boxes network, developed by Michie and Chambers [7] and later refined by Barto et al. [2], is a reinforcement learning algorithm designed to solve difficult adaptive control problems using associative search elements (ASE) and adaptive critic elements (ACE). The equations of motion of the physical system are not known to the network. Rather, it learns how to respond based upon feedback from past trials. This feedback evaluates system performance from a failure signal that occurs when the controlled object reaches an undesired state.

As shown in Figure 1, the ASE acts as a control table that uses the current system state as an address to retrieve a control action for the plant. The resulting action may generate a reinforcement signal, usually negative, that the ASE and ACE receive.
On the basis of a first-order linear prediction from past reinforcement signals, the ACE uses this catastrophic reinforcement signal to compute a prediction of the reinforcement signal when the plant produces no actual reinforcement. The ASE updates its control information using both this improved reinforcement signal and a trace through the previously traversed system states. The effect of the reinforcement signal on a control parameter decreases exponentially with the elapsed time since the system last entered that state. In [2], a single ASE, along with one ACE, successfully learned to solve the pole balancing problem. Many other researchers have also used a pole-cart balancer to benchmark the performance of their algorithms [8], [1].

Figure 1. Example of a simple ASE and ACE system.

The ASE assigns a register to hold an output control value for each unique system state. Taken together, these registers form a control table that maps a plant state to a control action. Each register holds the long-term trace, which represents both the output action (the trace's sign) and a confidence level (the trace's magnitude). Thus, high confidence levels are represented by large trace magnitudes. The ASE adjusts only the traces of those states that led to the reinforcement event.

A second value for each state, the short-term memory trace, tracks the contribution of a state towards producing the current system state. For each state, this short-term trace weighs the reinforcement adjustment of the long-term trace value. This mechanism helps the ACE use past states to learn from a system failure in proportion to their contribution to the current outcome.

An ASE contains one control output, O(t), one reinforcement input, r̂(t), and a decoded input state vector, I(t). Each element of I(t) represents a unique state within the system. For each I_i(t), the i-th element is 1 and all other elements are 0. The ASE output, which controls the system, is just the thresholded long-term trace selected by the decoded input state vector, as in (1), where

    θ(a) = 1 if a ≥ 0, else θ(a) = −1.

The following equations recursively relate two internally stored variables, the long-term trace w_i(t) and the short-term trace e_i(t), to each input i:

    w_i(t) = w_i(t−1) + α r(t−1) e_i(t−1)    (2)
    e_i(t) = δ e_i(t−1) + (1−δ) I_i(t−1) O(t−1)    (3)

where r(t) = reinforcement signal,
δ = ASE trace decay rate,
α = positive constant determining rate of change.

The ACE contains one modified reinforcement output, one reinforcement input, and a set of inputs equaling the number of outputs from the front-end decoder. Two internally stored variables, x_i(t), the time decay factor, and v_i(t), the state predictor, recursively update their values each cycle from the current system state. The change in the system predictor, p(t), provides feedback to update the variables v_i(t) and r̂(t). The following equations describe the operation of the ACE:

    x_i(t) = λ x_i(t−1) + (1−λ) I_i(t−1)    (4)
    v_i(t) = v_i(t−1) + β r̂(t−1) x_i(t−1)    (5)
    p(t) = Σ_{i=1}^{N} v_i(t) I_i(t)    (6)
    r̂(t) = r(t) + γ p(t) − p(t−1)    (7)

where λ = ACE trace decay rate,
N = number of possible input states,
γ = discount factor,
β = positive constant determining rate of change.

During operation of the pole-cart balancer, the system first quantizes each of four input variables (pole angle, pole angular velocity, cart position, and cart velocity) and receives the reinforcement feedback. The quantized system parameters then pass through the decoder and activate the unit state vector representing the current system state. The ASE/ACE network learned to balance the stick within an average of 70 trials [2].
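For concreteness, the following minimal Python sketch (ours, not the authors' implementation) walks through one control and learning cycle of equations (1) through (7) for a tabular ASE/ACE pair. The class name, method names, and parameter values are illustrative assumptions drawn from the general ASE/ACE literature rather than from this paper, and failure handling (resetting the traces after a trial) is omitted.

    # Minimal sketch of one ASE/ACE cycle following eqs. (1)-(7).
    # Names and parameter values are illustrative, not the authors'.
    class BoxesASEACE:
        def __init__(self, n_states, alpha=1000.0, beta=0.5,
                     delta=0.9, lam=0.8, gamma=0.95):
            self.N = n_states
            self.alpha, self.beta = alpha, beta    # rate-of-change constants (eqs. 2, 5)
            self.delta, self.lam = delta, lam      # ASE / ACE trace decay rates
            self.gamma = gamma                     # discount factor (eq. 7)
            self.w = [0.0] * n_states              # long-term traces: the control table
            self.e = [0.0] * n_states              # ASE short-term traces e_i(t)
            self.x = [0.0] * n_states              # ACE state traces x_i(t)
            self.v = [0.0] * n_states              # ACE state predictors v_i(t)
            self.prev_p = 0.0                      # p(t-1)
            self.prev_state, self.prev_output = 0, 0.0

        def act(self, state):
            """Eq. (1): output the thresholded long-term trace of the decoded state."""
            out = 1.0 if self.w[state] >= 0.0 else -1.0
            self.prev_state, self.prev_output = state, out
            return out

        def learn(self, next_state, r):
            """Eqs. (2)-(7): one reinforcement update after the action is applied."""
            p = self.v[next_state]                      # eq. (6): only one I_i(t) is 1
            r_hat = r + self.gamma * p - self.prev_p    # eq. (7): improved reinforcement
            for i in range(self.N):
                self.w[i] += self.alpha * r_hat * self.e[i]   # eq. (2)
                self.v[i] += self.beta * r_hat * self.x[i]    # eq. (5)
                self.e[i] *= self.delta                       # decay part of eq. (3)
                self.x[i] *= self.lam                         # decay part of eq. (4)
            # contribution of the state visited on the previous cycle (eqs. 3 and 4)
            self.e[self.prev_state] += (1.0 - self.delta) * self.prev_output
            self.x[self.prev_state] += (1.0 - self.lam)
            self.prev_p = p

A control loop would call act() with the decoded state index, apply the returned action to the plant, and then call learn() with the next decoded state and the reinforcement (typically zero, or negative on failure). The loop over all N states inside learn() is exactly the per-cycle cost, proportional to the size of the state space, that the state history queue described next removes.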
State History Queue

The short-term trace, or eligibility function, e_i(t) in the ASE and the state trace x_i(t) in the ACE decay exponentially and thus carry significant values only over a limited time period. All computations involving these variables are influential only during this period. The state history queuing (SHQ) scheme takes advantage of this characteristic to reduce both memory and computation time.

Rather than store a decay value for each system state, the SHQ scheme keeps only a truncated list of all recently traversed input states. Weighting these states by position in the history queue generates an equivalent decay value. By restricting the length of this list, the SHQ scheme, in effect, sets a threshold that truncates insignificant decay values: it ignores weight adjustments of those states whose last visits occurred beyond a time period specified by the list length. Thus, for each network update, only variables associated with the states listed in the queue are adjusted. This limited updating decouples computation time from the size of the input state space and eliminates unnecessary memory accesses and decay computations for the discarded states.

The state history queue is just a shift register that keeps track of the past input states (Figure 2). The register depth, H, defines the length of the queue and sets the time threshold beyond which incremental weight adjustments are assumed insignificant and are ignored. The ASE's state history queue records the decoded input address and the generated outputs.

Figure 2. The state history queue structure for the ASE (columns: address, input state, output, decay scalar).

An exponential decay function is approximated by assigning linearly decreasing decay scalars to each register within the queue. The approximated decay function for each input state is just the sum of all decay scalars of the registers holding that input state. This decay function replaces the decay variables e_i(t) and x_i(t) in the original algorithm. Using a linear function for the set of constant decay scalars further reduces the complexity of the algorithm. When a system dwells in the same input state long enough to fill the queue, the SHQ step response approximates the first two terms in a Taylor expansion of the exponential response given by (3):

    e(t) = (1 − δ^t) u(t) ≈ [tH − (t² − t)/2] K    (8)

where δ = trace decay rate,
u(t) = step function,
H = length of the state history queue,
K = decay scalar rate of change.

Figure 3. Step responses of e_i(t) and the SHQ decay function.

The simulations in the following sub-sections confirm that this approximation introduces only a small difference in overall system performance.

Our specific implementation of the SHQ is given by (9) and (10); they replace equations (2) and (4), respectively.

    w[SHQ.A[h]]_new = w[SHQ.A[h]]_old + α r̂(t) (h+1) (SHQ.O[h]) K_ase    (9)

where α = rate-of-change constant in the ASE/ACE,
h = pointer into the SHQ, 0 ≤ h < H_ase − 1,
K_ase = SHQ weight,
w = control weights (addressable memory).

    v[SHQ.A[h]]_new = v[SHQ.A[h]]_old + [r(t) + γ p(t) − p(t−1)] (h+1) K_ace    (10)

where h = pointer into the SHQ, 0 ≤ h < H_ace − 1,
γ = discount factor (a constant),
v = prediction values (addressable memory).

The original network must update every element of the e, x, w, and v arrays each cycle, whereas the modified network with the SHQ need only update some elements within these arrays. Figure 4 gives the number of multiplication and addition operations required to update each array.

Figure 4. Comparative effectiveness of the SHQ (operations per update; M = total number of possible input states, H = length of the state history queue).

From equations (2) and (9), the time efficiency factor becomes:

    TF_orig = (4M + 2) T_a + (M + 2) T_m
    TF_shq = 2 T_s H + (3H + 2) T_a + (4H + 2) T_m    (13)

where T_s = time to perform one shift operation,
T_a = time to perform one addition operation,
T_m = time to perform one multiply operation.

The memory needed for the original network must be large enough to hold the four arrays: e, x, w, and v.

Matching the SHQ to the Eligibility Function

Two parameters control the effective time decay in the SHQ scheme: K, the decay slope, and H, the register length.
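As an illustration of the SHQ mechanism, here is a minimal Python sketch (ours, not the authors' hardware) of the queue-based update in the spirit of equations (9) and (10). The class name SHQBoxes, the placement of the learning-rate constant relative to K_ase and K_ace, and all parameter values are assumptions for the sketch.

    from collections import deque

    # Sketch of the state-history-queue update, eqs. (9)-(10).
    # Only the states currently in the queue are touched, so the per-cycle
    # cost depends on the queue depth H, not on the number of states M.
    class SHQBoxes:
        def __init__(self, n_states, depth_H, alpha=0.5, K_ase=0.01,
                     K_ace=0.01, gamma=0.95):
            self.H = depth_H
            self.alpha, self.gamma = alpha, gamma
            self.K_ase, self.K_ace = K_ase, K_ace
            self.w = [0.0] * n_states            # control weights (addressable memory)
            self.v = [0.0] * n_states            # prediction values (addressable memory)
            self.shq = deque(maxlen=depth_H)     # (decoded address, output) pairs
            self.prev_p = 0.0

        def act(self, state):
            """Thresholded control-table lookup, then shift the new entry into
            the SHQ; the oldest entry falls off once the queue holds H entries."""
            out = 1.0 if self.w[state] >= 0.0 else -1.0
            self.shq.append((state, out))
            return out

        def learn(self, next_state, r):
            """Adjust only the queued states, weighting each entry by a linearly
            increasing scalar (h+1) so the most recent entry gets the largest
            adjustment, as in eqs. (9) and (10)."""
            p = self.v[next_state]
            r_hat = r + self.gamma * p - self.prev_p          # eq. (7)
            for h, (addr, out) in enumerate(self.shq):        # h = 0 is the oldest entry
                self.w[addr] += self.alpha * r_hat * (h + 1) * self.K_ase * out  # eq. (9)
                self.v[addr] += r_hat * (h + 1) * self.K_ace                     # eq. (10)
            self.prev_p = p

Because each reinforcement update now touches at most H queue entries instead of all M control states, and no per-state eligibility arrays are kept, the computation per cycle and the memory traffic become independent of the size of the state space, which is the decoupling described above.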
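To make equation (8) concrete, the short sketch below (again ours, with illustrative values) compares the step response of the exponential eligibility trace of equation (3) with the SHQ's truncated-Taylor approximation. Matching the two leading Taylor terms of 1 − δ^t suggests K ≈ (ln(1/δ))² and H ≈ 1/ln(1/δ) − 1/2; this particular choice of K and H is our assumption for the demonstration, not necessarily the matching the paper uses.

    import math

    # Compare the exponential step response of eq. (3) with the SHQ
    # approximation of eq. (8): e_shq(t) = K * (t*H - (t*t - t)/2).
    delta = 0.9                      # illustrative ASE trace decay rate
    eps = math.log(1.0 / delta)      # -ln(delta)
    K = eps * eps                    # one plausible match of the t^2 Taylor term
    H = round(1.0 / eps - 0.5)       # one plausible match of the t^1 Taylor term

    def e_exp(t):
        """Step response of eq. (3) for a constantly revisited state: 1 - delta^t."""
        return 1.0 - delta ** t

    def e_shq(t):
        """Step response of the SHQ decay sum, eq. (8), up to a full queue."""
        return K * (t * H - (t * t - t) / 2.0)

    for t in range(H + 1):
        print(f"t={t:2d}  exponential={e_exp(t):.3f}  SHQ approx={e_shq(t):.3f}")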