
arXiv:0810.3828v1 [quant-ph] 21 Oct 2008

Quantum Reinforcement Learning

Daoyi Dong, Chunlin Chen, Hanxiong Li, Tzyh-Jong Tarn

Abstract—The key approaches for machine learning, especially learning in unknown probabilistic environments, are new representations and computation mechanisms. In this paper, a novel quantum reinforcement learning (QRL) method is proposed by combining quantum theory and reinforcement learning (RL). Inspired by the state superposition principle and quantum parallelism, a framework of a value updating algorithm is introduced. The state (action) in traditional RL is identified as the eigen state (eigen action) in QRL. The state (action) set can be represented with a quantum superposition state, and the eigen state (eigen action) can be obtained by randomly observing the simulated quantum state according to the collapse postulate of quantum measurement. The probability of an eigen action is determined by the probability amplitude, which is parallelly updated according to rewards. Some related characteristics of QRL such as convergence, optimality and balancing between exploration and exploitation are also analyzed, which shows that this approach makes a good tradeoff between exploration and exploitation using the probability amplitude and can speed up learning through quantum parallelism. To evaluate the performance and practicability of QRL, several simulated experiments are given and the results demonstrate the effectiveness and superiority of the QRL algorithm for some complex problems. The present work is also an effective exploration on the application of quantum computation to artificial intelligence.

Index Terms—quantum reinforcement learning, state superposition, collapse, probability amplitude, Grover iteration.

This work was supported in part by the National Natural Science Foundation of China under Grant No. 60703083, the China Postdoctoral Science Foundation (20060400515), the K. C. Wong Education Foundation (Hong Kong), and by a grant from RGC of Hong Kong (CityU: 116406).
D. Dong is with the Key Laboratory of Systems and Control, Institute of Systems Science, AMSS, Chinese Academy of Sciences, Beijing 100190, China (email: [email protected]).
C. Chen is with the Department of Control and System Engineering, Nanjing University, Nanjing 210093, China (email: clchen@nju.edu.cn).
H. Li is with the Central South University, Changsha 410083, China, and the Department of Manufacturing Engineering and Engineering Management, City University of Hong Kong, Hong Kong.
T. J. Tarn is with the Department of Electrical and Systems Engineering, Washington University in St. Louis, St. Louis, MO 63130 USA.

I. INTRODUCTION

LEARNING methods are generally classified into supervised, unsupervised and reinforcement learning (RL). Supervised learning requires explicit feedback provided by input-output pairs and gives a map from inputs to outputs. Unsupervised learning only processes the input data. In contrast, RL uses a scalar value named reward to evaluate the input-output pairs and learns a mapping from states to actions by interaction with the environment through trial-and-error. Since the 1980s, RL has become an important approach to machine learning [1]-[22], and is widely used in artificial intelligence, especially in robotics [7]-[10], [18], due to its good performance of on-line adaptation and powerful learning ability for complex nonlinear systems. However, there are still some difficult problems in practical applications. One problem is the exploration strategy, which contributes a lot to a better balancing between exploration (trying previously unexplored strategies to find a better policy) and exploitation (taking the most advantage of the experienced knowledge). The other is the slow learning speed, especially for complex problems sometimes known as "the curse of dimensionality", when the state-action space becomes huge and the number of parameters to be learned grows exponentially with its dimension.

To combat those problems, many methods have been proposed in recent years. Temporal abstraction and decomposition methods have been explored to solve such problems as dynamic programming (DP) [11]-[15]. Different kinds of learning paradigms are combined to optimize RL and to speed up the learning process of RL. For example, Smith [16] presented a new model for representation and generalization in model-less RL based on the self-organizing map (SOM) and standard Q-learning. The adaptation of Watkins' Q-learning with fuzzy inference systems is also proposed for problems with large state-action spaces or with continuous state spaces [6], [17], [18], [19]. Many related improvements with specific modifications are also implemented to RL methods [20], [21], [22]. In spite of all these attempts and successes [7], [9], [10], [21], more work is needed to explore new ideas and more effective representation and learning mechanisms to achieve satisfactory performance in practice. In this paper, we explore quantum theory and propose a novel quantum reinforcement learning method to overcome some difficulties in RL using quantum computation.

Quantum information processing is a rapidly developing field. Some results have shown that quantum computation can more efficiently solve some difficult problems than its classical counterpart. Two important quantum algorithms, the Shor algorithm [23], [24] and the Grover algorithm [25], [26], were proposed in 1994 and 1996, respectively. The Shor algorithm can give an exponential speedup for factoring large integers into prime numbers, and it has been realized for the factorization of the integer 15 using nuclear magnetic resonance (NMR) [27]. The Grover algorithm can achieve a square speedup over classical algorithms in searching an unsorted database, and its experimental implementations have also been demonstrated using NMR [28]-[30] and quantum optics [31], [32] for a system with four states. Some methods have also been explored to connect quantum computation and machine learning. For example, the quantum version of the artificial neural network has been studied from pure theory to simple simulated and experimental implementation [33]-[37]. Rigatos and Tzafestas [38] used quantum computation for the parallelization of a fuzzy logic control algorithm to speed up the fuzzy inference. Quantum or quantum-inspired evolutionary algorithms have been proposed to improve existing evolutionary algorithms [39]. Hogg and Portnov [40] presented a quantum algorithm for combinatorial optimization of overconstrained satisfiability (SAT) and asymmetric travelling salesman problems (ATSP).

Recently, the quantum search technique has also been applied to dynamic programming [41]. Taking advantage of quantum computation, some novel algorithms inspired by quantum characteristics will not only improve the performance of existing algorithms on traditional computers, but also promote the development of related research areas such as quantum computers and machine learning. Considering the essence of computation and algorithms, Dong and his co-workers [42] have presented the concept of quantum reinforcement learning (QRL) inspired by the state superposition principle and quantum parallelism. Following this concept, in this paper we give a formal quantum reinforcement learning algorithm framework and specifically demonstrate the advantages of QRL for speeding up learning and obtaining a good tradeoff between exploration and exploitation of RL through simulated experiments and some related discussions.

This paper is organized as follows. Section II contains the prerequisite and problem description of standard reinforcement learning, quantum computation and related quantum gates. In Section III, quantum reinforcement learning is introduced systematically, where the state (action) space is represented with the quantum state, the exploration strategy based on the collapse postulate is achieved and a novel QRL algorithm is proposed specifically. Section IV analyzes related characteristics of QRL such as the convergence, optimality and the balancing between exploration and exploitation. Section V describes the simulated experiments, and the results demonstrate the effectiveness and superiority of the QRL algorithm. In Section VI, we briefly discuss some related problems of QRL for future work. Concluding remarks are given in Section VII.

II. PREREQUISITE AND PROBLEM DESCRIPTION

In this section we first briefly review the standard reinforcement learning algorithms and then introduce the background of quantum computation and some related quantum gates.

A. Reinforcement learning (RL)

The standard framework of reinforcement learning is based on discrete-time, finite-state Markov decision processes (MDPs) [1].

Definition 1 (MDP): A Markov decision process (MDP) is composed of the following five factors: $\{S, A_{(i)}, p_{ij}(a), r(i,a), V\}$, $i, j \in S$, $a \in A_{(i)}$, where: $S$ is the state space; $A_{(i)}$ is the action space for state $i$; $p_{ij}(a)$ is the probability for state transition; $r$ is a reward function, $r: \Gamma \to (-\infty, +\infty)$, where $\Gamma = \{(i,a) \mid i \in S, a \in A_{(i)}\}$; $V$ is a criterion function or objective function.

According to the definition of MDP, we know that the MDP history is composed of successive states and decisions: $h_n = (s_0, a_0, s_1, a_1, \ldots, s_{n-1}, a_{n-1}, s_n)$. The policy $\pi$ is a sequence $\pi = (\pi_0, \pi_1, \ldots)$: when the history at step $n$ is $h_n$, the strategy $\pi_n(\cdot \mid h_n)$ is adopted to make a decision according to the probability distribution on $A_{(s_n)}$.

RL algorithms assume that the state space $S$ and action space $A_{(s_n)}$ can be divided into discrete values. At a certain step $t$, the agent observes the state of the environment (inside and outside of the agent) $s_t$, and then chooses an action $a_t$. After executing the action, the agent receives a reward $r_{t+1}$, which reflects how good that action is (in a short-term sense). The state of the environment will change to the next state $s_{t+1}$ under the action $a_t$. The agent will then choose the next action $a_{t+1}$ according to related knowledge.

The goal of reinforcement learning is to learn a mapping from states to actions, that is to say, the agent is to learn a policy $\pi: S \times \cup_{i \in S} A_{(i)} \to [0, 1]$, so that the expected sum of discounted rewards of each state will be maximized:

$$V^{\pi}_{(s)} = E\{r_{t+1} + \gamma r_{t+2} + \cdots \mid s_t = s, \pi\} = E[r_{t+1} + \gamma V^{\pi}_{(s_{t+1})} \mid s_t = s, \pi] = \sum_{a \in A_s} \pi(s,a)\Big[r_s^a + \gamma \sum_{s'} p_{ss'}^a V^{\pi}_{(s')}\Big] \qquad (1)$$

where $\gamma \in [0,1]$ is a discount factor, $\pi(s,a)$ is the probability of selecting action $a$ in state $s$ under policy $\pi$, $p_{ss'}^a = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$ is the probability for state transition and $r_s^a = E\{r_{t+1} \mid s_t = s, a_t = a\}$ is the expected one-step reward. $V^{\pi}_{(s)}$ (or $V(s)$) is also called the value function of state $s$, and the temporal difference (TD) one-step updating rule of $V(s)$ may be described as

$$V(s) \leftarrow V(s) + \alpha\,(r + \gamma V(s') - V(s)) \qquad (2)$$

where $\alpha \in (0,1)$ is the learning rate. We have the optimal state-value function

$$V^{*}_{(s)} = \max_{a \in A_s}\Big[r_s^a + \gamma \sum_{s'} p_{ss'}^a V^{*}_{(s')}\Big] \qquad (3)$$

$$\pi^{*} = \arg\max_{\pi} V^{\pi}_{(s)}, \quad \forall s \in S \qquad (4)$$

In dynamic programming, (3) is also called the Bellman equation of $V^{*}$.

As for state-action pairs, there are similar value functions and Bellman equations, and $Q^{\pi}_{(s,a)}$ stands for the value of taking action $a$ in state $s$ under the policy $\pi$:

$$Q^{\pi}_{(s,a)} = E\{r_{t+1} + \gamma r_{t+2} + \cdots \mid s_t = s, a_t = a, \pi\} = r_s^a + \gamma \sum_{s'} p_{ss'}^a V^{\pi}_{(s')} = r_s^a + \gamma \sum_{s'} p_{ss'}^a \sum_{a'} \pi(s',a') Q^{\pi}_{(s',a')} \qquad (5)$$

$$Q^{*}_{(s,a)} = \max_{\pi} Q^{\pi}_{(s,a)} = r_s^a + \gamma \sum_{s'} p_{ss'}^a \max_{a'} Q^{*}_{(s',a')} \qquad (6)$$

Let $\alpha$ be the learning rate; the one-step updating rule of Q-learning (a widely used RL algorithm) [5] is:

$$Q(s_t,a_t) \leftarrow (1-\alpha)\,Q(s_t,a_t) + \alpha\big(r_{t+1} + \gamma \max_{a'} Q(s_{t+1},a')\big) \qquad (7)$$

There are many effective standard RL algorithms like Q-learning, for example TD(λ), Sarsa, etc. For more details see [1].
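As a concrete illustration of the updating rules (2) and (7), the following minimal Python sketch applies one TD(0) update and one Q-learning update for a single observed transition. The tabular array shapes and the small example MDP are illustrative assumptions, not part of the original description.

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One-step TD(0) update of the state-value table V, as in Eq. (2)."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One-step Q-learning update of the state-action table Q, as in Eq. (7)."""
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]))

# Hypothetical usage on a small MDP with 5 states and 2 actions.
V = np.zeros(5)
Q = np.zeros((5, 2))
td0_update(V, s=0, r=1.0, s_next=1)
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
```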

B. State superposition and quantum parallelism

Analogous to classical bits, the fundamental concept in quantum computation is the quantum bit (qubit). The two basic states for a qubit are denoted as |0⟩ and |1⟩, which correspond to the states 0 and 1 for a classical bit. However, besides |0⟩ or |1⟩, a qubit can also lie in a superposition state of |0⟩ and |1⟩. In other words, a qubit |ψ⟩ can generally be expressed as a linear combination of |0⟩ and |1⟩:

$$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle \qquad (8)$$

where α and β are complex coefficients. This special quantum phenomenon is called the state superposition principle, which is an important difference between classical computation and quantum computation [43].

The physical carrier of a qubit is any two-state quantum system such as a two-level atom, a spin-1/2 particle or a polarized photon. For a physical qubit, when we select a set of bases |0⟩ and |1⟩, we indicate that an observable Ô of the qubit system has been chosen and the bases correspond to the two eigenvectors of Ô. For convenience, the measurement process on the observable Ô of a quantum system in the corresponding state |ψ⟩ is directly called a measurement of the quantum state |ψ⟩ in this paper. When we measure a qubit in the superposition state |ψ⟩, the qubit system collapses into one of its basic states |0⟩ or |1⟩. However, we cannot determine in advance whether it will collapse to state |0⟩ or |1⟩. We only know that we get this qubit in state |0⟩ with probability |α|², or in state |1⟩ with probability |β|². Hence α and β are generally called probability amplitudes. The magnitude and argument of a probability amplitude represent amplitude and phase, respectively. Since the sum of probabilities must be equal to 1, α and β should satisfy |α|² + |β|² = 1.

According to quantum computation theory, a fundamental operation in the quantum computing process is a unitary transformation U on the qubits. If one applies a transformation U to a superposition state, the transformation will act on all basis vectors of this state and the output will be a new superposition state obtained by superposing the results of all basis vectors. It seems that the transformation U can simultaneously evaluate the different values of a function f(x) for a certain input x, and this is called quantum parallelism. Quantum parallelism is one of the most important factors behind the powerful ability of quantum algorithms. However, note that this parallelism is not immediately useful [44] since a direct measurement on the output generally gives only f(x) for one value of x.

Suppose the input qubit |z⟩ lies in the superposition state:

$$|z\rangle = \alpha|0\rangle + \beta|1\rangle \qquad (9)$$

The transformation $U_z$ which describes the computing process may be defined as follows:

$$U_z : |z, 0\rangle \to |z, f(z)\rangle \qquad (10)$$

where |z, 0⟩ represents the joint input state with the first qubit in |z⟩ and the second qubit in |0⟩, and |z, f(z)⟩ is the joint output state with the first qubit in |z⟩ and the second qubit in |f(z)⟩. According to equations (9) and (10), we can easily obtain [44]:

$$U_z|z, 0\rangle = \alpha|0, f(0)\rangle + \beta|1, f(1)\rangle \qquad (11)$$

The result contains information about both f(0) and f(1), and we seem to evaluate f(z) for two values of z simultaneously. The above process corresponds to a "quantum black box" (or oracle). By feeding quantum superposition states to a quantum black box, we can learn what is inside with an exponential speedup, compared to how long it would take if we were only allowed classical inputs [43].

Now consider an n-qubit system, which can be represented with the tensor product of n qubits:

$$|\phi\rangle = |\psi_1\rangle \otimes |\psi_2\rangle \otimes \cdots \otimes |\psi_n\rangle = \sum_{x=00\cdots0}^{11\cdots1} C_x |x\rangle \qquad (12)$$

where '⊗' means tensor product, $\sum_{x=00\cdots0}^{11\cdots1} |C_x|^2 = 1$, $C_x$ is a complex coefficient and $|C_x|^2$ represents the occurrence probability of |x⟩ when the state |φ⟩ is measured. |x⟩ can take on $2^n$ values, so the superposition state can be looked upon as the superposition of all integers from 0 to $2^n - 1$. Since U is a unitary transformation, computing the function f(x) gives [43]:

$$U \sum_{x=00\cdots0}^{11\cdots1} C_x |x, 0\rangle = \sum_{x=00\cdots0}^{11\cdots1} C_x\, U|x, 0\rangle = \sum_{x=00\cdots0}^{11\cdots1} C_x |x, f(x)\rangle \qquad (13)$$

Based on the above analysis, it is easy to find that an n-qubit system can simultaneously process $2^n$ states, although only one of the $2^n$ states is accessible through a direct measurement, and an ability is required to extract information about more than one value of f(x) from the output superposition state [44]. This is different from classical parallel computation, where multiple circuits built to compute f(x) are executed simultaneously, since quantum parallelism does not necessarily make a tradeoff between computation time and needed physical space. In fact, quantum parallelism employs a single circuit to simultaneously evaluate the function for multiple values by exploiting the quantum state superposition principle and provides an exponential-scale computation space in the n-qubit linear physical space. Therefore quantum computation can effectively increase the computing speed of some important classical functions. So it is possible to obtain significant results by fusing quantum computation into reinforcement learning theory.

C. Quantum Gates

In classical computation, the logic operators that can complete some specific tasks are called logic gates, such as the NOT gate, AND gate, XOR gate, and so on. Analogously, quantum computing tasks can be completed through quantum gates. Nowadays some simple quantum gates such as the quantum NOT gate and the quantum CNOT gate have been built in quantum computation. Here we only introduce two important quantum gates, the Hadamard gate and the phase gate, which are closely related to accomplishing some quantum logic operations for the present quantum reinforcement learning. A detailed discussion about quantum gates can be found in Ref. [44].

The Hadamard gate (or Hadamard transform) is one of the most useful quantum gates and can be represented as [44]:

$$H \equiv \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \qquad (14)$$

Through the Hadamard gate, a qubit in the state |0⟩ is transformed into an equally weighted superposition state of |0⟩ and |1⟩, i.e.

$$H|0\rangle \equiv \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \frac{1}{\sqrt{2}}|0\rangle + \frac{1}{\sqrt{2}}|1\rangle \qquad (15)$$

Similarly, a qubit in the state |1⟩ is transformed into the superposition state $\frac{1}{\sqrt{2}}|0\rangle - \frac{1}{\sqrt{2}}|1\rangle$, i.e. the magnitude of the amplitude in each state is $1/\sqrt{2}$, but the phase of the amplitude in the state |1⟩ is inverted. In classical probabilistic algorithms, the phase has no analog since the amplitudes are in general complex numbers in quantum mechanics.

The other related quantum gate is the phase gate (conditional phase shift operation), which is an important element to carry out the Grover iteration for reinforcing a "good" decision. According to quantum information theory, this transformation may be efficiently implemented on a quantum computer. For example, the transformation describing this for a two-state system is of the form:

$$U_{phase} = \begin{pmatrix} 1 & 0 \\ 0 & e^{i\varphi} \end{pmatrix} \qquad (16)$$

where $i = \sqrt{-1}$ and φ is an arbitrary real number [26].

III. QUANTUM REINFORCEMENT LEARNING (QRL)

Just like traditional reinforcement learning, a quantum reinforcement learning system can also be identified by three main subelements: a policy, a reward function and a model of the environment (maybe not explicit). But quantum reinforcement learning algorithms are remarkably different from all those traditional RL algorithms in the following intrinsic aspects: representation, policy, parallelism and updating operation.

A. Representation

As we represent a QRL system with quantum concepts, we similarly have the following definitions and propositions for quantum reinforcement learning.

Definition 2 (Eigen states (or eigen actions)): Select an observable of a quantum system; its eigenvectors form a set of complete orthonormal bases in a Hilbert space. The states s (or actions a) in Definition 1 are denoted as the corresponding orthogonal bases and are called the eigen states or eigen actions in QRL.

Remark 1: In the remainder of this paper, we indicate that an observable has been chosen but we do not present the observable specifically when mentioning a set of orthogonal bases.

From Definition 2, we can get the set of eigen states S and that of eigen actions for state i: A(i). The eigen state (eigen action) in QRL corresponds to the state (action) in traditional RL. According to quantum mechanics, the quantum state of a general closed quantum system can be represented with a unit vector |ψ⟩ (Dirac representation) in a Hilbert space. The inner product of |ψ₁⟩ and |ψ₂⟩ can be written as ⟨ψ₁|ψ₂⟩ and the normalization condition for |ψ⟩ is ⟨ψ|ψ⟩ = 1. As the simplest quantum mechanical system, the state of a qubit can be described as (8) and its normalization condition is equivalent to |α|² + |β|² = 1.

Remark 2: According to the superposition principle in quantum computation, since a quantum reinforcement learning system can lie in some orthogonal quantum states, which correspond to the eigen states (eigen actions), it can also lie in an arbitrary superposition state. That is to say, a QRL system which can take on the states (or actions) |ψₙ⟩ is also able to occupy their linear superposition state (or action)

$$|\psi\rangle = \sum_n \beta_n |\psi_n\rangle \qquad (17)$$

It is worth noting that this is only a representation method and our goal is to take advantage of the quantum characteristics in the learning process. In fact, the state (action) in QRL is not a practical state (action); it is only an artificial state (action) for computing convenience with quantum systems. The practical state (action) is the eigen state (eigen action) in QRL. For an arbitrary state (or action) in a quantum reinforcement learning system, we can obtain Proposition 1.

Proposition 1: An arbitrary state |S⟩ (or action |A⟩) in QRL can be expanded in terms of an orthogonal set of eigen states |sₙ⟩ (or eigen actions |aₙ⟩), i.e.

$$|S\rangle = \sum_n \alpha_n |s_n\rangle \qquad (18)$$

$$|A\rangle = \sum_n \beta_n |a_n\rangle \qquad (19)$$

where αₙ and βₙ are probability amplitudes and satisfy $\sum_n |\alpha_n|^2 = 1$ and $\sum_n |\beta_n|^2 = 1$.

Remark 3: The states and actions in QRL are different from those in traditional RL: (1) The sum of several states (or actions) does not have a definite meaning in traditional RL, but the sum of states (or actions) in QRL is still a possible state (or action) of the same quantum system. (2) When |S⟩ takes on an eigen state |sₙ⟩, it is exclusive. Otherwise, it has the probability |αₙ|² to be in the eigen state |sₙ⟩. The same analysis is also suitable for the action |A⟩.

Since quantum computation is built upon the concept of the qubit, as described in Section II, for the convenience of processing we consider using multiple-qubit systems to express states and actions, and propose a formal representation of them for the QRL system. Let Nₛ and Nₐ be the numbers of states and actions, then choose numbers m and n, which are characterized by the following inequalities:

$$N_s \le 2^m \le 2N_s, \qquad N_a \le 2^n \le 2N_a \qquad (20)$$

Using m and n qubits to represent the eigen state set S = {|sᵢ⟩} and the eigen action set A = {|aⱼ⟩} respectively, we can obtain the corresponding relations as follows (the summation indices s and a run over all m-qubit and n-qubit binary strings, respectively):

$$|s^{(N_s)}\rangle = \sum_{i=1}^{N_s} C_i |s_i\rangle \;\leftrightarrow\; |s^{(m)}\rangle = \sum_{s=00\cdots0}^{11\cdots1} C_s |s\rangle \qquad (21)$$

$$|a_s^{(N_a)}\rangle = \sum_{j=1}^{N_a} C_j |a_j\rangle \;\leftrightarrow\; |a_s^{(n)}\rangle = \sum_{a=00\cdots0}^{11\cdots1} C_a |a\rangle \qquad (22)$$
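The correspondences (21)-(22) can be simulated classically by storing one complex amplitude per basis state of the register. The sketch below is only a classical simulation under the assumptions that the Nₛ (or Nₐ) eigen states (actions) are indexed 0..N−1 and that the unused basis states of the 2^m-dimensional register keep zero amplitude; it builds such a register and samples an eigen state according to the collapse postulate with probability |Cₛ|².

```python
import numpy as np

def num_qubits(n_items: int) -> int:
    """Smallest m with n_items <= 2**m, consistent with inequality (20)."""
    return int(np.ceil(np.log2(max(n_items, 1))))

def uniform_register(n_items: int) -> np.ndarray:
    """Complex amplitude vector over 2**m basis states; the first n_items
    eigen states start with equal amplitudes, the rest stay at zero."""
    m = num_qubits(n_items)
    amps = np.zeros(2 ** m, dtype=complex)
    amps[:n_items] = 1.0 / np.sqrt(n_items)
    return amps

def measure(amps: np.ndarray, rng=np.random) -> int:
    """Collapse: sample one eigen index with probability |C|^2."""
    probs = np.abs(amps) ** 2
    probs /= probs.sum()                # guard against rounding drift
    return rng.choice(len(amps), p=probs)

state_register = uniform_register(n_items=400)   # e.g. a 20x20 gridworld (m = 9, 400 <= 512 <= 800)
observed_state = measure(state_register)
```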

In other words, the states (or actions) of a QRL system may lie in a superposition state of eigen states (or eigen actions). The inequalities in (20) guarantee that every state and action in traditional RL has a corresponding representation with eigen states and eigen actions in QRL. The probability amplitudes $C_s$ and $C_a$ are complex numbers and satisfy

$$\sum_{s=00\cdots0}^{11\cdots1} |C_s|^2 = 1 \qquad (23)$$

$$\sum_{a=00\cdots0}^{11\cdots1} |C_a|^2 = 1 \qquad (24)$$

B. Action selection policy

In QRL, the agent is also to learn a policy $\pi: S \times \cup_{i \in S} A_{(i)} \to [0, 1]$, which will maximize the expected sum of discounted rewards of each state. That is to say, the mapping from states to actions is $\pi: S \to A$, and we have

$$f(s) = |a_s^{(n)}\rangle = \sum_{a=00\cdots0}^{11\cdots1} C_a |a\rangle \qquad (25)$$

where the probability amplitude $C_a$ satisfies (24). Here, the action selection policy is based on the collapse postulate:

Definition 3 (Action collapse): When an action $|A\rangle = \sum_n \beta_n |a_n\rangle$ is measured, it will be changed and will collapse randomly into one of its eigen actions |aₙ⟩ with the corresponding probability $|\langle a_n|A\rangle|^2$:

$$|\langle a_n|A\rangle|^2 = \big|(|a_n\rangle)^* |A\rangle\big|^2 = \Big|(|a_n\rangle)^* \sum_n \beta_n |a_n\rangle\Big|^2 = |\beta_n|^2 \qquad (26)$$

Remark 4: According to Definition 3, when an action $|a_s^{(n)}\rangle$ in (25) is measured, we will get |a⟩ with occurrence probability $|C_a|^2$. In the QRL algorithm, we will amplify the probability of a "good" action according to the corresponding rewards. It is obvious that the collapse action selection method is not a real action selection policy theoretically. It is just a fundamental phenomenon when a quantum state is measured, which results in a good balancing between exploration and exploitation and a natural "action selection" without setting parameters. A more detailed discussion about the action selection can also be found in Ref. [45].

C. Paralleling state value updating

In Proposition 1, we pointed out that every possible state |S⟩ of QRL can be expanded in terms of an orthogonal complete set of eigen states |sₙ⟩: $|S\rangle = \sum_n \alpha_n |s_n\rangle$. According to quantum parallelism, a certain unitary transformation U on the qubits can be implemented. Suppose we have such an operation which can simultaneously process these $2^m$ states with the TD(0) value updating rule

$$V(s) \leftarrow V(s) + \alpha\,(r + \gamma V(s') - V(s)) \qquad (27)$$

where α is the learning rate, and the meaning of the reward r and the value function V is the same as that in traditional RL. It is like parallel value updating of traditional RL over all states. However, it provides an exponential-scale computation space in the m-qubit linear physical space and can speed up the solutions of related functions. In this paper we will simulate the QRL process on the traditional computer in Section V. How to realize some specific functions of the algorithm using quantum gates in detail is our future work.

D. Probability amplitude updating

In QRL, action selection is executed by measuring the action $|a_s^{(n)}\rangle$ related to a certain state |s⟩, which will collapse to |a⟩ with occurrence probability $|C_a|^2$. So there is no doubt that probability amplitude updating is the key to recording the "trial-and-error" experience and learning to be more intelligent.

As the action $|a_s^{(n)}\rangle$ is the superposition of $2^n$ possible eigen actions, finding out |a⟩ usually interacts with changing its probability amplitude for a quantum system. The updating of the probability amplitude is based on the Grover iteration [26]. First, prepare the equally weighted superposition of all eigen actions

$$|a_0^{(n)}\rangle = \frac{1}{\sqrt{2^n}}\Big(\sum_{a=00\cdots0}^{11\cdots1} |a\rangle\Big) \qquad (28)$$

This process can be done easily by applying n Hadamard gates in sequence to n independent qubits with initial states |0⟩ respectively [26], which can be represented as:

$$H^{\otimes n}|00\cdots0\rangle = \frac{1}{\sqrt{2^n}}\Big(\sum_{a=00\cdots0}^{11\cdots1} |a\rangle\Big) \qquad (29)$$

We know that |a⟩ is an eigen action, irrespective of the value of a, so that

$$|\langle a_0^{(n)}|a\rangle| = \frac{1}{\sqrt{2^n}} \qquad (30)$$
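The preparation step (28)-(29) is easy to check numerically. The short sketch below builds $H^{\otimes n}$ as a Kronecker product and applies it to |00...0⟩; it is an illustrative classical simulation, and the choice n = 2 is only an example.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Hadamard gate, Eq. (14)

def hadamard_n(n: int) -> np.ndarray:
    """n-fold tensor (Kronecker) product H^(x)n."""
    U = np.array([[1.0]])
    for _ in range(n):
        U = np.kron(U, H)
    return U

n = 2                                   # two qubits encode four eigen actions
zero_n = np.zeros(2 ** n)
zero_n[0] = 1.0                         # |00...0>
a0 = hadamard_n(n) @ zero_n             # equally weighted superposition, Eqs. (28)-(29)
print(a0)                               # every amplitude equals 1/sqrt(2**n), Eq. (30)
```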

Fig. 1. The schematic of a single Grover iteration. $U_a$ flips $|a_s\rangle$ into $|a_s'\rangle$ and $U_{a_0^{(n)}}$ flips $|a_s'\rangle$ into $|a_s''\rangle$. One Grover iteration $U_{Grov}$ rotates $|a_s\rangle$ by 2θ.

Fig. 2. The effect of Grover iterations in the Grover algorithm and QRL. (a) Initial state; (b) Grover iterations for amplifying $|C_a|^2$ to almost 1; (c) Grover iterations for reinforcing action |a⟩ to probability $\sin^2[(2L+1)\theta]$.

To construct the Grover iteration we combine two reflections $U_a$ and $U_{a_0^{(n)}}$ [44]:

$$U_a = I - 2|a\rangle\langle a| \qquad (31)$$

$$U_{a_0^{(n)}} = H^{\otimes n}(2|0\rangle\langle 0| - I)H^{\otimes n} = 2|a_0^{(n)}\rangle\langle a_0^{(n)}| - I \qquad (32)$$

where I is the unit matrix with appropriate dimensions and $U_a$ corresponds to the oracle O in the Grover algorithm [44]. The outer product |a⟩⟨a| is defined as $|a\rangle\langle a| = |a\rangle(|a\rangle)^*$. Obviously, we have

$$U_a|a\rangle = (I - 2|a\rangle\langle a|)|a\rangle = |a\rangle - 2|a\rangle = -|a\rangle \qquad (33)$$

$$U_a|a^{\perp}\rangle = (I - 2|a\rangle\langle a|)|a^{\perp}\rangle = |a^{\perp}\rangle \qquad (34)$$

where $|a^{\perp}\rangle$ represents an arbitrary state orthogonal to |a⟩. Hence $U_a$ flips the sign of the action |a⟩, but acts trivially on any action orthogonal to |a⟩. This transformation has a simple geometrical interpretation. Acting on any vector in the $2^n$-dimensional Hilbert space, $U_a$ reflects the vector about the hyperplane orthogonal to |a⟩. Analogous to the analysis in the Grover algorithm, $U_a$ can be looked upon as a quantum black box, which can effectively judge whether the action is the "good" eigen action. Similarly, $U_{a_0^{(n)}}$ preserves $|a_0^{(n)}\rangle$, but flips the sign of any vector orthogonal to $|a_0^{(n)}\rangle$.

Thus one Grover iteration is the unitary transformation [28], [44]

$$U_{Grov} = U_{a_0^{(n)}} U_a \qquad (35)$$

Now let us consider how the Grover iteration acts in the plane spanned by $|a_0^{(n)}\rangle$ and |a⟩. The initial action in equation (28) can be re-expressed as

$$f(s) = |a_0^{(n)}\rangle = \frac{1}{\sqrt{2^n}}|a\rangle + \sqrt{\frac{2^n - 1}{2^n}}|a^{\perp}\rangle \qquad (36)$$

Recall that

$$|\langle a_0^{(n)}|a\rangle| = \frac{1}{\sqrt{2^n}} \equiv \sin\theta \qquad (37)$$

Thus

$$f(s) = |a_0^{(n)}\rangle = \sin\theta\,|a\rangle + \cos\theta\,|a^{\perp}\rangle \qquad (38)$$

This procedure of the Grover iteration $U_{Grov}$ can be visualized geometrically in Fig. 1. The figure shows that $|a_0^{(n)}\rangle$ is rotated by θ from the axis $|a^{\perp}\rangle$ normal to |a⟩ in the plane. $U_a$ reflects a vector $|a_s\rangle$ in the plane about the axis $|a^{\perp}\rangle$ to $|a_s'\rangle$, and $U_{a_0^{(n)}}$ reflects the vector $|a_s'\rangle$ about the axis $|a_0^{(n)}\rangle$ to $|a_s''\rangle$. From Fig. 1 we know that

$$\frac{\alpha - \beta}{2} + \beta = \theta \qquad (39)$$

Thus we have α + β = 2θ. So one Grover iteration $U_{Grov} = U_{a_0^{(n)}} U_a$ rotates any vector $|a_s\rangle$ by 2θ.

We can now carry out a certain number of Grover iterations to update the probability amplitudes according to the respective rewards and value functions. It is obvious that 2θ is the updating stepsize. Thus when an action |a⟩ is executed, the probability amplitude of $|a_s^{(n)}\rangle$ is updated by carrying out $L = int(k(r + V(s')))$ Grover iterations, where int(x) returns the integer part of x. k is a parameter which indicates that the number L of iterations is proportional to $r + V(s')$. The selection of its value is empirical in this paper and its optimization is an open question. The probability amplitudes will be normalized with $\sum_a |C_a|^2 = 1$ after each update. According to Ref. [46], applying the Grover iteration $U_{Grov}$ L times on $|a_0^{(n)}\rangle$ can be represented as

$$U_{Grov}^{L}|a_0^{(n)}\rangle = \sin[(2L+1)\theta]\,|a\rangle + \cos[(2L+1)\theta]\,|a^{\perp}\rangle \qquad (40)$$

Obviously, we can reinforce the action |a⟩ from probability $1/2^n$ to $\sin^2[(2L+1)\theta]$ through Grover iterations. Since $\sin(2L+1)\theta$ is a periodic function of $(2L+1)\theta$ and too many iterations may also cause a small probability $\sin^2[(2L+1)\theta]$, we further select $L = \min\{int(k(r + V(s'))),\ int(\pi/(4\theta) - 1/2)\}$.

Remark 5: The probability amplitude updating is inspired by the Grover algorithm, and the two procedures use the same amplitude amplification technique as a subroutine. Here we want to emphasize the difference between the probability amplitude updating and Grover's database searching algorithm. The objective of the Grover algorithm is to search for |a⟩ by amplifying its occurrence probability to almost 1; however, the aim of the probability amplitude updating process in QRL is just to appropriately update (amplify or shrink) the corresponding amplitudes for "good" or "bad" eigen actions. So the essential difference is in the number L of iterations, and this can be demonstrated by Fig. 2.

E. QRL algorithm

Based on the above discussion, the procedural form of a standard QRL algorithm is described in Fig. 3. In the QRL algorithm, after initializing the state and action we can observe $|a_s\rangle$ and obtain an eigen action |a⟩. Execute this action and the system gives out the next state |s'⟩, reward r and state value V(s'). V(s) is updated by the TD(0) rule, and r and V(s') can be used to determine the iteration number L.
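The updating step just described can be checked numerically. The sketch below builds the two reflections (31)-(32) as matrices, applies L Grover iterations to the uniform superposition, and confirms the amplitude sin[(2L+1)θ] of Eq. (40). The function names, the choice k = 0.1 and the small epsilon added to the iteration cap are illustrative assumptions.

```python
import numpy as np

def grover_operator(n_qubits: int, a_index: int) -> np.ndarray:
    """U_Grov = U_{a0} U_a built from the reflections (31)-(32)."""
    dim = 2 ** n_qubits
    I = np.eye(dim)
    a = np.zeros(dim)
    a[a_index] = 1.0                               # eigen action |a>
    a0 = np.full(dim, 1.0 / np.sqrt(dim))          # uniform state |a0>, Eq. (28)
    U_a = I - 2.0 * np.outer(a, a)                 # Eq. (31): flips the sign of |a>
    U_a0 = 2.0 * np.outer(a0, a0) - I              # Eq. (32): reflection about |a0>
    return U_a0 @ U_a

def amplitude_update(n_qubits, a_index, r, v_next, k=0.1):
    """L = min(int(k(r+V(s'))), int(pi/(4*theta) - 1/2)) Grover iterations."""
    dim = 2 ** n_qubits
    theta = np.arcsin(1.0 / np.sqrt(dim))          # Eq. (37)
    cap = int(np.pi / (4 * theta) - 0.5 + 1e-9)    # epsilon guards round-down
    L = max(min(int(k * (r + v_next)), cap), 0)
    state = np.full(dim, 1.0 / np.sqrt(dim))
    G = grover_operator(n_qubits, a_index)
    for _ in range(L):
        state = G @ state
    return state, L

state, L = amplitude_update(n_qubits=2, a_index=3, r=100.0, v_next=0.0)
theta = np.arcsin(0.5)
assert np.isclose(state[3], np.sin((2 * L + 1) * theta))   # matches Eq. (40)
```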

Fig. 3. The algorithm of a standard quantum reinforcement learning (QRL).

To accomplish the task on a practical computing device, we require some basic registers for the storage of related information. Firstly, two m-qubit registers are required for all eigen states and their state values V(s), respectively. Secondly, every eigen state requires two n-qubit registers in which its eigen actions are stored twice: one n-qubit register stores the action $|a_s^{(n)}\rangle$ to be observed, and the other n-qubit register stores the same action to prevent the memory loss associated with the action collapse. It is worth mentioning that this does not conflict with the no-cloning theorem [44], since the action $|a_s^{(n)}\rangle$ is a certain known state at each step. Finally, several simple classical registers may be required for the reward r, the iteration number L, etc.

Remark 6: QRL is inspired by the superposition principle of quantum states and quantum parallelism. The action set can be represented with the quantum state and the eigen action can be obtained by randomly observing the simulated quantum state, which will lead to state collapse according to the quantum measurement postulate. The occurrence probability of every eigen action is determined by its corresponding probability amplitude, which is updated according to rewards and value functions. So this approach represents the whole state-action space with the superposition of quantum states and makes a good tradeoff between exploration and exploitation using probability.

Remark 7: The merit of QRL is dual. First, as a simulation algorithm on the traditional computer, it is an effective algorithm with novel representation and computation methods. Second, the representation and computation mode are consistent with quantum parallelism and can speed up learning with quantum computers or quantum gates.

IV. ANALYSIS OF QRL

In this section, we discuss some theoretical properties of QRL algorithms and provide some advice from the point of view of engineering. Four major results are presented: (1) an asymptotic convergence proposition for QRL algorithms, (2) the optimality and stochastic algorithm, (3) good balancing between exploration and exploitation, and (4) physical realization. From the following analysis, it is obvious that QRL shows much better performance than other methods when the searching space becomes very large.

A. Convergence of QRL

In QRL we use the temporal difference (TD) prediction for the state value updating, and the TD algorithm has been proved to converge for absorbing Markov chains [4] when the learning rate is nonnegative and decreasing. To consider the convergence results of QRL generally, we have Proposition 2.

Proposition 2 (Convergence of QRL): For any Markov chain, QRL algorithms converge to the optimal state value function V*(s) with probability 1 under a proper exploration policy when the following conditions hold (where $\alpha_k$ is the learning rate and nonnegative):

$$\lim_{T\to\infty}\sum_{k=1}^{T}\alpha_k = \infty, \qquad \lim_{T\to\infty}\sum_{k=1}^{T}\alpha_k^2 < \infty \qquad (41)$$

Proof (sketch): Based on the above analysis, QRL is a stochastic iterative algorithm. Bertsekas and Tsitsiklis have verified the convergence of stochastic iterative algorithms [3] when (41) holds. In fact, many traditional RL algorithms have been proved to be stochastic iterative algorithms [3], [4], [47], and QRL is the same as traditional RL; the main differences lie in:
(1) The exploration policy is based on the collapse postulate of quantum measurement while being observed;
(2) This kind of algorithm is carried out by quantum parallelism, which means we update all states simultaneously, so QRL is a synchronous learning algorithm.
So the modification of RL does not affect the characteristic of convergence, and the QRL algorithm converges when (41) holds.
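For example, the familiar schedule $\alpha_k = 1/k$ satisfies (41): $\sum_k 1/k$ diverges while $\sum_k 1/k^2 = \pi^2/6 < \infty$; a constant learning rate, by contrast, meets the first condition but violates the second.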
B. Optimality and stochastic algorithm

Most quantum algorithms are stochastic algorithms which can give the correct decision-making with probability 1 − ε (ε > 0, close to 0) after several rounds of repeated computing [23], [25].

As for quantum reinforcement learning algorithms, optimal policies are acquired by the collapse of the quantum system, and we will analyze the optimality of these policies from two aspects as follows.

1) QRL implemented by real quantum apparatuses: When QRL algorithms are implemented by real quantum apparatuses, the agent's strategy is given by the collapse of the corresponding quantum system according to the probability amplitude. QRL algorithms cannot guarantee the optimality of every strategy, but they can give the optimal decision-making with probability approximating 1 by repeating the computation several times. Suppose that the agent gives an optimal strategy with probability 1 − ε after it has learned well (the state value function converges to V*(s)). For ε ∈ (0, 1), the error probability is ε^d after repeating d times. Hence the agent will give the optimal strategy with probability 1 − ε^d by repeating the computation d times. QRL algorithms on real quantum apparatuses are still effective due to the powerful computing capability of quantum systems. Our current work has been focused on simulating QRL algorithms on the traditional computer, which also bear the characteristics inspired by quantum systems.

2) Simulating QRL on the traditional computer: As mentioned above, in this paper most work has been done to develop this kind of novel QRL algorithm by simulation on the traditional computer. But in traditional RL theory, researchers have argued that even if we have a complete and accurate model of the environment's dynamics, it is usually not possible to simply compute an optimal policy by solving the Bellman optimality equation [1]. What is the fact about QRL? In QRL, the optimal value functions and optimal policies are defined in the same way as in traditional RL. The difference lies in the representation and computing mode. The policy is probabilistic instead of definite, using probability amplitudes, which makes it more effective and safer. But it is still obvious that simulating QRL on the traditional computer cannot speed up learning on an exponential scale, since the quantum parallelism is not really executed through real physical systems. What's more, when more powerful computation is available, the agent will learn much better. Then we may fall back on the physical realization of quantum computation again.

C. Balancing between exploration and exploitation

One widely used action selection scheme is ε-greedy [48], [49], where the best action is selected with probability (1 − ε) and a random action is selected with probability ε (ε ∈ (0, 1)). The exploration probability ε can be reduced over time, which moves the agent from exploration to exploitation. The ε-greedy method is simple and effective, but it has one drawback: when it explores, it chooses equally among all actions. This means that it makes no difference whether it chooses the worst action or the next-to-best action. Another problem is that it is difficult to choose a proper parameter ε which can offer an optimal balancing between exploration and exploitation.

Another kind of action selection scheme is Boltzmann exploration (including the Softmax action selection method) [1], [48], [49]. It uses a positive parameter τ called the temperature and chooses actions with probability proportional to $e^{Q(s,a)/\tau}$. It can move from exploration to exploitation by adjusting the "temperature" parameter τ. It is natural to sample actions according to this distribution, but it is very difficult to set and adjust a good parameter τ. There are also similar problems with simulated annealing (SA) methods [50].

We have introduced the action selecting strategy of QRL in Section III, which is called the collapse action selection method. The agent does not bother about selecting a proper action consciously. The action selecting process is just accomplished by the fundamental phenomenon that an action (represented by a quantum superposition state) will naturally collapse to an eigen action when it is measured. In the learning process, the agent can explore more effectively, since the state and action can lie in superposition states through parallel updating. When an action is observed, it will collapse to an eigen action with a certain probability. Hence the QRL algorithm is essentially a kind of probabilistic algorithm. However, it is greatly different from classical probability, since in classical algorithms the many possible results forever exclude each other, but in the QRL algorithm it is possible for many results to interfere with each other to yield some global information through specific quantum gates such as Hadamard gates. Compared with other exploration strategies, this mechanism leads to a better balancing between exploration and exploitation.

In this paper, the simulated results will show that the action selection method using the collapse phenomenon is very extraordinary and effective. More importantly, it is consistent with the physical quantum system, which makes it more natural, and the mechanism of QRL has the potential to be implemented by real quantum systems.
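The three schemes discussed above can be contrasted as sampling rules over the action set. The sketch below places ε-greedy, Boltzmann (softmax) and the collapse-style selection side by side; it is an illustrative comparison only, and the uniform treatment of Q-values versus probability amplitudes is an assumption made for brevity.

```python
import numpy as np

def epsilon_greedy(q_values, eps=0.1, rng=np.random):
    """Pick the best action with prob. 1-eps, otherwise a uniformly random one."""
    if rng.random() < eps:
        return rng.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values, tau=1.0, rng=np.random):
    """Softmax/Boltzmann exploration: P(a) proportional to exp(Q(s,a)/tau)."""
    prefs = np.exp((q_values - np.max(q_values)) / tau)
    return rng.choice(len(q_values), p=prefs / prefs.sum())

def collapse_selection(amplitudes, rng=np.random):
    """QRL-style selection: measuring the action register yields eigen action a
    with probability |C_a|^2; no eps or tau has to be tuned."""
    probs = np.abs(amplitudes) ** 2
    return rng.choice(len(amplitudes), p=probs / probs.sum())

q = np.array([0.0, 1.0, 0.5, -0.2])
amps = np.array([0.1, 0.9, 0.4, 0.1]) / np.linalg.norm([0.1, 0.9, 0.4, 0.1])
print(epsilon_greedy(q), boltzmann(q), collapse_selection(amps))
```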
D. Physical realization

As a quantum algorithm, the physical realization of QRL is also feasible, since the two main operations are preparing the equally weighted superposition state for initializing the quantum system and carrying out a certain number of Grover iterations for updating the probability amplitudes according to rewards and value functions. These are the same operations needed in the Grover algorithm. They can be accomplished using different combinations of Hadamard gates and phase gates. So the physical realization of QRL has no difficulty in principle. Moreover, the experimental implementations of the Grover algorithm also demonstrate the feasibility of the physical realization of our QRL algorithm.

V. EXPERIMENTS

To evaluate the QRL algorithm in practice, consider the typical gridworld example. The gridworld environment is shown in Fig. 4, and each cell of the grid corresponds to an individual state (eigen state) of the environment. From any state the agent can perform one of four primary actions (eigen actions): up, down, left and right, and actions that would lead into a blocked cell are not executed. The task of the algorithms is to find an optimal policy which will let the agent move from start point S to goal point G with minimized cost (or maximized rewards). An episode is defined as one run of the learning process in which the agent moves from the start state to the goal state. When the agent cannot find the goal state within a maximum number of steps (or a period of time), this episode is terminated and another episode is started from the start state again. So when the agent finds an optimal policy through learning, the number of moving steps for one episode will reduce to a minimum.

Fig. 4. The example is a gridworld environment with cell-to-cell actions (up, down, left and right). The labels S and G indicate the initial state and the goal in the simulated experiment described in the text.

A. Experimental set-up

In this 20 × 20 (0∼19) gridworld, the initial state S and the goal state G are cell (1,1) and cell (18,18), respectively, and before learning the agent has no information about the environment at all. Once the agent finds the goal state it receives a reward of r = 100 and then ends this episode. All steps are punished by a reward of r = −1. The discount factor γ is set to 0.99 and all of the state values V(s) are initialized as V = 0 for all the algorithms that we have carried out. In the first experiment, we compare the QRL algorithm with TD(0) and we also demonstrate the expected result on a quantum computer theoretically. In the second experiment, we give some results of the QRL algorithm with different learning rates. For the action selection policy of the TD algorithm, we use the ε-greedy policy (ε = 0.01), that is to say, the agent executes the "good" action with probability 1 − ε and chooses other actions with an equal probability. As for QRL, the action selecting policy is obviously different from traditional RL algorithms and is inspired by the collapse postulate of quantum measurement. The value of $|C_a|^2$ is used to denote the probability of an action defined as $f(s) = |a_s^{(n)}\rangle = \sum_{a=00\cdots0}^{11\cdots1} C_a|a\rangle$. For the four cell-to-cell actions, i.e. the four eigen actions up, down, left and right, $|C_a|^2$ is initialized uniformly.
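To make the set-up concrete, the sketch below wires the simulated QRL components of Section III into a gridworld of this size. It is a classical simulation under several assumptions: the environment has no blocked cells, k = 0.01 is an arbitrary choice (the paper does not fix k), and the amplitude update is a simplified scalar stand-in for the Grover-iteration procedure of Section III-D rather than the exact register-level operation.

```python
import numpy as np

class GridWorld:
    """20x20 gridworld with start (1,1), goal (18,18), reward +100 at the goal
    and -1 per step; assumed here to contain no blocked cells."""
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]              # up, down, left, right

    def __init__(self, size=20, start=(1, 1), goal=(18, 18)):
        self.size, self.start, self.goal = size, start, goal

    def step(self, s, a):
        r, c = divmod(s, self.size)
        dr, dc = self.MOVES[a]
        nr = min(max(r + dr, 0), self.size - 1)
        nc = min(max(c + dc, 0), self.size - 1)
        s_next = nr * self.size + nc
        done = (nr, nc) == self.goal
        return s_next, (100.0 if done else -1.0), done

def run_episode(env, V, amps, alpha=0.06, gamma=0.99, k=0.01, max_steps=5000):
    """One QRL episode: collapse-style action selection, TD(0) update (27) and a
    simplified amplitude amplification of the chosen action in the spirit of (40)."""
    s = env.start[0] * env.size + env.start[1]
    for t in range(max_steps):
        probs = np.abs(amps[s]) ** 2
        a = np.random.choice(4, p=probs / probs.sum())      # action collapse
        s_next, r, done = env.step(s, a)
        V[s] += alpha * (r + gamma * V[s_next] - V[s])      # TD(0), Eq. (27)
        theta = np.arcsin(abs(amps[s][a]))
        L = min(int(k * (r + V[s_next])), int(np.pi / (4 * theta) - 0.5))
        if L > 0:                                           # rotate towards |a>
            new_mag = np.sin((2 * L + 1) * theta)
            scale = np.sqrt(max(1 - new_mag ** 2, 0.0) /
                            max(1 - abs(amps[s][a]) ** 2, 1e-12))
            amps[s] = amps[s] * scale                       # keep normalization
            amps[s][a] = new_mag
        s = s_next
        if done:
            return t + 1
    return max_steps

env = GridWorld()
V = np.zeros(env.size ** 2)
amps = np.full((env.size ** 2, 4), 0.5, dtype=complex)      # uniform |C_a| = 1/2
steps_per_episode = [run_episode(env, V, amps) for _ in range(50)]
```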
B. Experimental results and analysis

Learning performance of the QRL algorithm compared with the TD algorithm of traditional RL is plotted in Fig. 5, where the cases with good performance are chosen for both the QRL and TD algorithms. As shown in Fig. 5, the good cases in this gridworld example are the TD algorithm with a learning rate of α = 0.01 and the QRL algorithm with α = 0.06, respectively. The horizontal axis represents the episode in the learning process, and the number of steps required is described by the vertical coordinate. We observe that the QRL algorithm is also an effective algorithm on the traditional computer, although it is inspired by the quantum mechanical system and is designed for quantum computers in the future. For their respective rather good cases in Fig. 5, QRL explores more than the TD algorithm at the beginning of the learning phase, but it learns much faster and guarantees a better balancing between exploration and exploitation. In addition, it is much easier to tune the parameters for QRL algorithms than for traditional ones. If real quantum parallelism is used, we can obtain the estimated theoretical results. What is more important, according to the estimated theoretical results, QRL has great potential of powerful computation provided that the quantum computer (or related quantum apparatuses) is available in the future, which will lead to a more effective approach for the existing problems of learning in complex unknown environments.

Furthermore, in the following comparison experiments we give the results of the TD(0) rule in QRL and of RL algorithms with different learning rates, respectively. Fig. 6 illustrates the results of QRL algorithms with different learning rates α (alpha), from 0.01 to 0.11; to give a particular description of the learning process, we record every learning episode. From these figures, it can be concluded that given a proper learning rate (0.02 ≤ alpha ≤ 0.10) this algorithm learns fast and explores much at the beginning phase, and then steadily converges to the optimal policy that costs 36 steps to the goal state G. As the learning rate increases from 0.02 to 0.09, this algorithm learns faster. When the learning rate is 0.01 or smaller, it explores more but learns very slowly, so the learning process converges very slowly. Compared with the result of TD in Fig. 5, we find that the simulation result of QRL on the classical computer does not show an advantage when the learning rate is small (alpha ≤ 0.01). On the other hand, when the learning rate is 0.11 or above, it cannot converge to the optimal policy because it oscillates with too large a learning rate when the policy is near the optimal policy. Fig. 7 shows the performance of the TD(0) algorithm, and we can see that the learning process converges with the learning rate of 0.01. But when the learning rate is bigger (alpha = 0.02, 0.03 or larger), it becomes very hard to make it converge to the optimal policy within 10000 episodes. Anyway, from Fig. 6 and Fig. 7, we can see that the convergence range of the QRL algorithm is much larger than that of the traditional TD(0) algorithm.

All the results show that the QRL algorithm is effective and excels traditional RL algorithms in the following three main aspects: (1) The action selecting policy makes a good tradeoff between exploration and exploitation using probability, which speeds up the learning and guarantees the searching over the whole state-action space as well. (2) The representation is based on the superposition principle of quantum mechanics and the updating process is carried out through quantum parallelism, which will be much more prominent in the future when practical quantum apparatus comes into use instead of being simulated on traditional computers. (3) Compared with the experimental results in Ref. [51], where the simulation environment is a 13 × 13 (0∼12) gridworld, we can see that when the state space is getting larger, the performance of QRL is getting better than traditional RL in simulated experiments.

Fig. 5. Performance of QRL in the example of a gridworld environment compared with the TD algorithm (ε-greedy policy) for their respective good cases; the expected theoretical result on a quantum computer is also demonstrated.

Fig. 6. Comparison of QRL algorithms with different learning rates (alpha= 0.01 ∼ 0.11).

Fig. 7. Comparison of TD(0) algorithms with different learning rates (alpha=0.01, 0.02, 0.03).

VI. DISCUSSION

The key contribution of this paper is a novel reinforcement learning framework called quantum reinforcement learning that integrates quantum mechanics characteristics and reinforcement learning theories. In this section some associated problems of QRL on the traditional computer are discussed and some future work regarded as important is also pointed out.

Although it is a long way to implementing such complicated quantum systems as QRL by physical quantum systems, the simulated version of QRL on the traditional computer has been proved effective and also excels standard RL methods in several aspects. To improve this approach, some issues for future work are laid out as follows, which we deem to be important.

• Model of environments: An appropriate model of the environment will make problem-solving much easier and more efficient. This is true for most RL algorithms. However, modeling environments accurately and simply is a tradeoff problem. As for QRL, this problem should be considered slightly differently due to some of its specialities.

• Representations: The representations for the QRL algorithm according to different kinds of problems would naturally be of interest when a learning system is designed. In this paper, we mainly discuss problems with discrete states and actions, and a natural question is how to extend QRL to problems with continuous states and actions effectively.

• Function approximation and generalization: Generalization is necessary for RL systems to be applied to artificial intelligence and most engineering applications. Function approximation is an important approach to acquire generalization. As for QRL, this issue will be a rather challenging task and function approximation should be considered with the special computation mode of QRL.

• Theory: QRL is a new learning framework that is different from standard RL in several aspects, such as representation, action selection, exploration policy, updating style, etc. So there is a lot of theoretical work to do to take most advantage of it, especially to analyze the complexity of the QRL algorithm and improve its representation and computation.

• More applications: Besides more theoretical research, a tremendous opportunity to apply QRL algorithms to a range of problems is needed to testify and improve this kind of learning algorithm, especially in unknown probabilistic environments and large learning spaces.

Anyway, we strongly believe that QRL approaches and related techniques will be promising for agent learning in large-scale unknown environments. This new idea of applying quantum characteristics will also inspire research in the area of machine learning.

VII. CONCLUDING REMARKS

In this paper, QRL is proposed based on the concepts and theories of quantum computation, in the light of the existing problems in RL algorithms such as the tradeoff between exploration and exploitation, low learning speed, etc. Inspired by the state superposition principle, we introduce a framework of a value updating algorithm. The state (action) in traditional RL is looked upon as the eigen state (eigen action) in QRL. The state (action) set can be represented by the quantum superposition state, and the eigen state (eigen action) can be obtained by randomly observing the simulated quantum state according to the collapse postulate of quantum measurement. The probability of an eigen state (eigen action) is determined by the probability amplitude, which is updated according to rewards and value functions. So it makes a good tradeoff between exploration and exploitation and can speed up learning as well. At the same time, this novel idea will promote related theoretical and technical research.

On the theoretical side, it gives us more inspiration to look for new paradigms of machine learning to acquire better performance. It also introduces the latest development of fundamental science, such as physics and mathematics, to the area of artificial intelligence and promotes the development of those subjects as well. Especially, the representation and essence of quantum computation are different from classical computation and many aspects of quantum computation are likely to evolve. Sooner or later, machine learning will also be profoundly influenced by quantum computation theory. We have demonstrated the applicability of quantum computation to machine learning, and more interesting results are expected in the near future.

On the technical side, the results of simulated experiments demonstrate the feasibility of this algorithm and show its superiority for learning problems with huge state spaces in unknown probabilistic environments. With the progress of quantum technology, some fundamental quantum operations are being realized via nuclear magnetic resonance, quantum optics, cavity-QED and ion traps. Since the physical realization of QRL mainly needs Hadamard gates and phase gates, and both of them are relatively easy to implement in quantum computation, our work also presents a new task to implement QRL using practical quantum systems for quantum computation and will simultaneously promote related experimental research [51]. Once QRL becomes realizable on real physical systems, it can be effectively used for quantum robot learning for accomplishing some significant tasks [52], [53].

Quantum computation and machine learning are both the study of information processing tasks. The two research fields have rapidly grown, so that this gives birth to the combining of traditional learning algorithms and quantum computation methods, which will influence representation and learning mechanisms, and many difficult problems could be solved appropriately in a new way. Moreover, this idea also pioneers a new field for quantum computation and artificial intelligence [52], [53], and some efficient applications or hidden advantages of quantum computation can probably be approached from the angle of learning and intelligence.

ACKNOWLEDGMENT

The authors would like to thank two anonymous reviewers, Dr. Bo Qi and the Associate Editor of IEEE Trans. SMCB for constructive comments and suggestions which help clarify several concepts in our original manuscript and have greatly improved this paper. D. Dong also wishes to thank Prof. Lei Guo and Dr. Zairong Xi for helpful discussions.

REFERENCES

[1] R. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[2] L. P. Kaelbling, M. L. Littman and A. W. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, vol.4, pp.237-287, 1996.
[3] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.
[4] R. Sutton, "Learning to predict by the methods of temporal difference," Machine Learning, vol.3, pp.9-44, 1988.
[5] C. Watkins and P. Dayan, "Q-learning," Machine Learning, vol.8, pp.279-292, 1992.
[6] D. Vengerov, N. Bambos and H. Berenji, "A fuzzy reinforcement learning approach to power control in wireless transmitters," IEEE Transactions on Systems Man and Cybernetics B, vol.35, pp.768-778, 2005.
[7] H. R. Beom and H. S. Cho, "A sensor-based navigation for a mobile robot using fuzzy logic and reinforcement learning," IEEE Transactions on Systems Man and Cybernetics, vol.25, pp.464-477, 1995.
[8] C. I. Connolly, "Harmonic functions and collision probabilities," International Journal of Robotics Research, vol.16, no.4, pp.497-507, 1997.
[9] W. D. Smart and L. P. Kaelbling, "Effective reinforcement learning for mobile robots," in Proceedings of the IEEE International Conference on Robotics and Automation, pp.3404-3410, 2002.
[10] T. Kondo and K. Ito, "A reinforcement learning with evolutionary state recruitment strategy for autonomous mobile robots control," Robotics and Autonomous Systems, vol.46, pp.111-124, 2004.
[11] M. Wiering and J. Schmidhuber, "HQ-Learning," Adaptive Behavior, vol.6, pp.219-246, 1997.
[12] A. G. Barto and S. Mahanevan, "Recent advances in hierarchical reinforcement learning," Discrete Event Dynamic Systems: Theory and Applications, vol.13, pp.41-77, 2003.
[13] R. Sutton, D. Precup and S. Singh, "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning," Artificial Intelligence, vol.112, pp.181-211, 1999.
[14] T. G. Dietterich, "Hierarchical reinforcement learning with the MAXQ value function decomposition," Journal of Artificial Intelligence Research, vol.13, pp.227-303, 2000.
[15] G. Theocharous, Hierarchical Learning and Planning in Partially Observable Markov Decision Processes. Doctoral thesis, Michigan State University, USA, 2002.
[16] A. J. Smith, "Applications of the self-organising map to reinforcement learning," Neural Networks, vol.15, pp.1107-1124, 2002.
[17] P. Y. Glorennec and L. Jouffe, "Fuzzy Q-learning," in Proceedings of the Sixth IEEE International Conference on Fuzzy Systems, pp.659-662, IEEE Press, 1997.

ACKNOWLEDGMENT

The authors would like to thank two anonymous reviewers, Dr. Bo Qi and the Associate Editor of IEEE Trans. SMCB for constructive comments and suggestions which helped clarify several concepts in our original manuscript and have greatly improved this paper. D. Dong also wishes to thank Prof. Lei Guo and Dr. Zairong Xi for helpful discussions.

REFERENCES

[1] R. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[2] L. P. Kaelbling, M. L. Littman and A. W. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, vol.4, pp.237-287, 1996.
[3] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.
[4] R. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol.3, pp.9-44, 1988.
[5] C. Watkins and P. Dayan, "Q-learning," Machine Learning, vol.8, pp.279-292, 1992.
[6] D. Vengerov, N. Bambos and H. Berenji, "A fuzzy reinforcement learning approach to power control in wireless transmitters," IEEE Transactions on Systems, Man, and Cybernetics B, vol.35, pp.768-778, 2005.
[7] H. R. Beom and H. S. Cho, "A sensor-based navigation for a mobile robot using fuzzy logic and reinforcement learning," IEEE Transactions on Systems, Man, and Cybernetics, vol.25, pp.464-477, 1995.
[8] C. I. Connolly, "Harmonic functions and collision probabilities," International Journal of Robotics Research, vol.16, no.4, pp.497-507, 1997.
[9] W. D. Smart and L. P. Kaelbling, "Effective reinforcement learning for mobile robots," in Proceedings of the IEEE International Conference on Robotics and Automation, pp.3404-3410, 2002.
[10] T. Kondo and K. Ito, "A reinforcement learning with evolutionary state recruitment strategy for autonomous mobile robots control," Robotics and Autonomous Systems, vol.46, pp.111-124, 2004.
[11] M. Wiering and J. Schmidhuber, "HQ-Learning," Adaptive Behavior, vol.6, pp.219-246, 1997.
[12] A. G. Barto and S. Mahadevan, "Recent advances in hierarchical reinforcement learning," Discrete Event Dynamic Systems: Theory and Applications, vol.13, pp.41-77, 2003.
[13] R. Sutton, D. Precup and S. Singh, "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning," Artificial Intelligence, vol.112, pp.181-211, 1999.
[14] T. G. Dietterich, "Hierarchical reinforcement learning with the MAXQ value function decomposition," Journal of Artificial Intelligence Research, vol.13, pp.227-303, 2000.
[15] G. Theocharous, Hierarchical Learning and Planning in Partially Observable Markov Decision Processes. Doctoral thesis, Michigan State University, USA, 2002.
[16] A. J. Smith, "Applications of the self-organising map to reinforcement learning," Neural Networks, vol.15, pp.1107-1124, 2002.
[17] P. Y. Glorennec and L. Jouffe, "Fuzzy Q-learning," in Proceedings of the Sixth IEEE International Conference on Fuzzy Systems, pp.659-662, IEEE Press, 1997.
[18] S. G. Tzafestas and G. G. Rigatos, "Fuzzy Reinforcement Learning Control for Compliance Tasks of Robotic Manipulators," IEEE Transactions on Systems, Man, and Cybernetics B, vol.32, no.1, pp.107-113, 2002.
[19] M. J. Er and C. Deng, "Online Tuning of Fuzzy Inference Systems Using Dynamic Fuzzy Q-Learning," IEEE Transactions on Systems, Man, and Cybernetics B, vol.34, no.3, pp.1478-1489, 2004.
[20] C. Chen, H. Li and D. Dong, "Hybrid control for autonomous mobile robot navigation using hierarchical Q-learning," IEEE Robotics and Automation Magazine, vol.15, no.2, pp.37-47, 2008.
[21] S. Whiteson and P. Stone, "Evolutionary function approximation for reinforcement learning," Journal of Machine Learning Research, vol.7, pp.877-917, 2006.
[22] M. Kaya and R. Alhajj, "A novel approach to multiagent reinforcement learning: Utilizing OLAP mining in the learning process," IEEE Transactions on Systems, Man, and Cybernetics C, vol.35, pp.582-590, 2005.
[23] P. W. Shor, "Algorithms for quantum computation: discrete logarithms and factoring," in Proceedings of the 35th Annual Symposium on Foundations of Computer Science, pp.124-134, IEEE Press, Los Alamitos, CA, 1994.
[24] A. Ekert and R. Jozsa, "Quantum computation and Shor's factoring algorithm," Reviews of Modern Physics, vol.68, pp.733-753, 1996.
[25] L. K. Grover, "A fast quantum mechanical algorithm for database search," in Proceedings of the 28th Annual ACM Symposium on the Theory of Computation, pp.212-219, ACM Press, New York, 1996.
[26] L. K. Grover, "Quantum mechanics helps in searching for a needle in a haystack," Physical Review Letters, vol.79, pp.325-327, 1997.
[27] L. M. K. Vandersypen, M. Steffen, G. Breyta, C. S. Yannoni, M. H. Sherwood and I. L. Chuang, "Experimental realization of Shor's quantum factoring algorithm using nuclear magnetic resonance," Nature, vol.414, pp.883-887, 2001.
[28] I. L. Chuang, N. Gershenfeld and M. Kubinec, "Experimental implementation of fast quantum searching," Physical Review Letters, vol.80, pp.3408-3411, 1998.
[29] J. A. Jones, "Fast searches with nuclear magnetic resonance computers," Science, vol.280, pp.229-229, 1998.
[30] J. A. Jones, M. Mosca and R. H. Hansen, "Implementation of a quantum search algorithm on a quantum computer," Nature, vol.393, pp.344-346, 1998.
[31] P. G. Kwiat, J. R. Mitchell, P. D. D. Schwindt and A. G. White, "Grover's search algorithm: an optical approach," Journal of Modern Optics, vol.47, pp.257-266, 2000.
[32] M. O. Scully and M. S. Zubairy, "Quantum optical implementation of Grover's algorithm," Proceedings of the National Academy of Sciences of the United States of America, vol.98, pp.9490-9493, 2001.
[33] D. Ventura and T. Martinez, "Quantum associative memory," Information Sciences, vol.124, pp.273-296, 2000.
[34] A. Narayanan and T. Menneer, "Quantum artificial neural network architectures and components," Information Sciences, vol.128, pp.231-255, 2000.
[35] S. Kak, "On quantum neural computing," Information Sciences, vol.83, pp.143-160, 1995.
[36] N. Kouda, N. Matsui, H. Nishimura and F. Peper, "Qubit neural network and its learning efficiency," Neural Computing and Applications, vol.14, pp.114-121, 2005.
[37] E. C. Behrman, L. R. Nash, J. E. Steck, V. G. Chandrashekar and S. R. Skinner, "Simulations of quantum neural networks," Information Sciences, vol.128, pp.257-269, 2000.
[38] G. G. Rigatos and S. G. Tzafestas, "Parallelization of a fuzzy control algorithm using quantum computation," IEEE Transactions on Fuzzy Systems, vol.10, no.4, pp.451-460, 2002.
[39] M. Sahin, U. Atav and M. Tomak, "Quantum genetic algorithm method in self-consistent electronic structure calculations of a quantum dot with many electrons," International Journal of Modern Physics C, vol.16, no.9, pp.1379-1393, 2005.
[40] T. Hogg and D. Portnov, "Quantum optimization," Information Sciences, vol.128, pp.181-197, 2000.
[41] S. Naguleswaran and L. B. White, "Quantum search in stochastic planning," Proceedings of SPIE, vol.5846, pp.34-45, 2005.
[42] D. Dong, C. Chen and Z. Chen, "Quantum reinforcement learning," in Proceedings of First International Conference on Natural Computation, Lecture Notes in Computer Science, vol.3611, pp.686-689, 2005.
[43] J. Preskill, Physics 229: Advanced Mathematical Methods of Physics - Quantum Information and Computation. California Institute of Technology, 1998. Available electronically via http://www.theory.caltech.edu/people/preskill/ph229/
[44] M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information. Cambridge, England: Cambridge University Press, 2000.
[45] C. Chen, D. Dong and Z. Chen, "Quantum computation for action selection using reinforcement learning," International Journal of Quantum Information, vol.4, no.6, pp.1071-1083, 2006.
[46] M. Boyer, G. Brassard and P. Høyer, "Tight bounds on quantum searching," Fortschritte der Physik - Progress of Physics, vol.46, pp.493-506, 1998.
[47] E. Even-Dar and Y. Mansour, "Learning rates for Q-learning," Journal of Machine Learning Research, vol.5, pp.1-25, 2003.
[48] T. S. Dahl, M. J. Mataric and G. S. Sukhatme, "Emergent Robot Differentiation for Distributed Multi-Robot Task Allocation," in Proceedings of the 7th International Symposium on Distributed Autonomous Robotic Systems, pp.191-200, 2004.
[49] J. Vermorel and M. Mohri, "Multi-armed Bandit Algorithms and Empirical Evaluation," Lecture Notes in Artificial Intelligence, vol.3720, pp.437-448, 2005.
[50] M. Guo, Y. Liu and J. Malec, "A New Q-Learning Algorithm Based on the Metropolis Criterion," IEEE Transactions on Systems, Man, and Cybernetics B, vol.34, no.5, pp.2140-2143, 2004.
[51] D. Dong, C. Chen, Z. Chen and C. Zhang, "Quantum mechanics helps in learning for more intelligent robots," Chinese Physics Letters, vol.23, pp.1691-1694, 2006.
[52] D. Dong, C. Chen, C. Zhang and Z. Chen, "Quantum robot: structure, algorithms and applications," Robotica, vol.24, pp.513-521, 2006.
[53] P. Benioff, "Quantum Robots and Environments," Physical Review A, vol.58, pp.893-904, 1998.