Learning CPG Sensory Feedback with Policy Gradient for Biped Locomotion for a Full-body Humanoid

Gen Endo*†, Jun Morimoto†‡, Takamitsu Matsubara†§, Jun Nakanishi†‡ and Gordon Cheng†‡
* Intelligence Dynamics Laboratories, Inc., 3-14-13 Higashigotanda, Shinagawa-ku, Tokyo, 141-0022, Japan
† ATR Computational Neuroscience Laboratories, 2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0288, Japan
‡ Computational Brain Project, ICORP, Japan Science and Technology Agency, 2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0288, Japan
§ Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma-shi, Nara, 630-0192, Japan
[email protected], {xmorimo, takam-m, jun, gordon}@atr.jp

Abstract

This paper describes a learning framework for a central pattern generator (CPG) based biped locomotion controller using a policy gradient method. Our goals in this study are to achieve biped walking with a 3D hardware humanoid, and to develop an efficient learning algorithm with the CPG by reducing the dimensionality of the state space used for learning. We demonstrate in numerical simulations that an appropriate feedback controller can be acquired within a thousand trials, and that the controller obtained in simulation achieves stable walking with a physical robot in the real world. Walking velocity and stability are evaluated both in numerical simulations and in hardware experiments. Furthermore, we present the possibility of additional online learning using the hardware robot to improve the controller within 200 iterations.

Figure 1: QRIO (SDR-4X II)

Introduction

Humanoid research and development has made remarkable progress during the past 10 years. Most existing humanoids utilize a pre-planned nominal trajectory designed for a known environment. Despite our best efforts, we cannot consider every possible situation in advance when designing a controller. Thus, the capability to learn, acquire, or improve a walking pattern is essential for a broad range of humanoid applications in unknown environments.

Our goals in this paper are to acquire a successful walking pattern through learning and to achieve walking with a hardware 3D full-body humanoid robot (Fig. 1). While many attempts have been made to investigate learning algorithms for simulated biped walking, there are only a few successful implementations on real hardware, for example (Benbrahim & Franklin 1997; Tedrake, Zhang, & Seung 2004). To the best of our knowledge, Tedrake et al. (Tedrake, Zhang, & Seung 2004) is the only example of an implementation of a learning algorithm on a 3D hardware robot. They implemented a learning algorithm on a simple physical 3D biped robot possessing basic properties of passive dynamic walking, and successfully obtained an appropriate feedback controller for the ankle roll joints via online learning. With the help of their specific mechanical design, which embeds an intrinsic walking pattern through the passive dynamics, the state space for learning was drastically reduced from 18 to 2 in spite of the complexity of the 3D biped model, which usually suffers from dimensionality explosion. From a learning point of view, this dimensionality reduction is an important issue in practice. However, developing specific humanoid hardware with a single function, for example walking, may sacrifice an important feature of humanoid robots, namely their versatility and capability of achieving various tasks.

Therefore, instead of gait implementation by mechanical design, we introduce the idea of using a Central Pattern Generator (CPG), which has been hypothesized to exist in the central nervous system of animals (Cohen 2003). It has been demonstrated in numerical simulations that a CPG can generate a robust biped walking pattern with appropriate feedback signals by using its entrainment property, even in an unpredictable environment (Taga 1995). However, designing appropriate feedback pathways for neural oscillators often requires much effort to manually tune the parameters of the oscillator. Thus, a genetic algorithm (Hase & Yamazaki 1998) and reinforcement learning (Mori et al. 2004) have been applied to optimize the open parameters of the CPG for biped locomotion. However, these methods often require a large number of iterations to obtain a solution due to the large dimensionality of the state space used for optimization. In this paper, our primary goals are to achieve biped walking through learning with a 3D full-body humanoid hardware that is not designed for a specific walking purpose, and to develop an efficient learning algorithm that can be implemented on the hardware robot for additional online learning to improve the controller.

In a physical robot, we cannot accurately observe all states of the system due to the limited number of equipped sensors and measurement noise. Thus, we find it natural to pose the learning problem as a partially observable Markov decision problem (POMDP). In the proposed learning system, we use a policy gradient method, which can be applied to POMDPs (Kimura & Kobayashi 1998). In a POMDP, it is generally known that a large number of iterations is required for learning compared with learning in an MDP, because the lack of information yields a large variance in the estimated gradient of the expected reward with respect to the policy parameters (Sutton et al. 2000; Konda & Tsitsiklis 2003). However, in the proposed framework, when the CPG and the mechanical system of the robot converge to a periodic trajectory due to the entrainment property, the internal states of the CPG and the states of the robot become synchronized. Thus, by using a state space composed only of a reduced number of observable states, efficient learning of steady periodic biped locomotion is possible even in the POMDP.

CPG Control Architecture

In our initial work, we explored CPG-based control with the policy gradient method for a planar biped model (Matsubara et al. 2005), suggesting that the policy gradient method is a promising way to acquire biped locomotion within a reasonable number of iterations. In this section, we extend our previous work to a 3D full-body humanoid robot, QRIO (Ishida, Kuroki, & Yamaguchi 2003; Ishida & Kuroki 2004).

Neural Oscillator Model

We use the neural oscillator model proposed by Matsuoka (Matsuoka 1985), which is widely used as a CPG in robotic applications (Kimura, Fukuoka, & Cohen 2003; Williamson 1998):

\tau_{CPG}\,\dot{z}_j = -z_j - \sum_{i=1}^{6} w_{ji}\, q_i - \gamma z'_j + c + a_j, \qquad (1)
\tau'_{CPG}\,\dot{z}'_j = -z'_j + q_j, \qquad (2)
q_j = \max(0, z_j), \qquad (3)

where j is the index of the neurons and \tau_{CPG}, \tau'_{CPG} are time constants for the internal states z_j and z'_j. c is a bias and \gamma is an adaptation constant. w_{ji} is the inhibitory synaptic weight from the i-th neuron to the j-th neuron. q_j is the output of each neural unit and a_j is a feedback signal, which will be defined in the following section.

CPG Arrangement

In many previous neural-oscillator-based locomotion studies, an oscillator is allocated at each joint and its output is used as a joint torque command to the robot (Taga 1995). However, it is difficult to obtain appropriate feedback pathways for all the oscillators to achieve the desired behavior as the number of degrees of freedom of the robot increases, because neural oscillators are intrinsically nonlinear. Moreover, precise torque control of each joint is also difficult to realize on a hardware robot in practice. In this paper, to simplify the problem, we propose a new oscillator arrangement with respect to the position of the tip of the leg in the Cartesian coordinate system, which can reasonably be considered as the task coordinates for walking. We allocate only 6 neural units, exploiting the symmetry of the walking pattern between the legs. We decompose the overall walking motion into a stepping motion in place produced in the frontal plane and a propulsive motion generated in the sagittal plane.

Fig. 2 illustrates the proposed neural arrangement for the stepping motion in place in the frontal plane. We employ two neurons to form a coupled oscillator connected by mutual inhibition (w_12 = w_21 = 2.0) and allocate it to control the positions of both legs p_z^l, p_z^r along the Z (vertical) direction in a symmetrical manner with a \pi rad phase difference:

p_z^l = Z_0 - A_z\,(q_1 - q_2), \qquad (4)
p_z^r = Z_0 + A_z\,(q_1 - q_2), \qquad (5)

where Z_0 is a position offset and A_z is the amplitude scaling factor.

Figure 2: Neural oscillator allocation and biologically inspired feedback pathways for a stepping motion in place
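Before turning to the propulsive motion, the following minimal sketch illustrates how the Matsuoka dynamics of eqs. (1)-(3) and the vertical leg mapping of eqs. (4)-(5) could be integrated numerically. The time constants and bias (tau=0.105, tau'=0.132, c=2.08, gamma=2.5) are the values reported later in the Velocity Control section; the Euler scheme, the step size dt, the offset Z0 and the initial states are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def matsuoka_step(z, zp, W, a, dt, tau=0.105, tau_p=0.132, c=2.08, gamma=2.5):
    """One Euler integration step of the Matsuoka network, eqs. (1)-(3).
    z, zp : internal states z_j, z'_j, shape (6,)
    W     : inhibitory weights, W[j, i] = w_ji
    a     : feedback signals a_j, shape (6,)
    """
    q = np.maximum(0.0, z)                           # eq. (3)
    z_dot = (-z - W @ q - gamma * zp + c + a) / tau  # eq. (1)
    zp_dot = (-zp + q) / tau_p                       # eq. (2)
    return z + dt * z_dot, zp + dt * zp_dot, q

# Mutual inhibition of the stepping pair: w_12 = w_21 = 2.0.
W = np.zeros((6, 6))
W[0, 1] = W[1, 0] = 2.0
# The quad-element couplings (w_34, w_43, w_56, w_65, w_35, w_63, w_46, w_54)
# of the propulsive oscillator are filled in the same way (see below).

Z0, Az, dt = 0.28, 0.005, 0.002                      # Z0 and dt are assumed values
z, zp = 0.01 * np.random.randn(6), np.zeros(6)       # small random initial states

for _ in range(2000):
    z, zp, q = matsuoka_step(z, zp, W, a=np.zeros(6), dt=dt)
    pz_left = Z0 - Az * (q[0] - q[1])                # eq. (4)
    pz_right = Z0 + Az * (q[0] - q[1])               # eq. (5)
```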

For the propulsive motion in the sagittal plane, we introduce a quad-element neural oscillator to produce leg movements coordinated with the stepping motion, based on the following intuition: as illustrated in Fig. 3, when the robot is walking forward, the leg trajectory with respect to the body coordinates in the sagittal plane can be roughly approximated by an ellipse. Suppose the output trajectories of the oscillators can be approximated as p_x^l = A_x cos(\omega t + \alpha_x) and p_z^l = A_z cos(\omega t + \alpha_z), respectively. Then, to form the elliptical trajectory on the X-Z plane, p_x^l and p_z^l need to satisfy the relationship p_x^l = A_x cos(\phi) and p_z^l = A_z sin(\phi), where \phi is the angle defined in Fig. 3. Thus, the desired phase difference between the vertical and horizontal oscillations should be \alpha_x - \alpha_z = \pi/2. To embed this phase difference as an intrinsic property, we install a quad-element neural oscillator with uni-directional circular inhibitions (w_34 = w_43 = w_56 = w_65 = 2.0, w_35 = w_63 = w_46 = w_54 = 0.5). It generates an inherent phase difference of \pi/2 between the two coupled oscillators, (q_3 - q_4) and (q_5 - q_6) (Matsuoka 1985). Therefore, if (q_3 - q_4) is entrained to the vertical leg movements, then an appropriate horizontal oscillation with the desired phase difference is achieved by (q_5 - q_6).

Similar to the Z direction, the neural output (q_5 - q_6) is allocated to control the positions of both legs p_x^l, p_x^r along the X (forward) direction in the sagittal plane:

p_x^l = X_0 - A_x\,(q_5 - q_6), \qquad (6)
p_x^r = X_0 + A_x\,(q_5 - q_6), \qquad (7)

where X_0 is an offset and A_x is the amplitude scaling factor. This framework provides a basic walking pattern which can be modulated by the feedback signals a_j.

Figure 3: A quad-element neural oscillator for a propulsive motion in the sagittal plane

Sensory Feedback

Under the assumption of symmetric leg movements, the feedback signals to the oscillator in eqn. (1) can be expressed as

a_{2m} = -a_{2m-1}, \quad (m = 1, 2, 3). \qquad (8)

As a first step in this study, we focus on acquiring a feedback controller (a_5) for the propulsive leg movement in the X direction. Thus, we explicitly design the sensory feedback pathways for the stepping motion in place (a_1), motivated by biological observations (Fig. 2). We introduce the Extensor Response and the Vestibulo-spinal Reflex (Kimura, Fukuoka, & Cohen 2003). Both sensory feedback pathways generate a recovery motion by adjusting the leg length according to the vertical reaction force or the inclination of the body:

a_1 = h_{ER}\,(f_z^r - f_z^l)/mg + h_{VSR}\,\theta_{roll}, \qquad (9)

where (f_z^r - f_z^l) is the right/left vertical reaction force difference normalized by the total body weight mg, and h_{ER}, h_{VSR} are scaling factors. The robustness of the stepping motion against perturbations was experimentally discussed in (Endo et al. 2005), and we chose h_{ER} = 0.4, h_{VSR} = 1.8.

The same feedback signal is fed back to the quad-element neural oscillator for the propulsive motion, a_3 = a_1, to induce a cooperative leg movement with a \pi/2 phase difference between the Z and X directions (Fig. 3).

For the propulsive leg movements in the X direction, the feedback signal a_j is represented by a policy gradient method:

a_j(t) = a_j^{\max}\, g(v_j(t)), \qquad (10)

where g(x) = \frac{2}{\pi} \arctan\!\left(\frac{\pi}{2} x\right), and v_j is sampled from a stochastic policy defined by a probability distribution \pi_{\mathbf{w}}(\mathbf{x}, v_j) = P(v_j \mid \mathbf{x}; \mathbf{w}^\mu, \mathbf{w}^\sigma):

\pi_{\mathbf{w}}(\mathbf{x}, v_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j(\mathbf{w}^\sigma)} \exp\!\left( -\frac{(v_j - \mu_j(\mathbf{x}; \mathbf{w}^\mu))^2}{2\sigma_j^2(\mathbf{w}^\sigma)} \right), \qquad (11)

where \mathbf{x} denotes the state of the robot, and \mathbf{w}^\mu, \mathbf{w}^\sigma are parameter vectors of the policy. We can equivalently represent v_j by

v_j(t) = \mu_j(\mathbf{x}(t); \mathbf{w}^\mu) + \sigma_j(\mathbf{w}^\sigma)\, n_j(t), \qquad (12)

where n_j(t) \sim N(0, 1), the normal distribution with mean 0 and variance 1. In the next section, we discuss the learning algorithm used to acquire the feedback controller a_{j=5} for the propulsive motion.

Learning the sensory feedback controller

Previous studies have demonstrated promising results in applying a policy gradient method to POMDPs for biped walking tasks (Tedrake, Zhang, & Seung 2004; Matsubara et al. 2005). We use it for the acquisition of a policy for the sensory feedback controller of the neural oscillator model.

To describe the representative motion of the robot, we consider the states of the pelvis of the robot, where the gyro sensors are located and which also roughly approximates the location of the center of mass (COM) of the system. We chose the pelvis angular velocities \mathbf{x} = (\dot{\theta}_{roll}, \dot{\theta}_{pitch})^T as the input states to the learning algorithm. Therefore, the rest of the states of the pelvis, such as its inclination with respect to the world coordinates, as well as the internal variables of the neural oscillator model in equations (1) and (2), are hidden variables for the learning algorithm. We show that we can acquire a policy for the sensory feedback a_j(\mathbf{x}) in (10) by using a policy gradient method even with a considerable number of hidden variables.

In the following section, we introduce the definition of the value function in continuous time and space to derive the temporal difference error (TD error) (Doya 2000). Then we discuss the learning method for the policy of the sensory feedback controller.
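The following is a minimal sketch of how the stochastic feedback signal of eqs. (10)-(12) could be sampled. The linear mean used here is a placeholder for the NGnet of eq. (25) defined later, and the value of a_max as well as the example state values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    """Squashing function of eq. (10): g(x) = (2/pi) * arctan((pi/2) * x)."""
    return (2.0 / np.pi) * np.arctan(0.5 * np.pi * x)

def feedback_action(x, w_mu, w_sigma, a_max=1.0):
    """Sample the propulsive feedback signal a_5, eqs. (10)-(12).

    x       : observed state (roll rate, pitch rate)
    w_mu    : parameters of the mean (a simple linear mean standing in
              for the NGnet of eq. (25))
    w_sigma : scalar parameter of the sigmoidal exploration variance, eq. (26)
    a_max   : output scale a_j^max (assumed value)
    """
    mu = float(w_mu @ x)                        # mean mu_j(x; w^mu)
    sigma = 1.0 / (1.0 + np.exp(-w_sigma))      # sigma_j(w^sigma), eq. (26)
    v = mu + sigma * rng.standard_normal()      # eq. (12), n_j ~ N(0, 1)
    return a_max * g(v), v, mu, sigma

# Example: pelvis angular velocities in rad/s (illustrative values)
x = np.array([0.3, -0.1])
a5, v, mu, sigma = feedback_action(x, w_mu=np.array([0.5, 0.0]), w_sigma=0.0)
```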

Learning the value function

Consider the continuous-time, continuous-state system

\frac{d\mathbf{x}(t)}{dt} = f(\mathbf{x}(t), \mathbf{u}(t)), \qquad (13)

where \mathbf{x} \in X \subset \mathbb{R}^n is the state and \mathbf{u} \in U \subset \mathbb{R}^m is the control input. We denote the immediate reward for the state and action as

r(t) = r(\mathbf{x}(t), \mathbf{u}(t)). \qquad (14)

The value function of the state \mathbf{x}(t) under a policy \pi(\mathbf{u}(t) \mid \mathbf{x}(t)) is defined as

V^\pi(\mathbf{x}(t)) = E\!\left[ \int_t^{\infty} e^{-\frac{s-t}{\tau}}\, r(\mathbf{x}(s), \mathbf{u}(s))\, ds \,\middle|\, \pi \right], \qquad (15)

where \tau is a time constant for discounting future rewards. A consistency condition for the value function is obtained by differentiating the definition (15) with respect to t:

\frac{dV^\pi(\mathbf{x}(t))}{dt} = \frac{1}{\tau} V^\pi(\mathbf{x}(t)) - r(t). \qquad (16)

We denote the current estimate of the value function as V(\mathbf{x}(t)) = V(\mathbf{x}(t); \mathbf{w}^c), where \mathbf{w}^c is the parameter of the function approximator. If the current estimate of the value function is perfect, it satisfies the consistency condition (16). If this condition is not satisfied, the prediction should be adjusted to decrease the inconsistency

\delta(t) = r(t) - \frac{1}{\tau} V(t) + \dot{V}(t). \qquad (17)

This is the continuous-time counterpart of the TD error (Doya 2000). The update laws for the value function parameters w_i^c and their eligibility traces e_i^c are defined respectively as

\dot{e}_i^c(t) = -\frac{1}{\kappa^c}\, e_i^c(t) + \frac{\partial V_{\mathbf{w}^c}}{\partial w_i^c}, \qquad (18)
\dot{w}_i^c(t) = \alpha\, \delta(t)\, e_i^c(t), \qquad (19)

where \alpha is the learning rate and \kappa^c is the time constant of the eligibility trace. In this study, we select the learning parameters as \tau = 1.0, \alpha = 78, \kappa^c = 0.5.

Learning a policy of the sensory feedback controller

We can estimate the gradient of the expected total reward V(t) with respect to the policy parameters \mathbf{w}^a:

\frac{\partial}{\partial w_i^a} E\{ V(t) \mid \pi_{\mathbf{w}^a} \} = E\{ \delta(t)\, e_i^a(t) \}, \qquad (20)

where w_i^a is a parameter of the policy \pi_{\mathbf{w}} and e_i^a(t) is the eligibility trace of the parameter w_i^a. The update laws for the policy parameter w_i^a and the eligibility trace e_i^a(t) are derived respectively as

\dot{e}_i^a(t) = -\frac{1}{\kappa^a}\, e_i^a(t) + \frac{\partial \ln \pi_{\mathbf{w}}}{\partial w_i^a}, \qquad (21)
\dot{w}_i^a(t) = \beta\, \delta(t)\, e_i^a(t), \qquad (22)

where \beta is the learning rate and \kappa^a is the time constant of the eligibility trace. Equation (20) implies that by using \delta(t) and e_i^a(t), we can calculate an unbiased estimate of the gradient of the value function with respect to the parameter w_i^a. We set the learning parameters as \beta^\mu = \beta^\sigma = 195, \kappa^\mu = 1.0, \kappa^\sigma = 0.1.

Numerical Simulation Setup

Function Approximator for the Value Function and the Policy

We use a normalized Gaussian network (NGnet) (Doya 2000) to model the value function and the mean of the policy. The variance of the policy is modelled by a sigmoidal function (Kimura & Kobayashi 1998; Peters, Vijayakumar, & Schaal 2003). The value function is represented by the NGnet

V(\mathbf{x}; \mathbf{w}^c) = \sum_{k=1}^{K} w_k^c\, b_k(\mathbf{x}), \qquad (23)

where

b_k(\mathbf{x}) = \frac{\phi_k(\mathbf{x})}{\sum_{l=1}^{K} \phi_l(\mathbf{x})}, \qquad \phi_k(\mathbf{x}) = e^{-\|\mathbf{s}_k^T(\mathbf{x} - \mathbf{c}_k)\|^2}, \qquad (24)

and K is the number of basis functions. The vectors \mathbf{c}_k and \mathbf{s}_k characterize the center and the size of the k-th basis function, respectively. The mean \mu and the variance \sigma of the policy are represented by the NGnet and the sigmoidal function

\mu_j = \sum_{i=1}^{K} w_{ij}^{\mu}\, b_i(\mathbf{x}), \qquad (25)

and

\sigma_j = \frac{1}{1 + \exp(-w_j^{\sigma})}, \qquad (26)

respectively. We locate the basis functions \phi_k at even intervals in each dimension of the input space (-2.0 \le \dot{\theta}_{roll}, \dot{\theta}_{pitch} \le 2.0). We used 225 (= 15 \times 15) basis functions for approximating the value function and the policy, respectively.

Rewards

We used the reward function

r(\mathbf{x}) = k_H (h_1 - \hat{h}) + k_S\, v_x, \qquad (27)

where h_1 is the pelvis height of the robot, \hat{h} is a threshold parameter for h_1, and v_x is the forward velocity with respect to the ground. The reward function is designed to keep the height of the pelvis through the first term and, at the same time, to achieve forward progress through the second term. In this study, the parameters are chosen as k_S = 0.0-10.0, k_H = 10.0, \hat{h} = 0.272. The robot receives a punishment (negative reward) r = -1 if it falls over.
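The following sketch puts the pieces above together: the NGnet features of eqs. (23)-(26) and the actor-critic updates with eligibility traces of eqs. (17)-(22). It is only a rough discrete-time reading of the continuous-time formulation; the Euler discretization, the step size dt, the finite-difference estimate of V-dot, the diagonal basis widths, and the restriction to the mean weights are all assumptions, not the authors' implementation.

```python
import numpy as np

class NGnetActorCritic:
    """Discrete-time sketch of the continuous-time actor-critic of eqs. (17)-(22)
    using the NGnet features of eqs. (23)-(26)."""

    def __init__(self, centers, widths, dt=0.01, tau=1.0,
                 alpha=78.0, kappa_c=0.5, beta=195.0, kappa_a=1.0):
        self.c, self.s, self.dt = centers, widths, dt
        self.tau, self.alpha, self.kappa_c = tau, alpha, kappa_c
        self.beta, self.kappa_a = beta, kappa_a
        K = centers.shape[0]
        self.w_c = np.zeros(K)     # critic weights, eq. (23)
        self.w_mu = np.zeros(K)    # actor mean weights, eq. (25)
        self.e_c = np.zeros(K)     # critic eligibility trace, eq. (18)
        self.e_mu = np.zeros(K)    # actor eligibility trace, eq. (21)
        self.V_prev = 0.0

    def basis(self, x):
        phi = np.exp(-np.sum((self.s * (x - self.c)) ** 2, axis=1))  # eq. (24)
        return phi / phi.sum()                                        # b_k(x)

    def update(self, x, r, v, mu, sigma):
        """One Euler step given state x, reward r and the exploratory output v
        sampled around mean mu with standard deviation sigma (eq. (12))."""
        b = self.basis(x)
        V = float(self.w_c @ b)
        delta = r - V / self.tau + (V - self.V_prev) / self.dt        # TD error, eq. (17)

        # Critic: dV/dw^c = b(x); eligibility and weight updates, eqs. (18)-(19)
        self.e_c += self.dt * (-self.e_c / self.kappa_c + b)
        self.w_c += self.dt * self.alpha * delta * self.e_c

        # Actor: d ln(pi)/dw^mu = (v - mu)/sigma^2 * b(x); eqs. (21)-(22)
        self.e_mu += self.dt * (-self.e_mu / self.kappa_a + (v - mu) / sigma**2 * b)
        self.w_mu += self.dt * self.beta * delta * self.e_mu

        self.V_prev = V
        return delta

# 15 x 15 grid of basis centers on -2.0 <= (roll rate, pitch rate) <= 2.0
grid = np.linspace(-2.0, 2.0, 15)
centers = np.array([[a, b] for a in grid for b in grid])
agent = NGnetActorCritic(centers, widths=np.full_like(centers, 3.5))  # assumed width
```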

Figure 6: Snapshots of straight steady walking with the acquired feedback controller (A_x = 0.015 m, A_z = 0.005 m, v_x = 0.077 m/s; photos captured every 0.1 sec)

Experiments

Simulation Results and Hardware Verifications

The learning sequence is as follows: at the beginning of each trial, we utilize a hand-designed feedback controller to initiate the walking gait for several steps. Then we switch to the feedback controller given by the learning algorithm at a random instant in order to generate various initial state inputs. Each trial is terminated when the immediate reward becomes less than -1 or the robot has walked for 20 sec. The learning process is considered successful when the robot does not fall over in 20 successive trials.

Table 1: Achievements of acquisition

  k_S        Num. of Exp.   Achievements Sim. (trials)   Achievements Hard. (trials)
  0.0        4              1 (2385)                     0 (-)
  1.0        3              3 (528)                      3 (1600)
  2.0        3              3 (195)                      1 (800)
  3.0        4              4 (464)                      2 (1500)
  4.0        3              2 (192)                      2 (350)
  5.0        5              2 (1011)                     1 (600)
  10.0       5              5 (95)                       5 (460)
  Sum (Ave.) 27             20 (696)                     14 (885)

Figure 4: Typical learning curve (accumulated reward vs. trials)
Figure 5: Learned policy

At the beginning of a learning process, the robot immediately fell over within a few steps, as expected. The policy gradually increased the output amplitude of the feedback controller to improve the walking motion as learning proceeded. We ran 27 experiments with various velocity rewards k_S, and a walking motion was successfully acquired in 20 of them (Table 1). Typically, it took 20 hours to run one simulation of 1000 trials, and the policy was acquired after 696 trials on average. Fig. 4 and Fig. 5 show a typical example of the accumulated reward at each trial and an acquired policy, respectively. In Fig. 5, while \dot{\theta}_{roll} dominates the policy output, \dot{\theta}_{pitch} does not have much influence. The likely reason is that \dot{\theta}_{roll} is always generated by the stepping motion in the Z direction regardless of the propulsive motion, so the policy tries to utilize \dot{\theta}_{roll} to generate cooperative leg movements. On the other hand, \dot{\theta}_{pitch} is suppressed by the reward function so as not to decrease the pelvis height through a pitching oscillation, which usually causes falling.

We implemented the 20 acquired feedback controllers of Table 1 on the hardware and confirmed that 14 of them successfully achieved steady walking on a carpet floor with slight undulation. Fig. 6 shows snapshots of an acquired walking pattern. We also carried out walking experiments on a slope, and the acquired policy achieved steady walking in the range of +3 to -4 deg inclination, suggesting sufficient walking stability.

We also verified the improvement of the policy by implementing policies taken from different trials of the same learning process. With a policy from the early stage of learning, the robot exhibited back and forth stepping and then immediately fell over. With a policy from the intermediate stage, the robot performed unsteady forward walking and occasional stepping on the spot. With the policy obtained after a substantial number of trials, the robot finally achieved steady walking.

On average, an additional 189 trials in the numerical simulation were required for the policy to achieve walking in the physical environment. This result suggests that the learning process successfully improves robustness against perturbations by using the entrainment property.

Velocity Control

To control the walking velocity, we investigated the relationship between the reward function and the acquired velocity. We set the parameters in eqns. (1) and (2) as \tau_{CPG} = 0.105, \tau'_{CPG} = 0.132, c = 2.08, \gamma = 2.5 to generate an inherent oscillation whose amplitude and period are 1.0 and 0.8 s, respectively. Since we set A_x = 0.015 m, the expected walking velocity with the intrinsic oscillation is 0.075 m/s (a step length of 2A_x = 0.03 m every half period of 0.4 s).

We measured the average walking velocity both in numerical simulations and in hardware experiments with various k_S in the range 0.0 to 5.0 (Fig. 7). The resultant walking velocity in the simulation increased as we increased k_S, and the hardware experiments demonstrated a similar tendency.

This result shows that the reward function works appropriately to obtain a desirable feedback policy, which is difficult to achieve with a hand-designed controller. It would also be possible to acquire different feedback controllers with other criteria, such as energy efficiency or walking direction, by using the same scheme.

Stability Analysis

To quantify the stability of an acquired walking controller, we consider the periodic walking motion as discrete dynamics and analyze the local stability around a fixed point using a return map. We perturbed the target trajectory on (q_5 - q_6) to change the step length at random timing during steady walking, and captured the states of the robot when the left leg touched down. We measured two steps, the one right after the perturbation (x_n) and the next step (x_{n+1}). If the acquired walking motion is locally stable, the absolute eigenvalue of the return map should be less than 1.
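The eigenvalue reported below is obtained from a least-squares line fit to the logged (x_n, x_{n+1}) pairs; a minimal sketch of that estimate is given here. The synthetic data only mimic the fit shown in Fig. 8 and are an assumption for illustration.

```python
import numpy as np

def return_map_fit(x_n, x_np1):
    """Least-squares fit x_{n+1} ~ a * x_n + b to logged step-length pairs.
    |a| < 1 indicates local stability of the periodic walking motion;
    the fixed point of the linearized map is b / (1 - a)."""
    A = np.column_stack([x_n, np.ones_like(x_n)])
    (a, b), *_ = np.linalg.lstsq(A, x_np1, rcond=None)
    return a, b, b / (1.0 - a)

# Illustrative synthetic data roughly matching the reported fit
# x_{n+1} = -0.0101 x_n + 0.027 (values assumed for this example).
rng = np.random.default_rng(1)
x_n = rng.uniform(0.012, 0.033, size=50)
x_np1 = -0.0101 * x_n + 0.027 + 0.0005 * rng.standard_normal(50)

a, b, x_fix = return_map_fit(x_n, x_np1)
print(f"eigenvalue ~ {a:.4f}, fixed point ~ {x_fix:.4f} m")
```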

Figure 7: The relationship between acquired velocity and velocity reward (acquired average velocity [m/s] vs. k_S, numerical simulation and hardware experiment)
Figure 8: Return map of step length (x_{n+1} vs. x_n; least-squares fit x_{n+1} = -0.0101 x_n + 0.027)
Figure 9: An example of additional online learning using the physical robot (accumulated reward vs. trials)

Fig. 8 shows the return map with 50 data points; the white dot indicates the fixed point derived by averaging 100 steps without perturbation. The estimated eigenvalue is -0.0101, calculated by a least-squares fit. This result suggests that even if the step length is reduced to half of the nominal step length by a perturbation, for example when the robot is pushed forward, the feedback controller quickly converges back to the steady walking pattern within one step.

Additional online learning

As shown in Table 1, 6 policies obtained in numerical simulations could not achieve walking with the physical robot in the real environment. Even in this case, we can perform additional online learning using the real hardware, because the computational cost of the numerical simulation is mainly due to the dynamics calculation, not to the learning algorithm itself. In this section, we attempt to improve the policies obtained in numerical simulations which could not originally produce steady walking in the hardware experiments.

For the reward function, the walking velocity was calculated from the relative velocity of the stance leg with respect to the pelvis, and the body height was measured from the joint angles of the stance leg and the absolute body inclination derived by integrating the gyro sensor. We introduced digital filters to cut off the measurement noise. Note that we did not introduce any external sensor for this experiment.

Despite the delayed and inaccurate reward information, the online learning algorithm successfully improved the initial policy and achieved steady walking within 200 trials (which took 2.5 hours). Fig. 9 shows an example of additional online learning where k_S = 4.0, k_H = 10.0 and \hat{h} = 0.275. (Note that the value of the accumulated reward differs from the simulated result in Fig. 4 due to the different time duration of 16 sec for one trial.) The gray line indicates the running average of the accumulated reward over 20 trials.

Discussion

In this section, we discuss the motivation for the proposed framework.

Most existing humanoids utilize a pre-planned nominal trajectory and require precise modeling and precise joint actuation with high joint control gains to track the nominal trajectory in order to accomplish successful locomotion. However, it is not possible to design the nominal trajectory for every possible situation in advance. Thus, these model-based approaches may not be desirable in an unpredictable or dynamically changing environment.

Learning is one of the alternative approaches, as it has the potential capability of adapting to environmental changes and modeling errors. Moreover, learning may provide an efficient way of producing various walking patterns by simply introducing different higher-level criteria, such as walking velocity or energy efficiency, without changing the learning framework itself.

However, because humanoid robots have many degrees of freedom, we cannot directly apply conventional reinforcement learning methods to them. To cope with this large-scale problem, Tedrake et al. exploited inherent passive dynamics. The question is whether we can easily extend their approach to a general humanoid robot. They used the desired state on the return map, taken from the gait of the robot walking down a slope without actuation, to define the reward function for reinforcement learning. Their learning algorithm owes much to the intrinsic passive dynamical characteristics of a robot that can walk down a slope without actuation. However, their approach cannot be directly applied to general humanoid robots that are not mechanically designed with such specific dynamical characteristics for walking alone.

Therefore, in this study, instead of making use of the passive dynamics of the system, we proposed to use CPGs to generate basic periodic motions and to reduce the dimensionality of the state space for the learning system by exploiting the entrainment of neural oscillators for biped locomotion with a full-body humanoid robot. As a result, we only considered a 2-dimensional state space for learning. This is a drastic reduction from the total of 36 dimensions, including the robot and CPG dynamics, that would be required if we treated the problem as an MDP and assumed that all the states are measurable.

Since we only considered a low-dimensional state space, we had to treat our learning problem as a POMDP. Although it is computationally infeasible to calculate the optimal value function for a POMDP by using dynamic programming in a high-dimensional state space (Kaelbling, Littman, & Cassandra 1998), we can still attempt to find a locally optimal policy by using policy gradient methods (Kimura & Kobayashi 1998; Jaakkola, Singh, & Jordan 1995; Sutton et al. 2000).

We succeeded in acquiring the biped walking controller by using the policy gradient method with the CPG controller within a feasible number of trials. One of the important factors behind this successful result was the selection of the input states. An inappropriate selection of the input states may cause a large variance of the gradient estimate and may lead to a large number of trials. Although we manually selected \dot{\theta}_{roll} and \dot{\theta}_{pitch} as the input states for our learning system, the development of a method to automatically select the input states forms part of our future work.

Conclusion

In this paper, we proposed an efficient learning framework for a CPG-based biped locomotion controller using the policy gradient method. We decomposed the walking motion into a stepping motion in place and a propulsive motion, and the feedback pathways for the propulsive motion were acquired using the policy gradient method. Despite a considerable number of hidden variables, the proposed framework successfully obtained a walking pattern within 1000 trials on average in the simulator. The acquired feedback controllers were implemented on a 3D hardware robot and demonstrated robust walking in the physical environment. We discussed velocity control and stability, as well as the possibility of additional online learning with the hardware robot.

We plan to investigate energy efficiency and robustness against perturbations using the same scheme, as well as the acquisition of a policy for the stepping motion. We also plan to implement the proposed framework on a different platform to demonstrate sufficient capability to handle configuration changes, and to introduce additional neurons to generate more natural walking.

Acknowledgements

We would like to thank Seiichi Miyakoshi of the Digital Human Research Center, AIST, Japan, and Kenji Doya of the Initial Research Project, Okinawa Institute of Science and Technology, for productive and invaluable discussions. We would also like to thank everyone involved in QRIO's development for supporting this research.

References

Benbrahim, H., and Franklin, J. A. 1997. Biped dynamic walking using reinforcement learning. Robotics and Autonomous Systems 22:283–302.

Cohen, A. H. 2003. Control principle for locomotion – looking toward biology. In 2nd International Symposium on Adaptive Motion of Animals and Machines. (CD-ROM, TuP-K-1).

Doya, K. 2000. Reinforcement learning in continuous time and space. Neural Computation 12:219–245.

Endo, G.; Nakanishi, J.; Morimoto, J.; and Cheng, G. 2005. Experimental studies of a neural oscillator for biped locomotion with QRIO. In IEEE International Conference on Robotics and Automation, 598–604.

Hase, K., and Yamazaki, N. 1998. Computer simulation of the ontogeny of biped walking. Anthropological Science 106(4):327–347.

Ishida, T., and Kuroki, Y. 2004. Development of sensor system of a small biped entertainment robot. In IEEE International Conference on Robotics and Automation, 648–653.

Ishida, T.; Kuroki, Y.; and Yamaguchi, J. 2003. Mechanical system of a small biped entertainment robot. In IEEE International Conference on Intelligent Robots and Systems, 1129–1134.

Jaakkola, T.; Singh, S. P.; and Jordan, M. I. 1995. Reinforcement learning algorithm for partially observable Markov decision problems. Advances in Neural Information Processing Systems 7:345–352.

Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. R. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101:99–134.

Kimura, H., and Kobayashi, S. 1998. An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value function. In International Conference on Machine Learning, 278–286.

Kimura, H.; Fukuoka, Y.; and Cohen, A. H. 2003. Biologically inspired adaptive dynamic walking of a quadruped robot. In 8th International Conference on the Simulation of Adaptive Behavior, 201–210.

Konda, V., and Tsitsiklis, J. 2003. On actor-critic algorithms. SIAM Journal on Control and Optimization 42(4):1143–1166.

Matsubara, T.; Morimoto, J.; Nakanishi, J.; and Doya, K. 2005. Learning feedback pathways in CPG with policy gradient for biped locomotion. In IEEE International Conference on Robotics and Automation, 4175–4180.

Matsuoka, K. 1985. Sustained oscillations generated by mutually inhibiting neurons with adaptation. Biological Cybernetics 52:345–353.

Mori, T.; Nakamura, Y.; Sato, M.; and Ishii, S. 2004. Reinforcement learning for a CPG-driven biped robot. In Nineteenth National Conference on Artificial Intelligence, 623–630.

Peters, J.; Vijayakumar, S.; and Schaal, S. 2003. Reinforcement learning for humanoid robotics. In International Conference on Humanoid Robots. (CD-ROM).

Sutton, R. S.; McAllester, D.; Singh, S.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems 12:1057–1063.

Taga, G. 1995. A model of the neuro-musculo-skeletal system for human locomotion I. Emergence of basic gait. Biological Cybernetics 73:97–111.

Tedrake, R.; Zhang, T. W.; and Seung, H. S. 2004. Stochastic policy gradient reinforcement learning on a simple 3D biped. In International Conference on Intelligent Robots and Systems, 2849–2854.

Williamson, M. 1998. Neural control of rhythmic arm movements. Neural Networks 11(7–8):1379–1394.
