Machine Learning Algorithms in Bipedal Robot Control

Shouyi Wang, Wanpracha Chaovalitwongse and Robert Babuška

Abstract—In the past decades, machine learning techniques, such as supervised learning, reinforcement learning, and unsupervised learning, have been increasingly used in the control engineering community. Various learning algorithms have been developed to achieve autonomous operation and intelligent decision making for many complex and challenging control problems. One such problem is bipedal walking robot control. Although still in their early stages, learning techniques have demonstrated promising potential to build adaptive control systems for bipedal robots. This paper gives a review of recent advances in the state-of-the-art learning algorithms and their applications to bipedal robot control. The effects and limitations of different learning techniques are discussed through a representative selection of examples from the literature. Guidelines for future research on learning control of bipedal robots are provided in the end.

Index Terms—Bipedal walking robots, learning control, supervised learning, reinforcement learning, unsupervised learning

I. INTRODUCTION

Bipedal robot control is one of the most challenging and popular research topics in the field of robotics. We have witnessed an escalating development of bipedal walking robots based on various types of control mechanisms. However, unlike well-solved classical control problems (e.g., the control of industrial robot arms), the control problem of bipedal robots is still far from fully solved. Although many classical model-based control techniques have been applied to bipedal robot control, such as trajectory tracking control [76], robust control [105], and model predictive control [57], these control laws are generally pre-computed and inflexible. The resulting bipedal robots are usually not satisfactory in terms of stability, adaptability and robustness. There are five exceptional characteristics of bipedal robots that present challenges and constraints to the design of control systems:

• Non-linear dynamics: Bipedal robots are highly non-linear and naturally unstable systems. The well-developed classical control theories for linear systems cannot be applied directly.
• Discretely changing dynamics: Each walking cycle consists of two different situations in sequence: the statically stable double-support phase (both feet in contact with the ground) and the statically unstable single-support phase (only one foot in contact with the ground). Suitable control strategies are required for step-to-step transitions.
• Under-actuated system: Walking robots are not connected to the ground. Even if all joints of a bipedal robot are controlled perfectly, it is still not enough to completely control all the degrees of freedom (DOFs) of the robot.
• Multi-variable system: Walking systems usually have many DOFs, especially in 3-dimensional spaces. The interactions between DOFs and the coordination of multi-joint movements have been recognized as a very difficult control problem.
• Changing environments: Bipedal robots have to be adaptive to uncertainties and respond to environmental changes correctly. For example, the ground may become uneven, elastic, sticky, soft or stiff, and there may be obstacles on the ground. A bipedal robot has to adjust its control strategies quickly enough to cope with such environmental changes.

In recent years, great advances in computing power have enabled the implementation of highly sophisticated learning algorithms in practice. Learning algorithms are among the most valuable tools to solve complex problems that need 'intelligent decision making', and to design truly intelligent machines with human-like capabilities. Robot learning is a rapidly growing area of research at the intersection of robotics and machine learning [22]. With a classical control approach, a robot is explicitly programmed to perform the desired task using a complete mathematical model of the robot and its environment. The parameters of the control algorithms are often chosen by hand after extensive empirical testing. In a learning control approach, on the other hand, a robot is only provided with a partial model, and a machine learning algorithm is employed to fine-tune the parameters of the control system until the robot acquires the desired skills. A learning controller is capable of improving its control policy autonomously over time, in some sense tending towards an ultimate goal. Learning control techniques have shown great potential for adaptability and flexibility, and have thus become extremely active in recent years. There have been a number of successful applications of learning algorithms on bipedal robots [11], [25], [51], [82], [104], [123]. Learning control techniques appear to be promising in making bipedal robots reliable, adaptive and versatile. In fact, building intelligent humanoid walking robots has been one of the main research streams in machine learning. If such robots are ever to become a reality, learning control techniques will definitely play an important role.

Shouyi Wang is with the Department of Industrial and Manufacturing Systems Engineering, University of Texas at Arlington, Arlington, TX 76019, USA. E-mail: [email protected]
Wanpracha Chaovalitwongse is with the Department of Industrial and Systems Engineering and the Department of Radiology, University of Washington, Seattle, WA 98104, USA. E-mail: [email protected]
Robert Babuška is with the Delft Center for Systems and Control, Delft University of Technology, Delft, 2628 CD, Netherlands. E-mail: [email protected]
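The contrast drawn above, between hand-tuned control parameters and a controller that tunes its own parameters from measured performance, can be illustrated with a minimal sketch. The first-order plant, the single proportional gain, and the learning rate below are invented for illustration and do not correspond to any robot or method cited in this survey.

```python
import numpy as np

def simulate(k, target=1.0, dt=0.05, steps=200):
    """Mean squared tracking error of a toy first-order plant x' = -x + u
    under a proportional control law u = k * (target - x)."""
    x, errs = 0.0, []
    for _ in range(steps):
        e = target - x
        u = k * e                 # control law with a single tunable gain
        x += dt * (-x + u)        # Euler step of the (hypothetical) plant
        errs.append(e * e)
    return float(np.mean(errs))

# Classical approach: the gain k would be fixed by hand after testing.
# Learning approach: adapt k from closed-loop performance alone, here by
# a finite-difference gradient step on the measured cost.
k, lr, eps = 0.5, 2.0, 1e-3
for _ in range(50):
    grad = (simulate(k + eps) - simulate(k - eps)) / (2 * eps)
    k -= lr * grad                # gradient descent on the tracking cost

improved = simulate(k) < simulate(0.5)   # the adapted gain tracks better
```

In the same spirit, a learning controller on a real robot only needs a measure of performance, not a complete analytical model of the closed loop.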
There are several comprehensive reviews of bipedal walking robots [16], [50], [109]. However, none of them has been specifically dedicated to reviewing the state-of-the-art learning techniques in the area of bipedal robot control. This paper aims at bridging this gap. The objectives of this paper are two-fold. The first goal is to review the recent advances of mainstream learning algorithms. The second objective is to investigate how learning techniques can be applied to bipedal walking control through the most representative examples.

The rest of the paper is organized as follows. Section II presents an overview of the three major types of learning paradigms and surveys the recent advances of the most influential learning algorithms. Section III provides an overview of the background of bipedal robot control, including stability criteria and classical model-based and biologically-inspired control approaches. Section IV presents the state-of-the-art learning control techniques that have been applied to bipedal robots. Section V gives a technical comparison of learning algorithms by their advantages and disadvantages. Finally, we identify some important open issues and promising directions for future research.

II. LEARNING ALGORITHMS

Learning algorithms specify how the changes in a learner's behavior depend on the inputs it has received and on the feedback from the environment. Given the same input, a learning agent may respond differently later on than it did earlier on. With respect to the sort of feedback that a learner has access to, learning algorithms generally fall into three broad categories: supervised learning (SL), reinforcement learning (RL) and unsupervised learning (UL). The basic structures of the three learning paradigms are illustrated in Figure 1.

Fig. 1. Basic structures of the three learning paradigms: supervised learning (output corrected by a teacher/target error signal), reinforcement learning (output shaped by a reward-based performance evaluation), and unsupervised learning (no external feedback).

A. Supervised Learning (SL)

SL is a machine learning mechanism that first finds a mapping between inputs and outputs based on a training data set, and then makes predictions for inputs that it has never seen in training. To achieve good generalization performance, the training data set should contain a fully representative collection of data so that a valid general mapping between inputs and outputs can be found. SL is one of the most frequently used learning mechanisms in designing learning systems. A large number of SL algorithms have been developed in the past decades. They can be categorized into several major groups, as discussed in the following.

1) Neural Networks (NNs): NNs are powerful tools that have been widely used to solve many SL tasks where there exists a sufficient amount of training data. There are several popular learning algorithms for training NNs (such as the perceptron learning rule and the Widrow-Hoff rule), but the most well-known and commonly used one is backpropagation (BP), developed by Rumelhart in the 1980's [88]. BP adjusts the weights of a NN by calculating how the error changes as each weight is increased or decreased slightly. The basic update rule of BP is given by:

ω_j = ω_j − α ∂E/∂ω_j, (1)

where α is the learning rate that controls the size of weight changes at each iteration, and ∂E/∂ω_j is the partial derivative of the error function E with respect to weight ω_j. BP-based NNs have become popular in practice since they can often find a good set of weights in a reasonable amount of time. They can be used to solve many problems involving large amounts of data and complex mapping relationships. As a gradient-based method, BP is subject to the local minima problem, which makes it inefficient in searching for globally optimal solutions. One approach to tackle this problem is to try different initial weights until a satisfactory solution is found [119].

In general, the major advantage of NN-based SL methods is that they are convenient to use and one does not have to understand the solution in great detail. For example, one does not need to know anything about a robot's model; a NN can be trained to estimate the robot's model from the robot's input-output data. However, the drawback is that the learned NN is usually difficult to interpret due to its complicated structure.

2) Locally Weighted Learning (LWL): Instead of mapping nonlinear functions globally (as BP does), LWL represents another class of methods, which fit complex nonlinear functions by local kernel functions. A demonstration of LWL is shown in Figure 2. There are two major types of LWL: memory-based LWL, which simply stores all training data in memory and uses efficient interpolation techniques to make predictions for new inputs [1]; and non-memory-based LWL, which constructs compact representations of the training data by recursive techniques so as to avoid storing large amounts of data in memory [62], [107]. The key part of all LWL algorithms is to determine the region of validity in which a local model can be trusted. Suppose there are K local models; the region of validity can be calculated from a Gaussian kernel by:

ω_k = exp(−(1/2)(x − c_k)^T D_k (x − c_k)), (2)

where c_k is the center of the k-th linear model, and D_k is the distance metric that determines the size and shape of the validity region of the k-th linear model. Given a query point x, every linear model calculates a prediction ŷ_k(x) based on the obtained local validity. The output of LWL is then the normalized weighted mean of all K linear models, calculated by:

ŷ = (Σ_{k=1}^K ω_k ŷ_k) / (Σ_{k=1}^K ω_k). (3)

Fig. 2. A schematic view of locally weighted regression.

LWL achieves low computational complexity and efficient learning in high-dimensional spaces. Another attractive feature of LWL is that local models can be allocated as needed, and the modeling process can be easily controlled by adjusting the parameters of the local models. LWL techniques have been used quite successfully to learn inverse dynamics or kinematic mappings in robot control systems [6], [7]. One of the most popular LWL algorithms is locally weighted projection regression (LWPR), which has shown good capability to solve several on-line learning problems of humanoid robot control [108].

3) Support Vector Machine (SVM): SVM is a widely used classification technique in machine learning [20]. It has been used in pattern recognition and classification problems, such as handwriting recognition [96], speaker identification [95], face detection in images [74], and text categorization [42]. The most important idea of SVM is that every data instance can be classified by a hyperplane if the data set is transformed into a space with sufficiently high dimensions [14]. Therefore, a SVM first projects input data instances into a higher-dimensional space, and then divides the space with a separating hyperplane which not only minimizes the misclassification error but also maximizes the margin separating the two classes. One of the most successful optimization formalisms of SVM is based on robust linear programming. Consider two data groups in the n-dimensional real space R^n; the optimization formalism is given by:

min_{ω,γ,y,z}  (e^T y)/m + (e^T z)/k  (4)
s.t.  Aω − eγ + y ≥ e  (5)
      −Bω + eγ + z ≥ e  (6)
      y ≥ 0, z ≥ 0,  (7)

where A is an m × n matrix representing the m observations in group one, and B is a k × n matrix representing the k observations in group two. The two data groups are separated by a hyperplane (ideally satisfying Aω ≥ eγ and Bω ≤ eγ), and y and z are nonnegative slack variables indicating how far a data instance in group A or B violates its hyperplane constraint. The objective function therefore minimizes the average misclassification error subject to the hyperplane constraints separating the data instances of A from those of B. The training of a SVM obtains a global solution instead of a local optimum. However, one drawback of SVM is that the results are sensitive to the choice of the kernel function. The problem of choosing an appropriate kernel function is still left to the user's creativity and experience.

4) Decision Tree: Decision trees use a hierarchical tree model to classify or predict data instances. Given a set of training data with associated attributes, a decision tree can be induced by algorithms such as ID3 [83], CART [13], and C4.5 [84]. While ID3 and C4.5 are primarily suitable for classification tasks, CART has been specifically developed for regression problems. The most well-known algorithm is C4.5 [84], which builds decision trees using the concept of Shannon entropy [98]. Based on the assumption that each attribute of a data instance can be used to make a decision, C4.5 examines the relative entropy for each attribute and accordingly splits the data set into smaller subsets. The attribute with the highest normalized information gain is used to make the decision. Ruggieri [87] provided an efficient version of C4.5, called EC4.5, which is claimed to achieve a performance gain of up to five times while computing the same decision trees as C4.5. Olcay and Onur [120] presented three parallel C4.5 algorithms designed to be applicable to large data sets. Baik and Bala [9] presented a distributed version of decision trees, which generates partial trees and communicates the temporary results among them in a collaborative way. The distributed decision trees are efficient for large data sets collected in distributed systems.

One of the most useful characteristics of decision trees is that they are simple to understand and easy to interpret; people can understand decision tree models after a brief explanation. It should be noticed that a common assumption made in decision trees is that data instances belonging to different classes have different values in at least one of their attributes. Therefore, decision trees tend to perform better when dealing with discrete or categorical attributes, and will encounter problems when dealing with continuous data. Moreover, another limitation of decision trees is that they are usually sensitive to noise.

B. Reinforcement Learning

Among other modes of learning, humans heavily rely on learning from interaction: repeating experiments with small variations and then finding out what works and what does not. Consider a child learning to walk: it tries out various movements; some actions work and are rewarded (moving forward), while others fail and are punished (falling). Inspired by animal and human learning, the reinforcement learning (RL) approach enables an agent to learn a mapping from states to actions by trial and error so that the expected cumulative future reward is maximized.

1) General RL Scheme: RL is capable of learning while gaining experience through interactions with the environment.
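The interaction-based learning scheme just introduced can be sketched as a simple loop in which an agent acts, observes the resulting state and reward, and accumulates a discounted sum of rewards. The five-state "corridor" environment, the random action choice, and the discount value below are invented purely for illustration.

```python
import random

def env_step(state, action):
    """Toy environment: states 0..4; the agent earns reward 1 on reaching
    state 4, which ends the episode. Purely illustrative."""
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward, next_state == 4

random.seed(0)
gamma = 0.9                 # discount rate, 0 < gamma < 1
state, ret, discount = 0, 0.0, 1.0
for t in range(100):
    action = random.choice([-1, +1])    # a (deliberately naive) policy
    state, reward, done = env_step(state, action)
    ret += discount * reward            # discounted return seen from t = 0
    discount *= gamma
    if done:
        break
```

A learning agent goes beyond this sketch by using the observed rewards to improve its action choices over repeated episodes.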
It provides both qualitative and quantitative frameworks for understanding and modeling adaptive decision-making problems in the form of rewards and punishments. There are three fundamental elements in a typical RL scheme:

• a state set S, in which a state s ∈ S describes the system's current situation in its environment,
• an action set A, from which an action a ∈ A is chosen at the current state,
• a scalar reward r ∈ R, which indicates how well the agent is currently doing with respect to the given task.

At each discrete time step t, a RL agent receives its state information s_t ∈ S, and takes an action a_t ∈ A to interact with its environment. The action a_t changes the environment state from s_t to s_{t+1}, and this change is communicated to the learning agent through a scalar reward r_{t+1}. Usually, the sign of the reward indicates whether the chosen action a_t was good (positive reward) or bad (negative reward). The RL agent attempts to learn a policy that maps state s_t to action a_t so that the sum of the expected future rewards R_t is maximized. The sum of future rewards is usually formulated in a discounted way [102], given by:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^∞ γ^k r_{t+k+1}, (8)

where γ is called the discounting rate and satisfies 0 < γ < 1.

Applications of RL have been reported in areas such as robotics, manufacturing, computer game playing, and economics [60]. Recently, RL has also been used in psychology and cognitive models to simulate human learning in problem-solving and skill acquisition [31].

2) Two Basic RL Structures: Many RL algorithms are available in the literature. The key element of most of them is to approximate the expected future rewards for each state or each state-action pair (under the current policy). There are two prevalent RL structures: the actor-critic scheme [56] and the Q-learning scheme [114].

• An actor-critic algorithm has two separate function approximators, for the action policy and the state values, respectively. The learned policy function is known as the actor, because it is used to select actions. The estimated value function is known as the critic, since it evaluates the actions made by the actor. The value function and the policy function are usually both updated by the temporal difference (TD) error.
• Q-learning algorithms learn a state-action value function, known as the Q-function, which is often represented by a lookup table indexed by state-action pairs. Since the Q-table is constructed on the state-action space rather than just the state space, it discriminates the effects of choosing different actions in each state. Compared to actor-critic algorithms, Q-learning is easier to understand and implement.

The basic structures of actor-critic learning and Q-learning algorithms are shown in Figure 3 and Figure 4, respectively.

Fig. 3. The actor-critic learning architecture for robot control.

Fig. 4. The Q-learning architecture for robot control.

3) Recent Advances in RL: Most RL algorithms suffer from the curse of dimensionality, as the number of parameters to be learned grows exponentially with the size of the state space. Thus most RL methods are not applicable to high-dimensional systems. One of the open questions in RL is how to scale up RL algorithms to high-dimensional state-action spaces. Recently, policy-gradient methods have attracted great attention in RL research, since they are considered applicable to high-dimensional systems. Policy-gradient RL has been applied to complex systems with many degrees of freedom (DOFs), such as robot walking [25], [55], [70], [104], and traffic control [86]. Peters et al. [77] made a comprehensive survey of policy-gradient-based RL methods, and developed a class of RL algorithms called natural actor-critic learning, in which the action policy is updated based on natural policy gradients [48]. The efficiency of the proposed learning algorithms was demonstrated on a 7-DOF real robot arm that was programmed to learn to hit a baseball. The natural actor-critic algorithm is currently considered the best choice among the policy-gradient methods [78]. In recent years, hierarchical RL approaches have also been developed to handle the curse of dimensionality [61]. Multi-agent or distributed RL is also an emerging topic in current RL research [33]. Some researchers also use predictive state representations to improve the generalization ability of RL [85].

C. Unsupervised Learning

UL is inspired by the brain's ability to extract patterns and recognize complex visual scenes, sounds, and odors from sensory data. It has roots in neuroscience and psychology, and is based on information theory and statistics. An unsupervised learner receives no feedback from its environment at all; it only responds to the received inputs. At first glance this seems impractical: how can we train a learner if we do not know what it is supposed to do? In fact, most of these algorithms perform some kind of clustering or associative rule learning.

1) Clustering: Clustering is the most important form of UL. It deals with data that have not been pre-classified in any way, and does not need any type of supervision during its learning process. Clustering is a learning paradigm that automatically partitions the input data into meaningful clusters based on their degree of similarity.

The most well-known clustering algorithm is k-means clustering, which finds k cluster centers that minimize a squared-error criterion function [23]. Cluster centers are represented by the centers of gravity of the data instances; that is, the cluster centers are the arithmetic means of all data samples in the cluster. k-means clustering assigns each data instance to the cluster whose center is nearest to it. Since k-means clustering generates partitions such that each pattern belongs to one and only one cluster, the obtained clusters are disjoint. Fuzzy c-means (FCM) was developed to allow one data instance to belong to two or more clusters rather than being assigned completely to one cluster [24]. Each data instance is associated with each cluster by a membership function, which indicates the degree of membership to that cluster. The FCM algorithm finds the weighted mean of each cluster and then assigns a membership degree to each data sample in the cluster. For example, data samples on the edge of a cluster belong to the cluster to a lower degree than the data around the center of the cluster.

Recently, distributed clustering algorithms have attracted considerable attention for extracting knowledge from large data sets [41], [4]. Instead of being transmitted to a central site, the data can first be clustered independently at different local sites. In a subsequent step, the central site establishes a global clustering based on the local clustering results.

2) Hebbian Learning: The key idea of Hebbian learning [37] is that neurons with correlated activity increase their synaptic connection strength. It is used in artificial neural networks to learn associations between patterns that frequently occur together. The original Hebb's hypothesis does not explicitly address the update mechanism for synaptic weights. A generalized version of Hebbian learning, called the differential Hebbian rule [58], [54], can be used to update the synaptic weights. The basic update rule of differential Hebbian learning is given by:

w_ij^new = w_ij^old + η Δx_i Δy_j, (9)

where w_ij is the synaptic strength from neuron i to neuron j, Δx_i and Δy_j denote the temporal changes of presynaptic and postsynaptic activities, and η is the learning rate that controls how fast the weights are modified in each step. Notably, differential Hebbian learning can be used to model a simple level of adaptive control that is analogous to self-organizing cortical function in humans. It can be applied to construct an unsupervised, self-organized learning control system for a robot to interact with its environment without any evaluative information. Although it seems to be a low level of learning, Porr and Wörgötter [80] showed that this autonomous mechanism can develop rather complex behavioral patterns in closed-loop feedback systems. They confirmed this idea on a real bipedal robot, which was capable of walking stably using unsupervised differential Hebbian learning [32].

III. BACKGROUND OF BIPEDAL WALKING CONTROL

According to a US army report, more than 50 percent of the earth's surface is inaccessible to traditional vehicles with wheels or tracks [5], [10]. However, we have to traverse rough terrain in many real-world tasks, such as emergency rescue in isolated areas with unpaved roads, relief after a natural disaster, and alternatives to human labor in dangerous working environments. To date, the devices available to assist people in such tasks are still very limited. As promising tools to solve these problems, bipedal robots have become one of the most exciting and emerging topics in the field of robotics. Moreover, bipedal robots can also be used to develop new types of rehabilitation tools for disabled people and to help the elderly with household work. The important prospective applications of bipedal walking robots are shown in Figure 5.

Fig. 5. Prospective applications of bipedal walking robots: entertainment, work assistance, emergency rescue and search after a natural disaster, replacing humans in dangerous working environments, efficient transport on rough terrains, and rehabilitation.

A. Stability Criteria in Bipedal Robot Control

Bipedal robot walking can be broadly characterized as static walking, quasi-dynamic walking and dynamic walking. The different types of walking are generated by different walking stability criteria, as follows.

• Static Stability: The positions of the Center of Mass (COM) and the Center of Pressure (COP) are often used as stability criteria for static walking. A robot is considered stable if its COM or COP is within the convex hull of the foot support area. Static stability is the oldest and most constrained stability criterion, often used in the early days of bipedal robots. A typical static walking robot is SD-2, built by Salatian et al. [89].
• Quasi-Dynamic Stability: The most well-known criterion for quasi-dynamic walking is based on the concept of the Zero Moment Point (ZMP), introduced by M. Vukobratović et al. in [111]. The ZMP is the point on the ground where the resultant of the ground reaction force acts. A stable gait can be achieved by making the ZMP of a bipedal robot stay within the convex hull of the foot support area during walking. The ZMP is frequently used as a guideline in designing reference walking trajectories for many bipedal robots. An illustration of the ZMP criterion is shown in Figure 6. Recently, Sardain and Bessonnet [92] proposed a virtual COP-ZMP, which extends the concept of the ZMP to stability on uneven terrains. Another criterion for quasi-dynamic walking is the foot rotation indicator (FRI) point,
which is the point on the ground where the net ground reaction force would have to act to keep the foot stationary [36]. This walking stability criterion requires keeping the FRI point within the convex hull of the foot support area. One advantage of the FRI point is that it is capable of indicating the severity of instability: the longer the distance between the FRI point and the foot support boundary, the greater the degree of instability.

Fig. 6. An illustration of the ZMP criterion (ZMP = center of ground reaction).

TABLE I
WALKING SPEED OF BIPEDAL ROBOTS USING DIFFERENT STABILITY CRITERIA
(RELATIVE SPEED = WALKING SPEED / LEG LENGTH)

Stability Criterion | Robot       | Walking speed (m/s) | Leg length (m) | Relative speed (1/s)
Static              | SD-2 [89]   | 0.12                | 0.51           | 0.24
Quasi-Dynamic       | HRP-2L [46] | 0.31                | 0.69           | 0.44
Quasi-Dynamic       | QRIO [26]   | 0.23                | 0.30           | 0.77
Quasi-Dynamic       | ASIMO [38]  | 0.44                | 0.61           | 0.72
Quasi-Dynamic       | Meta [97]   | 0.58                | 0.60           | 0.97
Dynamic             | Rabbit [17] | 1.2                 | 0.8            | 1.50
Dynamic             | RunBot [32] | 0.8                 | 0.23           | 3.48

[Figure: bipedal walking control approaches, divided into dynamic-model-based approaches (e.g., various trajectory tracking methods) and biologically-inspired approaches (e.g., neural, fuzzy logic, evolutionary and passive-dynamic methods).]
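The static stability criterion above (the ground projection of the COM must lie inside the convex hull of the foot support area) reduces to a point-in-convex-polygon test. The rectangular single-support foot outline and the COM positions below are invented for illustration.

```python
def inside_convex_polygon(point, polygon):
    """True if `point` lies inside the convex `polygon` (counter-clockwise
    vertex order): the point must be on the same side of every edge."""
    px, py = point
    sign = 0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        cross = (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)
        if cross != 0:
            if sign == 0:
                sign = 1 if cross > 0 else -1
            elif (cross > 0) != (sign > 0):
                return False   # point is on the outer side of this edge
    return True

# Hypothetical single-support phase: the support polygon is one foot
# outline, in meters, listed counter-clockwise.
foot = [(0.00, 0.00), (0.10, 0.00), (0.10, 0.25), (0.00, 0.25)]

statically_stable = inside_convex_polygon((0.05, 0.12), foot)  # COM over foot
tipping = not inside_convex_polygon((0.30, 0.12), foot)        # COM outside
```

The ZMP and FRI criteria replace the COM projection with the ZMP or FRI point, but the containment test against the support polygon has the same form.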