A Bayesian Approach to Nonlinear Parameter Identification for Rigid Body Dynamics

Jo-Anne Ting∗, Michael Mistry∗, Jan Peters∗, Stefan Schaal∗† and Jun Nakanishi†‡

∗Computer Science, University of Southern California, Los Angeles, CA 90089-2520
†ATR Computational Neuroscience Labs, Kyoto 619-0288, Japan
‡ICORP, Japan Science and Technology Agency, Kyoto 619-0288, Japan
Email: {joanneti, mmistry, jrpeters, sschaal}@usc.edu, [email protected]

Abstract— For robots of increasing complexity such as humanoid robots, conventional identification of rigid body dynamics models based on CAD data and actuator models becomes difficult and inaccurate due to the large number of additional nonlinear effects in these systems, e.g., stemming from stiff wires, hydraulic hoses, protective shells, skin, etc. Data driven parameter estimation offers an alternative model identification method, but it is often burdened by various other problems, such as significant noise in all measured or inferred variables of the robot. The danger of physically inconsistent results also exists due to unmodeled nonlinearities or insufficiently rich data. In this paper, we address all these problems by developing a Bayesian parameter identification method that can automatically detect noise in both input and output data for the regression algorithm that performs system identification. A post-processing step ensures physically consistent rigid body parameters by nonlinearly projecting the result of the Bayesian estimation onto constraints given by positive definite inertia matrices and the parallel axis theorem. We demonstrate on synthetic and actual robot data that our technique performs parameter identification with 5 to 20% higher accuracy than traditional methods. Due to the resulting physically consistent parameters, our algorithm enables us to apply advanced control methods that algebraically require physical consistency on robotic platforms.

I. INTRODUCTION

Advanced robot control algorithms usually rely on model-based control techniques in order to accomplish a desired level of accuracy and compliance. Typical examples include computed torque control, inverse dynamics control and operational space control [1], [2]. Depending on their sophistication, these model-based controllers also have different levels of demands on the quality of the identified robot model. For instance, computed torque control is generally rather insensitive to modeling errors, while operational space control, with its explicit use of both the rigid body dynamics inertia matrix and the centripetal/Coriolis force vector, degrades significantly in the face of modeling errors. Thus, accurate model identification is a highly important topic for advanced robot control, and many modern robotics applications rely on it (e.g., as in haptic robotic devices, robotic surgery and the safe application of compliant assistive robots in human environments).

Ideally, system identification can be performed based on the CAD data of a robot provided by the manufacturer, at least in the context of rigid body dynamic systems, which will be the scope of this paper. However, many modern light-weight robots such as humanoid robots have significant additional nonlinear dynamics beyond the rigid body dynamics model, due to actuator dynamics, routing of cables, use of protective shells and other sources. In such cases, instead of trying to explicitly model all possible nonlinear effects in the robot, empirical system identification methods appear to be more useful. Under the assumption that a rigid body dynamics (RBD) model is sufficient to capture the entire robot dynamics, this problem is theoretically straightforward, as all unknown parameters of the robot such as mass, center of mass and inertial parameters appear linearly in the rigid body dynamics equations [3]. Hence, after an appropriate re-arrangement of the RBD equations of motion, parameter identification can be performed with linear least-squares techniques.
Several problems, however, make this seemingly straightforward empirical RBD system identification approach more challenging. Firstly, for high dimensional robotic systems, it is not easy to generate sufficiently rich data so that all parameters will be properly identifiable. Secondly, sensory data collected from a robot is noisy. Noise sources exist in both input and output data, and this effect is additionally amplified by numerical differentiation done to obtain derivative data from the sensors. Traditional linear regression techniques are only capable of dealing with output noise, and the presence of input noise introduces a persistent bias to the regression solution. Digital filtering of the data can eliminate some noise, but it often eliminates important structure as well. Techniques exist such as Total Least Squares (TLS) [4], [5], otherwise known as orthogonal least-squares regression [6], to perform parameter estimation, but TLS assumes that the variances of input noise and output noise are the same [7]. In real-world systems where this assumption does not hold, the resulting parameter estimates will be biased [8]. Thirdly, there is no mechanism in the regression problem for RBD model identification that ensures the identified parameters are physically plausible. Particularly in the light of insufficiently rich data and nonlinearities beyond the RBD model, one often encounters physically incorrectly identified parameters, such as negative values on the diagonal of an inertia matrix. Using physically incorrect data in model-based control leads to dangerously unstable controllers. The final problem with empirical RBD system identification is that some RBD parameters of a robot are not identifiable at all [3]. As a result, the regression problem for RBD parameter estimation is almost always numerically ill-conditioned and bears the danger of generating parameter estimates that strongly deviate from the true values, despite a seemingly low error fit of the data.

Various methods exist to deal with some of the problems mentioned above, such as regression based on singular-value decomposition (SVD), ridge regression to cope with the ill-conditioned data [9], or TLS and orthogonal least-squares to address input noise [6]. Nevertheless, a comprehensive approach to address the entire set of issues has not been suggested so far. Recent work such as [10] has addressed the problem of input noise, but in the context of system identification of a time-series, while ignoring the problems associated with ill-conditioned data in high dimensional spaces. In this paper, we suggest a Bayesian estimation approach to the RBD parameter estimation problem that has all the desired properties:

• Explicitly identifies input and output noise in the data
• Is robust in the face of ill-conditioned data
• Detects non-identifiable parameters
• Produces physically correct parameter estimates

A key component of our technique is a recently developed Bayesian machine learning framework that enables us to recast ordinary least squares (OLS) regression in a more advanced algorithm for input noise clean-up and numerical robustness, especially for very high dimensional estimation problems. A post-processing step ensures that the rigid body parameters are physically consistent by nonlinearly projecting the results of the Bayesian estimate onto the constraints. We will sketch the derivation of this algorithm and compare its results with other approaches in the context of identification of RBD parameters on synthetic data and on a robotic vision head. On average, our algorithm achieves a 5 to 20% improvement when compared to other standard techniques for the identification of RBD parameters.

The remaining paper is structured as follows. First, we motivate the problem of input noise in linear regression problems and outline possible solutions. Then, based on these insights, we develop a novel estimation technique that incorporates input noise detection and employs Bayesian regularization methods to ensure robustness of the algorithm for ill-conditioned data. Third, we add a post-processing step to our algorithm that enforces physical correctness of the estimated RBD parameters. Finally, we evaluate our approach for RBD parameter estimation on synthetic data, on a 7 degree-of-freedom (DOF) robotic vision head and on a 10 DOF robotic anthropomorphic arm.

II. HIGH DIMENSIONAL REGRESSION WITH INPUT NOISE

Before describing our solution to RBD parameter identification, it is useful to examine some of the problems associated with traditional estimation methods. For robot systems with many DOFs, the linear regression problem for identifying RBD parameters has hundreds of dimensions, i.e., at least 10 parameters for every DOF. Hence, estimation algorithms need to be suitable for this dimensionality. The re-arrangement of the RBD equations for parameter estimation creates a matrix where the input vectors x are arranged in the rows of the matrix X and the corresponding scalar outputs y are the coefficients of the vector y. A general model for such linear regression with noise-contaminated input and output data is:

$$y = \sum_{m=1}^{d} w_{zm} t_m + \epsilon_y, \qquad x_m = w_{xm} t_m + \epsilon_{xm} \tag{1}$$

where d is the number of input dimensions, t is noiseless input data composed of t_m components, w_z and w_x are regression vectors composed of w_zm and w_xm components respectively, and ε_y and ε_x are additive mean-zero noise. Only X and y are observable.
Note that if the input data is noiseless (that is, x_m = w_xm t_m) and if we set w_xm = 1, then we obtain the familiar linear regression equation y = w_z^T x + ε_y for noiseless input data and noisy output data. The slightly more general formulation above will be useful in preparing our novel estimation algorithm.

The OLS estimate of the regression vector is β_OLS = (X^T X)^{-1} X^T y, where β_OLS is composed of the parameters w_z and w_x, as discussed in the following paragraph. The first major issue with OLS regression in high dimensional spaces is that the full rank assumption on X^T X is often violated due to underconstrained datasets. For more than 500 input dimensions, the matrix inversion required in OLS also becomes rather expensive. Ridge regression can fix the problem of ill-conditioned matrices by introducing an uncontrolled amount of bias. There also exist alternative methods to invert the matrix more efficiently [11], [12], for instance through singular value decomposition factorization (e.g., using the Matlab pinv() function). Nevertheless, all these methods are unable to model noise in input data and require the manual tuning of meta parameters, which can strongly influence the quality of the estimation results. Moreover, in the case of RBD parameter estimation, there is no mechanism that ensures the parameter estimates are physically consistent.

If we examine Eq. (1), we see that if the input data is noiseless (that is, x_m = w_xm t_m), the true regression vector β_true will be composed of the coefficients w_zm/w_xm. This is exactly what the OLS estimate of the regression vector will be for noiseless input data. However, when the input data is contaminated with noise, it can be shown that the OLS estimate will be β_OLS,noise = γβ_true, where 0 < γ < 1 and the exact value of γ depends on the amount of input noise. Thus, OLS regression underestimates the true regression vector β_true. For the application of RBD parameter estimation, we obtain a persistent bias in the model identification such that the model is inaccurate, and this is a problem that cannot be fixed by simply adding more data.
Intentionally, the input/output noise model formulation in Eq. (1) was chosen such that it coincides with a version of a well-known algorithm in the statistical learning community called Factor Analysis [13].
Fig. 1. Graphical Models for Noisy Linear Regression: (a) Joint Factor Analysis (JFA); (b) a modified model of JFA for efficient estimation; (c) a Bayesian version of JFA. Random variables are in circular nodes, observed random variables are in double circles and point estimated parameters are in square nodes. d is the total number of input dimensions while N is the total number of samples in the dataset.

The intuition of this model is given in Figure 1(a): every observed input x_im and output y_i is assumed to be generated by a set of hidden variables t_im and contaminated with some noise, exactly as given in Eq. (1). The graphical model in Figure 1(a) compactly describes the full multi-dimensional system: the variables x_im, t_im, w_xm and w_zm are duplicated d times for the d input dimensions of the data, represented by the inner plate containing four nodes. The outer plate shows that there are N samples of observed {x, y} data. The goal of learning is to find the parameters w_xm and w_zm, which can only be achieved by estimating the hidden variables t_im and all noise processes as well. For technical reasons, it needs to be assumed that all t_im follow a normal distribution with mean zero and unit variance, i.e., t_im ∼ Normal(0, 1), such that all parameters of the model are well-constrained. The specific version of factor analysis for regression depicted in Figure 1(a) is called joint-space Factor Analysis or Joint Factor Analysis (JFA), as both input and output variables are actually treated the same in the estimation process, i.e., only their joint distribution matters. While Joint Factor Analysis is well suited for modeling regression models with input noise, it does not handle ill-conditioned data very well and is computationally very expensive for high dimensions.

In the following section, we will develop a Bayesian treatment of Joint Factor Analysis that is robust to ill-conditioned data, automatically detects non-identifiable parameters and, due to some post-processing, produces physically consistent parameters for RBD, all in a computationally efficient way.

III. BAYESIAN PARAMETER ESTIMATION OF NOISY LINEAR REGRESSION

Figure 1 demonstrates the successive modifications of the graphical model needed to derive a Bayesian version of Joint Factor Analysis regression. As a first step, we introduce the hidden variables z_im such that z_im = w_zm t_im. This trick was introduced by [14] and allows us to avoid any form of matrix inversion in the resulting learning algorithm. With this modification, the noisy linear regression model in Eq. (1) becomes:

$$y_i = \sum_{m=1}^{d} z_{im} + \epsilon_y \tag{2}$$

Due to the hidden variables z_im and t_im, we formulate the estimation of all open parameters as a maximum likelihood estimation problem using the Expectation-Maximization (EM) algorithm [15]. For this purpose, the following standard assumptions about the probability distributions of the random variables are made:

$$y_i \sim \text{Normal}(\mathbf{1}^T \mathbf{z}_i, \psi_y), \quad z_{im} \sim \text{Normal}(w_{zm} t_{im}, \psi_{zm}), \quad x_{im} \sim \text{Normal}(w_{xm} t_{im}, \psi_{xm}), \quad t_{im} \sim \text{Normal}(0, 1)$$

where 1 = [1, 1, ..., 1]^T, z_i is a d by 1 vector, w_z is a d by 1 vector composed of w_zm elements, and w_x, ψ_z and ψ_x are similarly composed of w_xm, ψ_zm and ψ_xm elements, respectively. As Figure 1(b) shows, the regression coefficients w_zm are now behind the fan-in to the output y_i. This decouples the input dimensions and generates a learning algorithm that operates with O(d) computational complexity, instead of O(d³) as in traditional Joint Factor Analysis.

The efficient maximum likelihood formulation of Joint Factor Analysis is, however, still vulnerable to ill-conditioned data. Thus, we introduce a Bayesian layer on top of this model by treating the regression parameters w_z and w_x probabilistically to protect against overfitting, as shown in Figure 1(c). To do this, we introduce so-called "precision" variables α_m over each regression parameter w_zm. The same α_m is also used for each w_xm. As a result, the regression parameters are now distributed as follows: w_zm ∼ Normal(0, 1/α_m) and w_xm ∼ Normal(0, 1/α_m), where α_m takes on a Gamma distribution.

The rationale of this Bayesian modeling technique is as follows. The key quantity that determines the relevance of a regression input is the parameter α_m. A priori, we assume that every w_zm has a mean zero distribution with variance 1/α_m. If the value of an α_m turns out to be very large after all model parameters are estimated, then the corresponding posterior distribution of w_zm must be sharply peaked at zero, thus giving strong evidence that w_zm = 0 and that the input t_m contributes no information to the regression model. If an input t_m contributes no information to the output, then it is also irrelevant how much it contributes to x_im. Hence, the corresponding inputs x_m can be treated as pure noise. Coupling both w_zm and w_xm with the same precision variable α_m achieves this effect. In this way, the Bayesian approach automatically detects irrelevant input dimensions and regularizes against ill-conditioned datasets.
Even with the Bayesian layer added, the entire regression problem can be treated as an EM-like learning problem [16]. Given the data D = {x_i, y_i}_{i=1}^N, our goal is to maximize the log likelihood log p(y|X), which is often called an "incomplete" log likelihood as all hidden probabilistic variables are marginalized out. However, due to analytical problems, we do not have access to this incomplete likelihood, but rather only to a lower bound of it. This lower bound is based on an expected value of the so-called "complete" data likelihood ⟨log p(y, Z, T, w_z, w_x, α | X)⟩, formulated over all variables of the learning problem, where:

$$\begin{aligned}
\log p(\mathbf{y}, \mathbf{Z}, \mathbf{T}, \mathbf{w}_z, \mathbf{w}_x, \boldsymbol{\alpha} \,|\, \mathbf{X})
&= \sum_{i=1}^{N} \log p(y_i|\mathbf{z}_i) + \sum_{i=1}^{N}\sum_{m=1}^{d} \log p(z_{im}|w_{zm}, t_{im}) \\
&\quad + \sum_{i=1}^{N}\sum_{m=1}^{d} \log p(x_{im}|w_{xm}, t_{im}) + \sum_{i=1}^{N}\sum_{m=1}^{d} \log p(t_{im}) \\
&\quad + \sum_{m=1}^{d} \log \{ p(w_{zm}|\alpha_m) p(\alpha_m) \} + \sum_{m=1}^{d} \log \{ p(w_{xm}|\alpha_m) p(\alpha_m) \}
\end{aligned}$$

The expectation of this complete data likelihood should be taken with respect to the true posterior distribution of all hidden variables Q(α, w_z, w_x, Z, T). Unfortunately, this is an analytically intractable expression. Instead, a lower bound can be formulated using a technique from variational calculus, where we make a factorial approximation of the true posterior in terms of Q(α, w_z, w_x, Z, T) = Q(α)Q(w_z)Q(w_x)Q(Z, T). While losing a small amount of accuracy, all resulting posterior distributions over hidden variables become analytically tractable and have the following distributions:

$$y_i|\mathbf{z}_i \sim \text{Normal}(\mathbf{1}^T \mathbf{z}_i, \psi_y), \quad z_{im}|w_{zm} \sim \text{Normal}(w_{zm} t_{im}, \psi_{zm}), \quad w_{zm}|\alpha_m \sim \text{Normal}(0, 1/\alpha_m),$$
$$w_{xm}|\alpha_m \sim \text{Normal}(0, 1/\alpha_m), \quad \alpha_m \sim \text{Gamma}(a_{\alpha_m}, b_{\alpha_m})$$

As a final result, we now have a mechanism that infers the significance of each dimension's contribution to the observed output y and observed inputs x. The resulting EM update equations are listed in the Appendix. The final regression solution regularizes over the number of retained inputs in the regression vector, performing a functionality similar to Automatic Relevance Determination (ARD) [17]. Notably, the EM updates have a computational complexity of O(d) per EM iteration, where d is the number of input dimensions, instead of the O(d³) of Joint Factor Analysis that arises due to a matrix inversion. The final result is an efficient Bayesian algorithm that is robust to high dimensional, ill-conditioned, noisy data.

A. Inference of Regression Solution

Estimating the rather complex probabilistic Bayesian model for Joint Factor Analysis reveals distributions and mean values for all hidden variables. One additional step, however, is required to infer the final regression parameters, which, in our application, are the RBD parameters. For this purpose, we consider the predictive distribution p(y_q|x_q) for a new noisy test input x_q and its unknown output y_q. We can calculate ⟨y_q|x_q⟩, the mean of the distribution associated with p(y_q|x_q), by conditioning y_q on x_q and marginalizing out all hidden variables to obtain:

$$p(y_q|\mathbf{x}_q, \mathbf{X}, \mathbf{Y}) = \iint p(y_q, \mathbf{Z}, \mathbf{T}|\mathbf{x}_q, \mathbf{X}, \mathbf{Y}) \, d\mathbf{Z} \, d\mathbf{T}$$

where X and Y are the data used for training. We can infer the value of the regression estimate b̂, since ⟨y_q|x_q⟩ = b̂^T x_q. Since an analytical solution for the integral above is only possible for the probabilistic Joint Factor Analysis model in Figure 1(b) and not the full Bayesian treatment, we restrict our computations to the simpler probabilistic model, assuming that the results will hold in approximation for the Bayesian model. The resulting regression estimate, given noisy inputs x_q and noisy outputs y_q, is:

$$\hat{\mathbf{b}}_{noise} = \frac{\psi_y \mathbf{1}^T \mathbf{B}^{-1}}{\psi_y - \mathbf{1}^T \mathbf{B}^{-1} \mathbf{1}} \, \boldsymbol{\Psi}_z^{-1} \langle\mathbf{W}_z\rangle \mathbf{A}^{-1} \langle\mathbf{W}_x\rangle^T \boldsymbol{\Psi}_x^{-1} \tag{3}$$

where Ψ_x is a diagonal matrix with the vector ψ_x on its diagonal (⟨W_x⟩, ⟨W_z⟩ and Ψ_z are similarly defined diagonal matrices with the vectors ⟨w_x⟩, ⟨w_z⟩ and ψ_z on their diagonals, respectively), A = I + ⟨W_x^T W_x⟩Ψ_x^{-1} + ⟨W_z^T W_z⟩Ψ_z^{-1} and B = 11^T/ψ_y + Ψ_z^{-1} − Ψ_z^{-1}⟨W_z⟩A^{-1}⟨W_z⟩^T Ψ_z^{-1}. Note that Eq. (3) is similar in form to the regression estimate derived for Joint Factor Analysis regression (which can be found in the Appendix). The major difference is that Eq. (3) contains an additional term ⟨W_z^T W_z⟩Ψ_z^{-1}, due to the introduction of the hidden variables z. The regression estimate is scaled by an additional term as well.

It is important to note that the regression vector given by Eq. (3) is for optimal prediction from noisy input data. However, for system identification in RBD, we are interested in obtaining the true regression vector, i.e., the regression vector that predicts output from noiseless inputs. Thus, the result in Eq. (3) is not quite suitable, and what we want to calculate is the mean of p(y_q|t_q), where t_q are noiseless inputs. To address this issue, we can take the asymptotic estimate of Eq. (3) by letting ψ_x → 0 and interpret the resulting expression as the true regression vector for noiseless inputs (as ψ_x → 0, the amount of input noise approaches 0). The resulting regression vector estimate becomes:

$$\hat{\mathbf{b}}_{true} = \frac{\psi_y \mathbf{1}^T \mathbf{C}^{-1}}{\psi_y - \mathbf{1}^T \mathbf{C}^{-1} \mathbf{1}} \, \boldsymbol{\Psi}_z^{-1} \langle\mathbf{W}_z\rangle \langle\mathbf{W}_x\rangle^{-1} \tag{4}$$

where C = 11^T/ψ_y + Ψ_z^{-1}. This is the desired regression vector estimate.
B. Post-processing for Physically Consistent Rigid Body Parameters

Given a Bayesian estimate of the RBD parameters, we would like to ensure that the inferred regression vector satisfies the constraints given by positive definite inertia matrices and the parallel axis theorem. These constraints are shown in Eq. (5) for one DOF. There are 11 parameters for each DOF, which we arrange in an 11-dimensional vector θ consisting of the following 10 parameters: mass, three center of mass coefficients multiplied by the mass, and six inertial parameters (cf. [3]). Additionally, we include viscous friction as the 11th parameter θ_11. Now, this parameter vector is assumed to be generated through a nonlinear transformation from an 11-dimensional virtual parameter vector θ̂. In essence, these virtual parameters θ̂ correspond to the square root of the mass, the true center-of-mass coordinates (i.e., not multiplied by the mass), the six inertial parameters describing the inertia matrix at the DOF's center of gravity, and the square root of the viscous friction coefficient. The following functions show the relationship between the virtual parameters θ̂ and the actual parameters θ:

$$\begin{aligned}
\theta_1 &= \hat{\theta}_1^2, \quad \theta_2 = \hat{\theta}_2\hat{\theta}_1^2, \quad \theta_3 = \hat{\theta}_3\hat{\theta}_1^2, \quad \theta_4 = \hat{\theta}_4\hat{\theta}_1^2, \quad \theta_{11} = \hat{\theta}_{11}^2 \\
\theta_5 &= \hat{\theta}_5^2 + \left(\hat{\theta}_4^2 + \hat{\theta}_3^2\right)\hat{\theta}_1^2 \\
\theta_6 &= \hat{\theta}_5\hat{\theta}_6 - \hat{\theta}_2\hat{\theta}_3\hat{\theta}_1^2, \quad \theta_7 = \hat{\theta}_5\hat{\theta}_7 - \hat{\theta}_2\hat{\theta}_4\hat{\theta}_1^2 \\
\theta_8 &= \hat{\theta}_6^2 + \hat{\theta}_8^2 + \left(\hat{\theta}_2^2 + \hat{\theta}_4^2\right)\hat{\theta}_1^2 \\
\theta_9 &= \hat{\theta}_6\hat{\theta}_7 + \hat{\theta}_8\hat{\theta}_9 - \hat{\theta}_3\hat{\theta}_4\hat{\theta}_1^2 \\
\theta_{10} &= \hat{\theta}_7^2 + \hat{\theta}_9^2 + \hat{\theta}_{10}^2 + \left(\hat{\theta}_2^2 + \hat{\theta}_3^2\right)\hat{\theta}_1^2
\end{aligned} \tag{5}$$

These functions encode the parallel axis theorem and some additional constraints, essentially ensuring that mass and viscous friction coefficients remain strictly positive. Given the above formulation, any arbitrary set of virtual parameters gives rise to a physically consistent set of actual parameters for the RBD problem. For a robotic system with s DOFs, Eq. (5) is repeated for each DOF. The result is a regression vector θ, where θ_m = f_m(θ̂) for m = 1..d, with d = 11s and s the number of DOFs in the system.

Our Bayesian parameter estimation method (as well as any other traditional RBD parameter estimation method) generates the parameter vector θ, not the virtual parameters θ̂. We want to find the optimal parameters θ_opt,RBD that satisfy the physical RBD constraints while minimizing the least squared error in the prediction. That is to say, we want to minimize:

$$\frac{1}{2}\left\langle \left(\mathbf{y} - \mathbf{X}\boldsymbol{\theta}_{opt,RBD}\right)^T \left(\mathbf{y} - \mathbf{X}\boldsymbol{\theta}_{opt,RBD}\right) \right\rangle \tag{6}$$

We can re-express θ_opt,RBD as (θ_Bayes − ∆θ). Eq. (6) can then be simplified in order to obtain:

$$\begin{aligned}
\frac{1}{2}\left\langle \left(\mathbf{y} - \mathbf{X}\boldsymbol{\theta}_{opt,RBD}\right)^T \left(\mathbf{y} - \mathbf{X}\boldsymbol{\theta}_{opt,RBD}\right) \right\rangle
&= \frac{1}{2}\left\langle \left(\mathbf{y} - \mathbf{X}\boldsymbol{\theta}_{Bayes}\right)^T \left(\mathbf{y} - \mathbf{X}\boldsymbol{\theta}_{Bayes}\right) \right\rangle \\
&\quad + \left\langle \left(\mathbf{y} - \mathbf{X}\boldsymbol{\theta}_{Bayes}\right)^T \mathbf{X}\Delta\boldsymbol{\theta} \right\rangle + \frac{1}{2}\left\langle \Delta\boldsymbol{\theta}^T \mathbf{X}^T \mathbf{X} \Delta\boldsymbol{\theta} \right\rangle
\end{aligned} \tag{7}$$

Notice that the minimization of the first term ⟨(y − Xθ_Bayes)^T(y − Xθ_Bayes)⟩ is what our Bayesian EM-based algorithm does, since it finds the unconstrained parameter estimates that minimize the least squares error using the noisy input and output data. The second term in Eq. (7) is equal to 0 (refer to the Appendix for details). Now, if we re-express the noisy inputs X as X_t + Γ, where X_t are noiseless inputs and Γ is the input noise, then we can re-write the third term in Eq. (7) as:

$$\frac{1}{2}\left\langle \Delta\boldsymbol{\theta}^T \mathbf{X}^T \mathbf{X} \Delta\boldsymbol{\theta} \right\rangle
= \frac{1}{2}\Delta\boldsymbol{\theta}^T \mathbf{X}_t^T \mathbf{X}_t \Delta\boldsymbol{\theta} + \Delta\boldsymbol{\theta}^T \mathbf{X}_t^T \langle\boldsymbol{\Gamma}\rangle \Delta\boldsymbol{\theta} + \frac{1}{2}\Delta\boldsymbol{\theta}^T \left\langle\boldsymbol{\Gamma}^T\boldsymbol{\Gamma}\right\rangle \Delta\boldsymbol{\theta}
= \frac{1}{2}\Delta\boldsymbol{\theta}^T \left( \mathbf{X}_t^T \mathbf{X}_t + \left\langle\boldsymbol{\Gamma}^T\boldsymbol{\Gamma}\right\rangle \right) \Delta\boldsymbol{\theta} \tag{8}$$

since the input noise is modeled with a Gaussian distribution with a mean of zero and some variance Ψ_x, so that ⟨Γ⟩ = 0. Hence, as the above equation shows, to minimize the third term in Eq. (7), we need to find the parameter estimate θ_opt,RBD that is as close to θ_Bayes as possible, under the metric X_t^T X_t (plus noise variance). In summary, in order to minimize the least squared error in Eq. (6), we can do so in two independent minimization steps. First, we apply our Bayesian algorithm (or any other algorithm, for that matter) to come up with an optimal unconstrained parameter estimate. Then, we find the physically consistent parameter estimates θ_opt,RBD such that the error between θ_opt,RBD and the optimal unconstrained parameter estimates is minimized in the sense of Eq. (8).
The Bayesian de-noising algorithm had an improvement of 0.2 10 to 20% compared to other algorithms for strongly noisy 0.15 input data and an improvement of 7 to 50% for less noisy input data, as the black bars in Figures 2(a) and 2(b) illustrate.

nMSE 0.1 One interesting observation is that for the case where the 90 input dimensions are all irrelevant, the Bayesian de-noising 0.05 algorithm does not give such a significant reduction in error as in the other 3 scenarios. This can be explained by the 0 r=90, u=0 r=0, u=90 r=30, u=60 r=60, u=30 fact that the other algorithms cannot handle data containing redundancy, and that the true power of our algorithm lies in (a) Training & Test NMSE for SNR = 2, SNR = 5 x y its ability to identify relevant dimensions in the presence of 0.25 redundant and irrelevant data. 0.2 B. Robotic Oculomotor Vision Head

We compared our Bayesian de-noising algorithm with the following methods: OLS regression; stepwise regression [18], which tends to be inconsistent in the presence of collinear inputs [19]; Partial Least Squares regression (PLS) [20], a slightly heuristic but empirically successful regression method for high dimensional data; LASSO regression [21], which gives sparse solutions by shrinking certain coefficients to 0 under the control of a manually set tuning parameter; our probabilistic treatment of Joint Factor Analysis in Figure 1(b); and our Bayesian de-noising algorithm shown in Figure 1(c). The Bayesian de-noising algorithm had an improvement of 10 to 20% compared to the other algorithms for strongly noisy input data and an improvement of 7 to 50% for less noisy input data, as the black bars in Figures 2(a) and 2(b) illustrate.

Fig. 2. Average normalized mean squared errors (nMSE) on noiseless (clean) test data and noisy test data for a 100 dimensional dataset with 10 relevant input dimensions and various combinations of redundant input dimensions r and irrelevant input dimensions u (r = 90, u = 0; r = 0, u = 90; r = 30, u = 60; r = 60, u = 30), averaged over 10 trials, for different levels of noisy data. Algorithms shown: OLS, STEP, PLS, LASSO, JFA, BAYES. (a) Training and test nMSE for SNR_x = 2, SNR_y = 5. (b) Training and test nMSE for SNR_x = 5, SNR_y = 5.

One interesting observation is that for the case where the 90 input dimensions are all irrelevant, the Bayesian de-noising algorithm does not give as significant a reduction in error as in the other 3 scenarios. This can be explained by the fact that the other algorithms cannot handle data containing redundancy, and that the true power of our algorithm lies in its ability to identify relevant dimensions in the presence of redundant and irrelevant data.

B. Robotic Oculomotor Vision Head

Next, we move on to a 7 DOF robotic vision head manufactured by Sarcos, as shown in Figure 3, possessing 3 DOFs in the neck and 2 DOFs for each eye. With 11 features per DOF, this gives a total of 77 features. This kinematic structure of robotic systems always creates non-identifiable parameters and thus redundancies [3]. The robot is controlled at 420 Hz with a VxWorks real-time operating system running out of a VME bus. We collected about 500,000 data points from the robotic system while it performed sinusoidal movements with varying frequencies and phase offsets in all DOFs.

Fig. 3. Sarcos Oculomotor Vision Head

We compared our Bayesian algorithm with 3 other techniques for parameter estimation on the robot data. The first technique consisted of ridge regression using a hand-tuned regularization parameter, with nonlinear gradient descent performed on the virtual parameters of the system. The second algorithm was a version of LASSO regression that had the additional step of projecting the resulting parameter values onto the constraint space in order to produce physically consistent RBD parameters. Finally, the last algorithm was a version of stepwise regression with the additional projection step. All four algorithms produced physically consistent RBD parameters. Note that the other algorithms used on the synthetic dataset, such as PLS and JFA, were not applied, since they fail to explicitly eliminate irrelevant input features and do not perform any form of reasonable parameter identification.

For evaluation, we implemented a computed torque control law on the robot, using the estimated parameters from each technique. Results are quantified as the root mean squared errors in position tracking, velocity tracking and the root mean squared feedback command. Table I shows these results averaged over all 7 DOFs.
The Bayesian parameter estimation approach performed around 10 to 20% better than the ridge regression with gradient descent approach, thus validating the effectiveness of our methods. LASSO regression performed worse than ridge regression with gradient descent. As well, stepwise regression produced RBD parameters that were so physically off that they were impossible to run on the robotic head. This can be explained by stepwise regression's failure to identify the relevant features in the dataset, resulting in RBD parameter values that were plain wrong.

TABLE I. Root mean squared errors for position (in radians), velocity (in radians/sec) and feedback command (in Newton-meters) for the Sarcos robotic vision head. Algorithms evaluated include ridge regression with nonlinear gradient descent, our Bayesian de-noising algorithm, LASSO regression with the projection step, and stepwise regression with the projection step. Standard deviations are negligible and thus omitted.

                         Position (radians)   Velocity (radians/sec)   Feedback Command (Newton-meter)
    Ridge Regression          0.0291                 0.2465                      0.3969
    Bayesian De-noising       0.0243                 0.2189                      0.3292
    LASSO regression          0.0308                 0.2517                      0.4272
    Stepwise regression       FAILURE                FAILURE                     FAILURE

C. Robotic Anthropomorphic Arm

We also evaluated the parameter estimation algorithms on a 10 DOF robotic anthropomorphic arm made by Sarcos, shown in Fig. 4. With 3 DOFs in the shoulder, 1 DOF in the elbow, 3 DOFs in the wrist and 3 DOFs in the fingers, this gives a total of 110 features. We collected about a million data points from the robotic arm over a period of 40 minutes, gathering data at a rate of 480 samples per second.

Fig. 4. Sarcos Anthropomorphic Arm
During this time period, the arm performed sinusoidal movements with varying frequencies and phase offsets in all DOFs. We downsampled the collected data to a more manageable size of 500,000 samples and evaluated the algorithms in a similar manner as for the robotic vision head. Table II displays the results averaged over all 10 DOFs. The Bayesian parameter estimation approach performed around 5 to 17% better than the other techniques. LASSO regression failed, due to its over-aggressive clipping of relevant dimensions, and stepwise regression produced RBD parameters that were impossible to run on the robotic arm.

TABLE II. Root mean squared errors for position (in radians), velocity (in radians/sec) and feedback command (in Newton-meters) for the Sarcos robotic anthropomorphic arm. Algorithms evaluated include ridge regression with nonlinear gradient descent, our Bayesian de-noising algorithm, LASSO regression with the projection step, and stepwise regression with the projection step. Standard deviations are negligible and thus omitted.

                         Position (radians)   Velocity (radians/sec)   Feedback Command (Newton-meter)
    Ridge Regression          0.0210                 0.1119                      0.5839
    Bayesian De-noising       0.0201                 0.0930                      0.5297
    LASSO regression          FAILURE                FAILURE                     FAILURE
    Stepwise regression       FAILURE                FAILURE                     FAILURE

V. CONCLUSION

We derived a Bayesian version of rigid body dynamics parameter estimation based on Joint Factor Analysis, a classical machine learning technique. The Bayesian parameter estimation algorithm is robust to high dimensional, ill-conditioned data contaminated with noisy input and noisy output data. In order to produce physically coherent rigid body parameters of the robotic system, we additionally introduce a post-processing step that takes the Bayesian estimate and projects the solution onto a set of constraints derived from the parallel axis theorem and from the need to ensure positive values for certain parameters. We demonstrate the efficiency of the algorithm by applying it to a synthetic dataset, a 7 DOF robotic vision head and a 10 DOF robotic anthropomorphic arm. Our algorithm identified the system parameters with 5 to 20% higher accuracy than alternative methods, thus proving to be a competitive alternative for parameter estimation on complex high degree-of-freedom robotic systems.

ACKNOWLEDGMENTS

This research was supported in part by National Science Foundation grants ECS-0325383, IIS-0312802, IIS-0082995, ECS-0326095, ANI-0224419, a NASA grant AC#98-516, an AFOSR grant on Intelligent Control, the ERATO Kawato Dynamic Brain Project funded by the Japanese Science and Technology Agency, and the ATR Computational Neuroscience Laboratories.

REFERENCES

[1] L. Sciavicco and B. Siciliano. Modeling and control of robot manipulators. McGraw-Hill, 1996.
[2] J. Nakanishi, R. Cory, M. Mistry, J. Peters, and S. Schaal. Comparative experiments on task space control with redundancy resolution. In Proceedings of the 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1575–1582. IEEE, 2005.
[3] C. H. An, C. G. Atkeson, and J. M. Hollerbach. Model-based control of a robot manipulator. MIT Press, 1988.
[4] G. H. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, 198.
[5] S. Van Huffel and J. Vanderwalle. The total least squares problem: Computational aspects and analysis. Society for Industrial and Applied Mathematics, 1991.
[6] J. M. Hollerbach and C. W. Wampler. The calibration index and the role of input noise in robot calibration. In G. Giralt and G. Hirzinger, editors, Robotics Research: The Seventh International Symposium, pages 558–568. Springer, 1996.
[7] Y. N. Rao and J. C. Principe. Efficient total least squares method for system modeling using minor component analysis. In Proceedings of the International Workshop on Neural Networks for Signal Processing, pages 259–268. IEEE, 2002.
[8] S. C. Douglas. Analysis of an anti-Hebbian adaptive FIR filtering algorithm. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 43(11), 1996.
[9] D. A. Belsley, E. Kuh, and R. E. Welsch. Regression diagnostics: Identifying influential data and sources of collinearity. Wiley, 1980.
[10] Y. N. Rao, D. Erdogmus, G. Y. Rao, and J. C. Principe. Fast error whitening algorithms for system identification and control. In Proceedings of the International Workshop on Neural Networks for Signal Processing, pages 309–318. IEEE, 2003.
[11] V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13:354–356, 1969.
[12] T. J. Hastie and R. J. Tibshirani. Generalized additive models. Number 43 in Monographs on Statistics and Applied Probability. Chapman and Hall, 1990.
[13] W. F. Massey. Principal component regression in exploratory statistical research. Journal of the American Statistical Association, 60:234–246, 1965.
[14] A. D'Souza, S. Vijayakumar, and S. Schaal. The Bayesian backfitting relevance vector machine. In Proceedings of the 21st International Conference on Machine Learning. ACM Press, 2004.
[15] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
[16] Z. Ghahramani and M. J. Beal. Graphical models and variational methods. In D. Saad and M. Opper, editors, Advanced Mean Field Methods - Theory and Practice. MIT Press, 2000.
[17] R. M. Neal. Bayesian learning for neural networks. PhD thesis, Dept. of Computer Science, University of Toronto, 1994.
[18] N. R. Draper and H. Smith. Applied regression analysis. Wiley, 1981.
[19] S. Derksen and H. J. Keselman. Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45:265–282, 1992.
[20] H. Wold. Soft modeling by latent variables: The nonlinear iterative partial least squares approach. In J. Gani, editor, Perspectives in probability and statistics, papers in honor of M. S. Bartlett. Academic Press, 1975.
[21] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

APPENDIX

A. EM-update Equations

We can derive the following EM updates using standard manipulations of normal distributions:

M-step:

$$\psi_y = \frac{1}{N}\sum_{i=1}^{N}\left( y_i^2 - 2\mathbf{1}^T\langle\mathbf{z}_i\rangle y_i + \mathbf{1}^T\left\langle\mathbf{z}_i\mathbf{z}_i^T\right\rangle\mathbf{1} \right)$$
$$\psi_{zm} = \frac{1}{N}\sum_{i=1}^{N}\left( \left\langle z_{im}^2\right\rangle - 2\langle w_{zm}\rangle\langle z_{im}t_{im}\rangle + \left\langle w_{zm}^2\right\rangle\left\langle t_{im}^2\right\rangle \right)$$
$$\psi_{xm} = \frac{1}{N}\sum_{i=1}^{N}\left( x_{im}^2 - 2\langle w_{xm}\rangle\langle t_{im}\rangle x_{im} + \left\langle w_{xm}^2\right\rangle\left\langle t_{im}^2\right\rangle \right)$$

E-step:

$$\sigma_{w_{zm}}^2 = \frac{1}{\frac{1}{\psi_{zm}}\sum_{i=1}^{N}\left\langle t_{im}^2\right\rangle + \langle\alpha_m\rangle}, \qquad \langle w_{zm}\rangle = \frac{\sigma_{w_{zm}}^2}{\psi_{zm}}\sum_{i=1}^{N}\langle z_{im}t_{im}\rangle$$
$$\sigma_{w_{xm}}^2 = \frac{1}{\frac{1}{\psi_{xm}}\sum_{i=1}^{N}\left\langle t_{im}^2\right\rangle + \langle\alpha_m\rangle}, \qquad \langle w_{xm}\rangle = \frac{\sigma_{w_{xm}}^2}{\psi_{xm}}\sum_{i=1}^{N} x_{im}\langle t_{im}\rangle$$
$$\hat{a}_{\alpha_m} = a_{\alpha_{m0}} + 1, \qquad \hat{b}_{\alpha_m} = b_{\alpha_{m0}} + \frac{\left\langle w_{zm}^2\right\rangle + \left\langle w_{xm}^2\right\rangle}{2}$$

The covariance matrix Σ of the joint posterior distribution of Z and T is $\begin{bmatrix} \Sigma_{zz} & \Sigma_{zt} \\ \Sigma_{tz} & \Sigma_{tt} \end{bmatrix}$, where:

$$\Sigma_{zz} = \mathbf{M} - \frac{\mathbf{M}\mathbf{1}\mathbf{1}^T\mathbf{M}}{\psi_y + \mathbf{1}^T\mathbf{M}\mathbf{1}}, \qquad \Sigma_{zt} = -\Sigma_{zz}\langle\mathbf{W}_z\rangle\boldsymbol{\Psi}_z^{-1}\mathbf{K}^{-1}, \qquad \Sigma_{tz} = \Sigma_{zt}^T$$
$$\Sigma_{tt} = \mathbf{K}^{-1} + \mathbf{K}^{-1}\langle\mathbf{W}_z\rangle^T\boldsymbol{\Psi}_z^{-1}\Sigma_{zz}\boldsymbol{\Psi}_z^{-1}\langle\mathbf{W}_z\rangle\mathbf{K}^{-1}$$
$$\mathbf{K} = \mathbf{I} + \left\langle\mathbf{W}_x^T\mathbf{W}_x\right\rangle\boldsymbol{\Psi}_x^{-1} + \left\langle\mathbf{W}_z^T\mathbf{W}_z\right\rangle\boldsymbol{\Psi}_z^{-1}$$
$$\mathbf{M} = \left( \boldsymbol{\Psi}_z^{-1} + \langle\mathbf{W}_z\rangle\left( \mathbf{I} + \left\langle\mathbf{W}_x^T\mathbf{W}_x\right\rangle\boldsymbol{\Psi}_x^{-1} + (\Sigma_{W_z})_{mm}\boldsymbol{\Psi}_z^{-1} \right)^{-1}\boldsymbol{\Psi}_z^{-1}\langle\mathbf{W}_z\rangle^T \right)^{-1}$$

and where ⟨W_x⟩ is a diagonal d by d matrix with ⟨w_x⟩ along its diagonal. Similarly, ⟨W_z⟩, Ψ_x and Ψ_z are d by d diagonal matrices with diagonal vectors ⟨w_z⟩, ψ_x and ψ_z. The E-step updates for Z and T are then:

$$\langle\mathbf{z}_i\rangle = \frac{y_i}{\psi_y}\mathbf{1}^T\Sigma_{zz} + \mathbf{x}_i^T\langle\mathbf{W}_x\rangle\boldsymbol{\Psi}_x^{-1}\Sigma_{tz}$$
$$\langle\mathbf{t}_i\rangle = \frac{y_i}{\psi_y}\mathbf{1}^T\Sigma_{zz}\langle\mathbf{W}_z\rangle\boldsymbol{\Psi}_z^{-1}\mathbf{K}^{-1} + \mathbf{x}_i^T\langle\mathbf{W}_x\rangle\boldsymbol{\Psi}_x^{-1}\Sigma_{tt}$$
$$\sigma_z^2 = \text{diag}(\Sigma_{zz}), \qquad \sigma_t^2 = \text{diag}(\Sigma_{tt}), \qquad \text{cov}(\mathbf{z}, \mathbf{t}) = \text{diag}(\Sigma_{zt})$$

B. Inference of Regression Estimate for Joint Factor Analysis Regression

The regression estimate for Joint-Space Factor Analysis regression is as follows:

$$\mathbf{b}_{JFA} = \langle\mathbf{W}_z\rangle\left( \mathbf{I} + \left\langle\mathbf{W}_x^T\mathbf{W}_x\right\rangle\boldsymbol{\Psi}_x^{-1} \right)^{-1}\langle\mathbf{W}_x\rangle^T\boldsymbol{\Psi}_x^{-1} \tag{9}$$

C. Minimizing the Least Squared Error

We want to minimize the cost function J = ½(y − Xθ)^T(y − Xθ). Differentiating J with respect to θ gives:

$$\frac{\partial J}{\partial\boldsymbol{\theta}} = -\left(\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\right)^T\mathbf{X} \tag{10}$$

By setting ∂J/∂θ to 0 and solving for θ, we obtain the solution to the minimization problem. We can see that at the optimal solution for θ that minimizes J, (y − Xθ)^T X = 0. When we are dealing with noisy instead of noiseless inputs, we can resort to our EM-based de-noising algorithm, which attempts to find a θ that is optimal.