DEGREE PROJECT IN THE FIELD OF TECHNOLOGY ENGINEERING PHYSICS AND THE MAIN FIELD OF STUDY COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2019

Deep Reinforcement Learning for Adaptive Human Robotic Collaboration

JOHAN FREDIN HASLUM

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Computer Science
Date: April 11, 2019
Supervisor: Mårten Björkman
Examiner: Olov Engvall
School of Electrical Engineering and Computer Science


Abstract

Robots are expected to become an increasingly common part of most humans' everyday lives. As the number of robots increases, so will the number of human-robot interactions. For these interactions to be valuable and intuitive, new advanced robotic control policies will be necessary. Current policies often lack flexibility, rely heavily on human expertise and are often programmed for very specific use cases.

A promising alternative is the use of Deep Reinforcement Learning, a family of algorithms that learn by trial and error. Following the recent successes of Reinforcement Learning (RL) in areas previously considered too complex, RL has emerged as a possible method to learn Robotic Control Policies. This thesis explores the possibility of using Deep Reinforcement Learning (DRL) as a method to learn Robotic Control Policies for Human Robotic Collaboration (HRC). Specifically, it evaluates whether DRL algorithms can be used to train a robot to collaboratively balance a ball with a human along a predetermined path on a table.

To evaluate whether this is possible, several experiments are performed in a simulator, where two robots jointly balance a ball: one emulating a human and one relying on the policy from the DRL algorithm. The experiments performed suggest that DRL can be used to enable HRC which performs equivalently to or better than an emulated human performing the task alone. Further, the experiments indicate that the performance of less skilled human collaborators can be improved by cooperating with a DRL-trained robot.

Sammanfattning

Närvaron av robotar förväntas bli en allt vanligare del av de flesta människors vardagsliv. När antalet robotar ökar, så ökar även antalet människa-robot-interaktioner. För att dessa interaktioner ska vara användbara och intuitiva, kommer nya avancerade robotkontrollstrategier att vara nödvändiga. Nuvarande strategier saknar ofta flexibilitet, är mycket beroende av mänsklig kunskap och är ofta programmerade för mycket specifika användningsfall.

Ett lovande alternativ är användningen av Deep Reinforcement Learning, en familj av algoritmer som lär sig genom att testa sig fram, likt en människa. Efter den senaste tidens framgångar inom Reinforcement Learning (RL), vilket applicerats på områden som tidigare ansetts vara för komplexa, har RL nu blivit ett möjligt alternativ till mer etablerade metoder för att lära sig kontrollstrategier för robotar. Denna uppsats undersöker möjligheten att använda Deep Reinforcement Learning (DRL) som metod för att lära sig sådana kontrollstrategier för människa-robot-samarbeten. Specifikt kommer den att utvärdera om DRL-algoritmer kan användas för att träna en robot och en människa att tillsammans balansera en boll längs en förutbestämd bana på ett bord.

För att utvärdera om det är möjligt utförs flera experiment i en simulator, där två robotar gemensamt balanserar en boll: en simulerar en människa och den andra är en robot som kontrolleras med hjälp av DRL-algoritmen. De utförda experimenten tyder på att DRL kan användas för att möjliggöra människa-robot-samarbeten som utförs lika bra eller bättre än en simulerad människa som utför uppgiften ensam. Vidare indikerar experimenten att prestationer med mindre kompetenta mänskliga deltagare kan förbättras genom att samarbeta med en DRL-algoritm-kontrollerad robot.

Contents

1 Introduction
  1.1 Motivation
  1.2 Problem Specification
    1.2.1 Research Questions
    1.2.2 Scope & Delimitation
  1.3 Ethics and Societal Impact

2 Background
  2.1 Reinforcement Learning
    2.1.1 Nature's Way of Learning
    2.1.2 Formulation
    2.1.3 Reinforcement Learning Tools and Ideas
  2.2 Deep Learning in Reinforcement Learning
  2.3 Deep Reinforcement Learning Algorithms
    2.3.1 Deep Q-Learning
    2.3.2 Deep Deterministic Policy Gradient
    2.3.3 A3C
    2.3.4 TRPO
    2.3.5 PPO

3 Related Work
  3.1 Reinforcement Learning for Robotic Control
    3.1.1 End-to-End Visuomotor Policies
    3.1.2 Training in Simulation
    3.1.3 Imitation Learning
    3.1.4 Autoencoder Learning
    3.1.5 Domain Randomization and Large Scale Data Collection
    3.1.6 Prioritized Experience Replay
  3.2 Human-Robot Collaboration

4 Method
  4.1 Problem
  4.2 Implementation
    4.2.1 Physics Engine
    4.2.2 Robot Environment
    4.2.3 Human Movement Simulation
    4.2.4 Reinforcement Learning Algorithm
    4.2.5 Observation Space
    4.2.6 Design choices
  4.3 Experimental Setup
    4.3.1 Common Details
    4.3.2 Collaborative Balancing with Varying Level of Skilled Human Partner
    4.3.3 Balancing with DRL Collaborator
    4.3.4 Balancing with More Information

5 Experiments
  5.1 Results
    5.1.1 Analysis
    5.1.2 Performance of Human Collaborator Acting Alone
    5.1.3 Performance of Robot Collaborator Acting Alone
    5.1.4 Performance of Human-Robot Collaborator
    5.1.5 Performance of Robot-Robot Collaborator
    5.1.6 Performance of Human-Robot Collaborator with more Information

6 Discussion
  6.1 Conclusions
    6.1.1 Research Questions
    6.1.2 Unanswered Questions
    6.1.3 Limitations and Improvements

7 Conclusion

Bibliography

8 Appendix

Chapter 1

Introduction

1.1 Motivation

As robots become increasingly common in society, the number of interactions between humans and robots will most likely increase significantly. The interplay between the two will become a part of everyday life for many people. In order for these interactions to become useful, they need to feel natural for the individuals involved. This not only requires the robot to interact in a way that feels customary for humans in general; the robot also has to adapt on a person-to-person basis.

Human collaboration involves complex organization and communication that result in an outcome greater than the sum of the individual capabilities. This advanced interplay between two or more individuals can often be carried out without much effort and in silence. For example, carrying a table together can easily be done relying only on the haptic feedback felt in the collaborators' hands. This ability to adapt in a way that feels ordinary to humans has not been transferred to robots. Successfully equipping robots with this capability will likely be essential in the future.

More specifically, current research within the area of human-robot collaboration is focused on industry applications. An example of this is the use of human-robot pairs in car manufacturing. By combining the skillfulness of humans and their ability to learn quickly with the cost efficiency and physical strength of robots, efficiency is increased and operational cost reduced [19].


The ability to learn how to best collaborate with humans is an active area of research; although several different approaches have been suggested, few have shown great promise [9]. Specific problems have been solved, such as jointly lifting an object, but the proposed methods rely heavily on human expert knowledge to implement a functioning control system. This is not only true for collaboration tasks, but for all robotic control problems. Since human expertise is costly and can be a scarce resource, the possibility of teaching robots how to interact with their surroundings using laymen, or no human intervention at all, would enable cheaper and more accessible robotic control systems.

The possibility of teaching robots how to behave through other means than human-crafted control policies is one that is researched extensively. One promising such field is Deep Reinforcement Learning (DRL). DRL is a set of self-learning algorithms that rely on interaction and examples, thus requiring no expert knowledge. A lot of work is currently focused on the applicability of DRL to robotic control, and the vision of many researchers is for robots to learn in a similar way as humans: by learning to recognize visual and other sensory inputs and learning how to map these inputs to appropriate actions.

The application of DRL to robotic control has shown promise, although it is still in an early stage of development. The goal of this thesis is to evaluate the applicability of DRL to human-robot collaboration. Further, it also evaluates the importance of different sensory modalities for the performance of the algorithm.

1.2 Problem Specification

1.2.1 Research Questions

The questions that this thesis attempts to explore and answer are the following:

What is a suitable Deep Reinforcement Learning framework for learning adaptive human-robot interactions, such that robots can learn to collaborate with humans in what the human perceives as a natural way?

What impact do the available sensor modalities have on the performance of such a framework?

1.2.2 Scope & Delimitation

The scope of this thesis is to evaluate the applicability of DRL algorithms to human-robot collaboration problems. This is evaluated using a toy problem, which represents the challenges involved in Human Robotic Interactions.

This toy problem is an adaptation of previous experiments used in other research projects involving human-robot and human-human collaboration. The goal is for a human and a robot to collaboratively balance a ball on a table. The ball is to follow a predetermined path. The problem is designed in such a way that only a sub-optimal solution can be achieved without collaboration. An in-depth description of the problem can be found in the method chapter.

The thesis evaluates the applicability of suitable DRL algorithms for creating a control policy that can solve the problem, both with and without human collaborators. Further, the impact that different sensor modalities have on the learning speed and on the performance of the final policy will be investigated.

These experiments are run in simulations and not performed on a physical robot.

1.3 Ethics and Societal Impact

The question of ethics in regard to intelligent robots and artificial intelligence in general is multidimensional. However, the focus is often shifted towards areas that have been depicted in popular culture, movies and books, such as military use of AI. While military applications of AI have direct consequences which pose a number of ethical dilemmas, there are more subtle issues with small but possibly far-reaching consequences. This is especially true in the area of human-robot collaboration, because of the ethical dilemmas which can arise with human involvement. Psychological effects such as emotional attachment and the possible replacement of real human-human interactions can have long-lasting effects that have not yet been studied fully [35]. Further, a move towards a larger robot presence in everyday life can result in physical danger for humans in the proximity of malfunctioning robots.

Beyond the direct ethical implications of human involvement, improved robotic control could on its own contribute to further automation. Automation has historically proven to impact societal development substantially, foremost by shifting the balance of labour markets. This concerns the number of unskilled labourers, as human workers can be replaced by robots, and possibly the resulting global economic effects if off-shore jobs are replaced by local robot labour. This might not only have economic effects, but also environmental ones, as factories and their associated pollution are moved.

Regardless of the problems that might be solved or created by better robotic control algorithms and human-robot collaboration, legislative action needs to be taken to steer the development, such that the improvements help and benefit humans not only now, but also in the future in a sustainable way.

Chapter 2

Background

This chapter covers the theoretical background necessary for understanding the contributions of this thesis. The focus of this chapter is Reinforcement Learning and Deep Reinforcement Learning.

2.1 Reinforcement Learning

Reinforcement Learning is an area of Machine Learning that can be considered both a set of problems and a set of solution methods to these problems. It is concerned with finding the best possible behaviour strategy for an agent interacting with an environment. The underlying idea is that, similarly to how humans and other animals learn by trial and error, so should software agents be able to learn. The ideas and information provided in this sub-chapter are found in Sutton and Barto's book Reinforcement Learning: An Introduction [32] unless stated otherwise.

2.1.1 Nature's Way of Learning

Reinforcement Learning was originally inspired by behavioural psychology. Similarly to how humans are taught that some actions are good and others bad through reward or punishment, this class of algorithms reinforces good actions while discouraging bad ones. For example, a dog can be taught to sit by rewarding it with candy when it succeeds, and a child learns not to touch the stove after burning his or her hand. This trial-and-error approach to learning is simulated by giving a numerical reward as feedback on the performance of an algorithm. Thus, based on the reward signal, a learning algorithm can evaluate and update its parameters based on how good or bad a set of actions were.

Figure 2.1: Reinforcement Learning flow chart. The agent selects an action $a_t$, and the environment returns a new state $s_{t+1}$ and a reward $r_t$.
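The interaction loop in figure 2.1 maps directly onto the step-based interface used by most RL toolkits (the thesis later uses OpenAI Gym). The sketch below is only an illustration of that loop; the `agent.act` and `agent.observe` methods are hypothetical placeholders and not part of any specific library.

```python
def run_episode(env, agent, max_steps=1000):
    """Run one episode of the agent-environment loop sketched in figure 2.1.

    `env` is assumed to follow the classic Gym API (reset/step); `agent`
    is assumed to expose hypothetical act(state) and observe(...) methods.
    """
    state = env.reset()
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(state)                       # a_t ~ pi(a | s_t)
        next_state, reward, done, _ = env.step(action)  # s_{t+1}, r_t
        agent.observe(state, action, reward, next_state, done)
        total_reward += reward
        state = next_state
        if done:                                        # terminal state reached
            break
    return total_reward
```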

2.1.2 Formulation

Reinforcement Learning problems are often stated as follows. An agent interacts with an environment $\mathcal{E}$ for a set of $T$ discrete time steps $t \in [0, T]$. At each time step the agent is given information about the current state of the environment $s_t$ and chooses an action $a_t$ to take. Given the environment dynamics, the current state and the selected action, the environment transitions to the next state $s_{t+1}$. When the agent transitions into a new state, it receives a reward $r_t \in \mathbb{R}$. The goal of the agent is to maximize the expected cumulative reward $R$.

$R = \sum_{t=0}^{T} \gamma^{t} r_{t}$    (2.1)

The $\gamma \in [0, 1]$ introduced in equation 2.1 is what is called a discount factor. When $T \to \infty$ the expected cumulative reward can become unbounded, and by introducing $\gamma$ this problem is avoided. Further, it also emphasizes the importance of short-term reward relative to long-term reward.
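As a concrete illustration of equation 2.1, the discounted return of a finite reward sequence can be computed as follows (a minimal sketch; the value gamma = 0.99 is only an example, not a value used in the thesis):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R = sum_t gamma^t * r_t for a finite sequence of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: three steps with reward 1 each and gamma = 0.99
# gives 1 + 0.99 + 0.9801 = 2.9701.
print(discounted_return([1.0, 1.0, 1.0]))
```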

2.1.3 Reinforcement Learning Tools and Ideas

As mentioned above, the goal is to maximize the expected cumulative reward, and to do this several different approaches have been suggested. The number of RL algorithms available is too large for the scope of this thesis. However, most if not all rely on a few simple, intuitive concepts and can be classified as belonging to different classes. These will be introduced briefly below.

Policy

The agent interacts with the environment by choosing an action $a_t$ at each time step $t$. The goal of the policy is to find the best possible action $a_t$ for the agent to take at each time step. What can be considered the best action is highly dependent on the current state. Therefore the policy is a mapping from state to action.

$a_t \sim \pi(a \mid s_t)$    (2.2)

An optimal policy $\pi^*$ may also be introduced. The policy $\pi^*$ is defined to select the best possible action at each time step, such that the expected cumulative reward is maximized:

$\mathbb{E}_{\pi^*}[R_t] \geq \mathbb{E}_{\pi'}[R_t]$ for all $\pi'$    (2.3)

State Value Function

The state value function (STV) evaluates the expected cumulative reward obtained by following the current policy from the current state. In other words, if the agent follows the policy from here on out, the expected cumulative sum of rewards is $V_\pi(s_t)$.

$V_\pi(s_t) = \mathbb{E}_\pi[R_t \mid s_t]$    (2.4)

State-Action Value Function

Similarly to the state value function, the state-action value function evaluates the expected future reward given the current state $s_t$. However, it also accounts for how the expected return is affected by which action $a_t$ is chosen.

$Q_\pi(s_t, a_t) = \mathbb{E}_\pi[R_t \mid s_t, a_t]$    (2.5)

Similarly to the optimal policy, an optimal state-action value function can be defined as the maximum reward achievable by following any policy $\pi$.

$Q^*(s, a) = \max_{\pi} \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$    (2.6)

The optimal state-action value function also follows the Bellman equation.

$Q^*(s, a) = \mathbb{E}_{s'}\big[r + \gamma \max_{a'} Q^*(s_{t+1} = s', a_{t+1} = a') \mid s, a\big]$    (2.7)

The state-action value function evaluates what the future expected cumulative sum of rewards will be if one chooses a particular action and then follows the current policy.

Advantage Function

The advantage function is a combination of the two functions described above. The function calculates the advantage of taking action $a_t$ over following the action suggested by the current policy $\pi$. This is done by calculating the difference in future expected discounted cumulative sum of rewards depending on which action $a_t$ is taken, given the state $s_t$.

$A_\pi(s_t, a_t) = Q_\pi(s_t, a_t) - V_\pi(s_t)$    (2.8)

Model-Based

Reinforcement Learning algorithms are often classified as either model-based or model-free. The difference between the two is that a model-based method tries to build a model of how the environment behaves. This is often formalized in terms of the transition probability between states, given an action. A model-based strategy can use a planning approach. Thus, it can search for a good solution, selecting the best actions several time steps in advance. However, modeling the environment is often very hard due to stochasticity.

Model-Free

Contrary to model-based methods, a model-free approach does not build a model of the transition probability and does not plan ahead in a search fashion. Rather, a good model-free approach learns what the best possible action is for the current step only. This is a clear drawback, since the ability to plan might be very useful. However, the lack of a model enables the use of non-temporally correlated samples, something that will be discussed later in this thesis.

Replay Buffer

Parts of the formalized RL problem described in section 2.1.2 are often saved into what is called a Replay Buffer. A Replay Buffer is used to store experiences gained by the agent in memory, such that they can be re-experienced and thus used for learning at a later point. The data stored in a Replay Buffer can vary, but commonly the following tuple is stored: $(s_t, a_t, r_t, s_{t+1})$. These four data points are the previous state, the action performed, the resulting reward and the resulting state. When a model-free algorithm is used, the replay buffer is a very important tool that enables learning from non-temporally correlated data, thus allowing the reuse of experiences.
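A minimal replay buffer along these lines can be implemented with a fixed-size deque; the uniform sampling below is only a sketch (prioritized variants are discussed in section 3.1.6), and the capacity is an arbitrary placeholder.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s_t, a_t, r_t, s_{t+1}) tuples and samples uncorrelated mini-batches."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped first

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive experiences.
        return random.sample(self.buffer, batch_size)
```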

Temporal Difference Error Signal

$R = \sum_{t=0}^{T} \gamma^t r_t = r_0 + \sum_{t=1}^{T} \gamma^t r_t = r_0 + \gamma r_1 + \sum_{t=2}^{T} \gamma^t r_t$    (2.9)

As previously stated, the ultimate goal for the agent is to maximize the cumulative reward. This goal is described mathematically in equation 2.1, and as seen in equation 2.9 it can also be expanded. Although both equations are equivalent, the expanded notation captures the essential idea behind all the concepts described above. One can see it as dividing the cumulative reward into two parts: the reward achieved up until time point $t$ and that which is expected, represented by the rewards before the summation and in the summation respectively. In both the state-value function and the state-action value function the future reward (which can be seen as captured in the summation) is estimated and maximized. Each time step an approximation is calculated, and this approximation is used both when applying the algorithm and when training it. The discrepancy between the estimated future reward in one time step and that which is achieved is often called the temporal difference error and is used in most RL algorithms to evaluate and improve performance.

For example, if a state-value function is used, the value function $V(s_t)$ in time step $t$ should be equal to $r_t + \gamma V(s_{t+1})$ in time step $t+1$, if the value function can predict the future discounted reward perfectly. This is often not the case; rather, there is a difference between the two, which is called the temporal difference error $e(t) = (r_t + \gamma V(s_{t+1})) - V(s_t)$. This error is often used to train the model used to approximate the state-value function or another expected discounted future reward function.
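In code, the one-step temporal difference error is simply the difference between the bootstrapped target and the current estimate. The sketch below assumes `value` is some callable state-value approximator; the terminal-state handling is a common convention, not something specified in the text above.

```python
def td_error(value, state, reward, next_state, gamma=0.99, done=False):
    """e(t) = (r_t + gamma * V(s_{t+1})) - V(s_t).

    At terminal states the bootstrap term is dropped, since no further
    reward can be collected.
    """
    target = reward + (0.0 if done else gamma * value(next_state))
    return target - value(state)
```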

2.2 Deep Learning in Reinforcement Learning

The basic idea behind RL is, as previously mentioned, to find the best possible mapping between state and action to maximize the cumulative reward. Historically this has been done using a number of different approaches.

One example is Q-Learning, which relies on a state-action value function $Q_\pi(s, a)$ that estimates the quality of state-action combinations such that the agent can choose the pair with the highest expected reward. The core of the algorithm is value iteration, which is used to update the estimations. The state-action value function has historically been represented by arrays, such that a state maps to an action, and if a state has not been visited before there is no information as to what action might be suitable. Further, whenever a state or action space is large, the arrays necessary become impossible to store in memory and traverse. Some kind of approximation is therefore necessary.

Linear models have been successfully used as approximations of value functions, policies etc. Not only can such models easily fit in memory, they can also generalize. This means that if an agent is in a state it has not previously visited, the linear model can extrapolate what might be an appropriate action, based on knowledge about similar states. Both array-based methods and methods based on linear models provide a convergence guarantee.

Although linear models provide some generalizability and are better suited for large state and action spaces than array-based methods, the linearity can also be a problem. Many problems might be highly non-linear and complex. In these cases a linear model can over-generalize and create biases in the solution.

To combine the complex policies that array-based methods can represent with the generalizability of linear models, as well as linear models' ability to tackle large state and action spaces, non-linear approximators have been suggested as a suitable approach.

Following the success of deep learning in other Machine Learning areas such as image classification and pattern recognition [18], attention has once again been focused on non-linear models as approximators for RL models. Although the lack of convergence guarantees is still a problem, training Deep Neural Networks as non-linear approximators has shown promise.

The spike of interest in using Deep Neural Networks in Reinforcement Learning largely followed the success of Mnih et al. [26]. In their paper Human-level control through deep reinforcement learning they successfully taught an agent to play a variety of Atari games completely from scratch. The agent was not only able to achieve scores on par with human players in most games, but was also able to outperform humans in a range of games. Following their success a number of other Deep Reinforcement Learning algorithms have been suggested; several of these will be presented in the following section.

2.3 Deep Reinforcement Learning Algorithms

The number of RL algorithms that have been suggested is too large for the scope of this project. However, a few of the most successful and relevant ones will be presented. The goal of all the algorithms described later in this section is to maximize the future cumulative reward of equation 2.1. The general idea is for an agent to gain experience by interacting with the environment it is in, and then to use the experiences gained to train and improve one or several deep neural networks (DNN) acting as function approximators. The algorithms described below differ in many aspects, such as how the agent explores the environment, what loss functions the DNN optimizes, how often the DNNs are updated, and whether the algorithm is meant for a discrete or continuous action space.

2.3.1 Deep Q-Learning

The DQN algorithm suggested by Mnih et al. [26] in the paper Human-level control through deep reinforcement learning assumes a discrete action space $\mathcal{A} = \{1, \ldots, k\}$ and a continuous multidimensional state space $\mathcal{S} \subseteq \mathbb{R}^d$. The basis of the algorithm is a State-Action Value Function (section 2.1.3) that is approximated using a Deep Neural Network (DNN).

The State-Action Value function approximates the expected future reward based on what action is taken in the current state. An action is selected in each time step using what is called an $\epsilon$-greedy approach. This means that with probability $1 - \epsilon$ it selects the action with the maximum expected future reward, and with probability $\epsilon$ it selects a uniformly random action out of the $k$ available. $\epsilon$ is a hyper-parameter which controls the ratio between the agent's tendency to explore vs. exploit.
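The epsilon-greedy rule amounts to only a few lines of code. The sketch below assumes `q_values` is an array of the $k$ estimated state-action values for the current state; the default epsilon is an illustrative value only.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a uniformly random action,
    otherwise the action with the largest estimated value."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore
    return int(np.argmax(q_values))               # exploit
```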

In each time step the agent interacts with the environment, and a tuple of data is recorded and stored in a replay buffer. Every few time steps the replay buffer is sampled and a mini-batch of uncorrelated samples is gathered. This mini-batch is then used to train the DNN. The DNN is trained by minimizing the square of the temporal difference error, such that the State-Action Value function is updated using gradient descent to become more and more accurate.
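The regression target used in that update follows directly from the Bellman equation 2.7. The sketch below is framework-agnostic and assumes `q_target` returns the target network's value estimates for a batch of states; the squared difference between these targets and $Q(s_i, a_i)$ is what is minimized by gradient descent.

```python
import numpy as np

def dqn_targets(rewards, next_states, dones, q_target, gamma=0.99):
    """y_i = r_i + gamma * max_a' Q_target(s'_i, a'), with the bootstrap
    term removed for terminal transitions (dones is a 0/1 array)."""
    next_q = q_target(next_states)        # shape: (batch_size, num_actions)
    max_next_q = next_q.max(axis=1)       # greedy value of the next state
    return rewards + gamma * (1.0 - dones) * max_next_q
```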

2.3.2 Deep Deterministic Policy Gradient

In the paper Continuous Control with Deep Reinforcement Learning, Lillicrap et al. [23] suggest an approach for transferring the success of DQN to the continuous domain. They present a model-free, off-policy actor-critic algorithm that can be applied to high-dimensional, continuous action spaces, using DNNs as function approximators. They call this algorithm Deep Deterministic Policy Gradient (DDPG) and build upon the DPG algorithm of Silver et al. [31].

Similarly to the DQN algorithm, a State-Action Value Function is used to evaluate how valuable a state-action pair is. However, contrary to the DQN algorithm, an action cannot be chosen by selecting the maximum resulting State-Action Value. Since DQN operates in a discrete action space, the optimization over future reward can be done by traversing the finite number of actions available. In the case of a continuous action space, the optimization required to find the best action values is too demanding. Instead a separate parametrized action policy is used, which deterministically maps state to action.

The use of a separate action policy and a State-Action Value Function is called actor-critic. The policy selects an action and the critic evaluates the selected action, such that the policy can be improved. However, the critic also has to be updated, which is done in the same fashion as in DQN. The actor, on the other hand, is updated using the policy gradient described in equation 2.10 to update its parameters $\Theta^\pi$.

$\nabla_{\Theta^\pi} L \approx \mathbb{E}_{s_t \sim \rho}\big[\nabla_{\Theta^\pi} Q(s, a \mid \Theta^Q)\big|_{s=s_t, a=\pi(s_t \mid \Theta^\pi)}\big] = \mathbb{E}_{s_t \sim \rho}\big[\nabla_{a} Q(s, a \mid \Theta^Q)\big|_{s=s_t, a=\pi(s_t)} \, \nabla_{\Theta^\pi} \pi(s \mid \Theta^\pi)\big|_{s=s_t}\big]$    (2.10)

Similarly to DQN, a replay buffer is used and mini-batches are sampled to update the network. All experiences are gained by following the current policy. To avoid re-exploring the same situations, exploration noise is added to the selected action.

The novelty in their approach is that by combining the DPG algorithm with ideas from DQN, training Deep Neural Networks as function approximators can be done in a stable and robust way. The ideas introduced in DQN that are incorporated with DPG into the DDPG algorithm are the use of a replay buffer and a target network. The former decorrelates the samples, while the latter stabilizes training.

The actor-critic architecture is implemented as a policy network mapping states to actions, while the critic is a Q-value network which evaluates the action value. The actor is updated by following the policy gradient, while the critic is updated based on a temporal difference error.
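Two of the stabilizing ingredients mentioned above, exploration noise on the deterministic action and a slowly tracking target network, are easy to state in code. The sketch below uses simple Gaussian noise and Polyak averaging; the noise scale and tau values are placeholders, and the original DDPG paper in fact uses an Ornstein-Uhlenbeck process for the exploration noise.

```python
import numpy as np

def noisy_action(policy, state, noise_scale=0.1):
    """Deterministic action from the policy plus exploration noise."""
    action = policy(state)
    return action + noise_scale * np.random.randn(*np.shape(action))

def soft_update(target_params, online_params, tau=0.001):
    """Polyak averaging of the target network parameters:
    theta_target <- tau * theta + (1 - tau) * theta_target."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]
```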

2.3.3 A3C

Asynchronous Advantage Actor-Critic (A3C), suggested by Mnih et al. in the paper Asynchronous Methods for Deep Reinforcement Learning [25], is an on-policy algorithm, meaning that it can only learn from experiences gained with the current version of the policy, contrary to the previously described methods which could sample old experiences. It works in both continuous and discrete action spaces. The actor is a policy which maps state to action, while the critic is a value function evaluating the utility of being in a particular state. Contrary to DQN and DDPG described above, A3C uses a mix of n-step returns to update both the actor and the critic. In this context n-step return means that the reward information from several time steps is used as the target when updating the value function.

$V(s_0) = r_0 + \gamma V(s_1)$    (2.11)
$V(s_0) = r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 V(s_3)$    (2.12)
$V(s_0) = r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots + \gamma^n V(s_n)$    (2.13)
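The n-step targets in equations 2.11 to 2.13 can be computed by walking backwards through a rollout segment. The sketch below is a generic illustration of that computation, not the A3C implementation itself.

```python
def n_step_targets(rewards, bootstrap_value, gamma=0.99):
    """Compute n-step return targets for a rollout segment.

    rewards: [r_0, ..., r_{k-1}] collected under the current policy.
    bootstrap_value: V(s_k), the critic's estimate at the last state.
    Returns [R_0, ..., R_{k-1}], where
    R_i = r_i + gamma * r_{i+1} + ... + gamma^{k-1-i} * r_{k-1} + gamma^{k-i} * V(s_k).
    """
    targets = []
    running = bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running
        targets.append(running)
    return list(reversed(targets))
```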

The algorithm uses a policy $\pi(a_t \mid s_t; \Theta)$ and a value function $V(s; \Theta_v)$. These two are combined in one neural network, where all parameters $\theta$ are shared except the last layer, which outputs actions and value estimates separately. Both $\pi$ and $V$ are updated every $t_{max}$ time steps or when the agent reaches a terminal state. Using gradient ascent, the neural network is then updated with the gradient as seen in equation 2.14.

$\nabla_{\theta^\pi, \theta^v} L = \nabla_{\theta^\pi, \theta^v} \log\big(\pi(a_t \mid s_t; \theta^\pi)\big) A(s_t, a_t; \theta^\pi, \theta^v) = \nabla_{\theta^\pi, \theta^v} \log\big(\pi(a_t \mid s_t; \theta^\pi)\big) \Big( \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}; \theta^v) - V(s_t; \theta^v) \Big)$    (2.14)

The most important contribution of this work is the fact that no experience replay is used. The authors show that a DNN can learn a stable and efficient solution without such a replay buffer. Instead, this approach relies on parallel executions of the agent, collecting a sufficient amount of different experiences. Beyond the reduced wall-clock time, the benefit of multiple parallel executions is that the different agents will collect experiences that are distinct enough that the problem of correlated samples is avoided.

Further, the authors also build upon the idea of adding an entropy term to the policy, as suggested by Williams & Peng [37]. The entropy term is said to improve exploration, thus reducing the probability of the policy converging to a sub-optimal solution.

2.3.4 TRPO

Trust Region Policy Optimization (TRPO) is an algorithm suggested by Schulman et al. [30] in a paper bearing the same name. It is an on-policy algorithm applicable in both continuous and discrete domains, with some modifications. The algorithm is an attempt to deal with the problem of gradient step size when updating policies represented by neural networks. Many other methods such as DDPG suffer from instability because of the effect that taking too large steps can have. Therefore Schulman et al. introduce what is called a Trust Region. A Trust Region can theoretically be seen as a region in which all updates to the parameters are guaranteed to improve the policy.

The theoretically justified procedure guarantees monotonic improvement. However, to make the algorithm feasible to implement, a number of approximations are necessary. This means that the guarantee no longer holds, but results show that TRPO still achieves near monotonic improvement.

Like the previously described algorithms, TRPO aims to maximize the discounted future reward. Instead of directly minimizing a loss function and applying gradient descent steps, like the previous algorithms, the goal is to optimize a constrained problem.

$\underset{\theta}{\text{maximize}} \;\; \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \hat{A}_t\right] \quad \text{subject to} \quad \hat{\mathbb{E}}_t\!\left[\mathrm{KL}\big[\pi_{\theta_{old}}(\cdot \mid s_t), \pi_\theta(\cdot \mid s_t)\big]\right] \leq \delta$    (2.15)

In equation 2.15, $\theta$ represents the policy's parameters, KL is the Kullback-Leibler divergence and $\delta$ a threshold. The hat over the advantage function and the expected value indicates that they are estimates of the actual values.

Equation 2.15 is similar to other policy gradient objectives; however, $\hat{A}_t$ denotes the advantage function between $\pi_\theta$ and $\pi_{\theta_{old}}$. The constraint, on the other hand, is new and is what limits the update of the network to stay within a trusted region. By using conjugate gradient methods the constrained optimization problem can be solved.

2.3.5 PPO

Proximal Policy Optimization (PPO) is another algorithm suggested by Schulman et al. [29]. The algorithm builds on the same ideas as TRPO. However, it approximates several aspects of the calculations required in TRPO, which makes PPO significantly easier to implement while empirically performing better than TRPO.

Instead of performing complex and computationally expensive computations to calculate a trust region, PPO approximates a region by clipping the allowed range of updates. This update strategy is referred to as Conservative Policy Iteration (CPI). To further limit the region in which updates are allowed, clipping of the allowed ratio between new and old policies $r_t(\theta)$ is enforced. The region is regulated by the hyper-parameter $\epsilon$.

$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \hat{A}_t\right]$    (2.16)

$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\big(r_t(\theta)\hat{A}_t, \; \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\big)\right]$

Equation 2.16 is part of the loss used in PPO to train the policy. It can be seen as an adaptation of the constrained problem in TRPO, but the approximation makes it possible to use it directly as an objective when optimizing with Stochastic Gradient Descent.
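A numerically explicit sketch of the clipped part of the objective in equation 2.16 is given below; it operates on batches of probability ratios and advantage estimates and is only an illustration, not the implementation used in the thesis.

```python
import numpy as np

def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """L_CLIP = E[min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)].

    ratios: pi_new(a_t|s_t) / pi_old(a_t|s_t) for a batch of samples.
    advantages: the corresponding advantage estimates A_t.
    The returned value is maximized (equivalently, its negation is
    minimized with stochastic gradient descent).
    """
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return np.mean(np.minimum(unclipped, clipped))
```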

Beyond the approximation made to limit the region in which the policy is allowed to update, the authors also add an entropy term to ensure the agent explores sufficiently. This entropy term is added to $L^{CLIP}$ and is optimized with respect to $\theta$ when a terminal step is reached.

Chapter 3

Related Work

In this thesis two research areas have been identified as particularly relevant: Reinforcement Learning for Robotic Control and Human-Robot Interaction. Both these areas are closely connected to the research questions, in regard to what methods exist to automatically learn robotic control, how to tackle the problem of human-robot interaction at large, and in particular how to optimize human-robot collaboration. This related work chapter covers the relevant papers and ideas related to the two subjects.

3.1 Reinforcement Learning for Robotic Control

Following Mnih et al.'s [26] successful application of Deep Q-Learning to a number of Atari games, several authors have explored the possibility of using Deep Reinforcement Learning for Robotic Control. Similar to how Atari game play was mastered while relying only on raw pixel data and reward, researchers have explored how raw data can be used to learn Robotic Control Policies.

3.1.1 End-to-End Visuomotor Policies

Mnih et al.'s [26] DQN algorithm relied only on visual input. The possibility of depending only on visual data is one that has been explored for Robotic Control as well. This use of visual data as input and motor actions as output is often referred to as Visuomotor Policies.


Further, particular interest has been focused on what are called End-to-End Policies, which refers to learning the mapping between input and output as one.

The idea of being able to train and use End-to-End Visuomotor Policies holds great promise. However, training such policies has proven extremely difficult and inefficient, foremost because of the large observation and action spaces in most problems, which make exploring and making sense of the environment extremely difficult for the agent. Several approaches have been suggested to counteract this problem. These will be touched upon in the following sub-sections.

3.1.2 Training in Simulation

Using a simulated robot is a very practical tool when learning a policy. Physical robots are often very expensive, can be dangerous to work with and often require a human present during the full training time. Further, while each physical interaction takes a certain time, simulators can often run at speeds much higher than wall-clock time. Therefore millions of time steps can be run in simulation, which is unfeasible using physical robots. However, the end goal is often to use the policy on physical robots; therefore several papers have investigated how policies trained in simulation can be transferred to physical systems.

Zhang et al. [40] examined the applicability of Deep Q-Learning on a Robotic Control task involving 2-dimensional reaching. A robot with three degrees of freedom is tasked with placing its end-effector at a goal position. The problem is first built in simulation, and the authors then try to transfer the learned policy to a physical Baxter robot, without success. The learning algorithm used is the same DQN as the one used by Mnih et al. [26]. However, while the Atari game world has a discrete action space, the robot's actuators operate continuously. Therefore either the action space or the algorithm has to be adapted for the problem to be solved with the Deep Q-Learning algorithm. The authors choose to discretize the action space, such that each actuator can change the joint angle it is controlling by {-2, 0, 2} degrees of rotation. This strategy proved successful in simulation; the robot was able to learn how to reach the target area with high accuracy. However, when transferring the learned policy to a real robot with an identical problem setup, zero percent of their trials were successful. The authors find that when substituting the camera image of the physical system with a simulated image of the current system, the success rate becomes consistent with the simulated robot. The authors draw the conclusion that the perception layers are the problem.

While Zhang et al.'s [40] application of DRL to a Robotic Control task showed some success, it exhibits many of the problems related to applying DRL to Robotic Control: the problem of transferring policies from simulation to a physical system, building a robust perception system, adapting DRL algorithms to the state and action space of the robot and, above all, learning a good policy.

The problem of creating perception layers which are able to work both in simulation and in the physical world is one that several authors have struggled with [40], [21]. Many ways of improving the transferability of policies trained in simulation have been suggested. One is to randomize the position of the camera when training, such that the algorithm is robust to minor differences in camera angles [15]. Another is training with random objects in the background [21], or adding noise to the raw pixel data [40]. These methods have shown limited success, but work done by Tobin et al. [33] and James et al. [15] using texture randomization in simulation has shown great promise. Both their approaches use different non-realistic textures while training.

3.1.3 Imitation Learning

Many Reinforcement Learning problems use a problem setup such that no instructions are given to the agent. Thus, it is up to the agent to explore and learn how to behave such that the expected cumulative reward is maximized. When the observation and state spaces are small this is feasible to do in a limited timescale. However, when using visual data and continuous actions, the observation and action spaces are hard and time consuming to explore. To counteract this problem one option is to use demonstration, which is similar to how humans learn quicker when instructed how to perform a task.

Levine et al. [21] developed an end-to-end visuomotor policy that could learn considerably quicker than the approach suggested by Zhang et al. [40] mentioned above. Levine et al. suggest using a method called Guided Policy Search (GPS), developed by Levine et al. [20]. This method shifts the RL problem into a Supervised Learning (SL) problem, thus enabling the learner to use modern SL methods for training. Data is provided using a trajectory-centric approach where a simplified version of the problem is solved using optimal control solvers. The result of the approach is a significant reduction of training time. Further, the results also suggest that an end-to-end approach actually outperforms methods for which perception and policy are trained separately. However, the use of an optimal control solver greatly reduces the generalizability of the solution.

The approach of using optimal control solvers to guide the learner to a good solution drastically decreased the number of samples necessary for the learner to find a good solution. Another method to guide the learner is to pre-train it with expert examples and then fine-tune the performance using RL, as done by Kober et al. [16]. This approach has also been implemented using the latest RL methods and resulted in learning speedups of an order of magnitude [27].

3.1.4 Autoencoder Learning

The problem of large state and action spaces reducing the sample efficiency of RL algorithms has historically been tackled by hand-crafting low-dimensional features, which requires human intervention. To automate this process several authors have suggested compressing the state space, such that high-dimensional data is encoded in a more efficient manner. This enables the use of smaller networks, and the smaller representation results in quicker exploration of the state space.

Finn et al. [8] used this approach; they pretrained an autoencoder with robot-arm positions. Once the autoencoder was properly trained, the encoder part could be used as a way of compressing the state space into a smaller representation. Thereafter it could be used to train a smaller network requiring less data.

Beyond using an encoder to compress the information in the state space, autoencoders have been used in other capacities to streamline RL in complex environments. Ghadirzadeh et al. [10] used a decoder such that a lower-dimensional representation of a large action space can be explored efficiently, thus making it feasible to tackle problems with both large state and action spaces.

Similarly to Finn's and Ghadirzadeh's work, van Hoof et al. [13] also used autoencoders to learn a policy. However, their approach was not only applied to visual data; the authors also applied autoencoders to high-dimensional sensory data, as a way of extracting the important information from very high-dimensional input rather than doing so by hand. As with the visual data, the goal is to eliminate unnecessary dimensions and to filter out noise. They show that without using an autoencoder to de-noise and compress the input, the task cannot be solved.

3.1.5 Domain Randomization and Large Scale Data Collection

Training DRL algorithms often requires an exorbitant number of data points, often in the order of millions. This is often not feasible on a physical robot; therefore simulations have been used instead. However, a policy trained in simulation is often ineffective on physical robots, and historically it has been hard to transfer a policy from simulation to a physical robot.

To enable the use of DRL on real robots several approaches have been suggested, although many rely on human intervention. Nevertheless, two different approaches have shown special promise: large scale data collection and Domain Randomization.

By using several robots, data collection can be done in parallel and thus speed up the process of gathering experiences. Both Gu et al. [12] and Levine et al. [22] suggest an approach where multiple robots are used simultaneously, resulting in a data collection speedup close to linear in the number of robots. This makes simple tasks feasible to train on real robots. However, a number of problems related to using physical robots still remain, such as cost, physical danger and time consumption.

Although training in simulation has shown promise, transferring the learned policies from simulation to physical robots has proven challenging. These challenges are due both to inaccuracies in the physics simulator and to the visual rendering of the environment. To counteract the second problem, Tobin et al. [33] and James et al. [15] suggest domain randomization. The main idea in both papers is that by randomizing the textures of all surfaces in simulation, the texture and thus appearance of surfaces in real-life physical settings become irrelevant to the policy. This approach enabled the authors to use the policy trained in simulation on a physical robot directly. The approach also enabled the algorithm to ignore irrelevant background objects and changing light conditions.

3.1.6 Prioritized Experience Replay

The use of a replay buffer has been pointed out as one of the major reasons behind the latest breakthroughs in RL. The replay buffer gathers the experiences of the agent in memory. Throughout training the buffer can then be randomly sampled, so that large amounts of temporally uncorrelated data can be used for training. However, all experiences might not be equally useful. This is an idea that Schaul et al. [14] explored and formalized as Prioritized Experience Replay. The authors suggest that all experiences are ranked based on how important the samples are. The importance can be defined in several ways, but the approach suggested in their paper uses a normalized, exponentially decaying probability of choosing an experience based on the estimated temporal difference error. The authors show that this significantly decreases the time necessary for the algorithm to learn.
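In practice, this kind of prioritization boils down to turning per-sample TD errors into a sampling distribution. The sketch below uses the common proportional form from the literature; the exponent alpha and the small constant are conventional placeholders, not values taken from the cited paper or from this thesis.

```python
import numpy as np

def priority_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Convert TD errors into sampling probabilities:
    p_i = (|delta_i| + eps)^alpha / sum_j (|delta_j| + eps)^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()

def sample_indices(td_errors, batch_size, alpha=0.6):
    """Sample mini-batch indices with probability proportional to priority."""
    probs = priority_probabilities(td_errors, alpha)
    return np.random.choice(len(td_errors), size=batch_size, p=probs)
```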

Another approach to increase the efficiency of the algorithm without requiring more interactions with the environment is the use of Hindsight Experience Replay (HER). HER is an alteration applicable to basically any off-policy RL algorithm. Andrychowicz et al. [2] suggested the method as a way of learning from experiences that did not succeed in environments with sparse rewards. The idea is that, similarly to how humans learn from failing or nearly succeeding, so should RL algorithms be able to. This is done by including the goal in the input to the function approximator, for example a DNN. Say for example that the goal is an x and y coordinate; then the original goal and reward can be swapped for the coordinates actually achieved and the corresponding reward. This enables the algorithm to learn in environments with sparse rewards and can speed up learning. This approach was used to successfully learn Robotic Control Policies for pushing and nudging objects with sparse rewards.
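In code, HER amounts to relabeling stored transitions with a goal that was actually achieved and recomputing the reward for that goal. The sketch below is only an illustration: the transition layout (state, action, reward, next state, achieved goal) and the sparse reward function are hypothetical, not the setup used by the cited authors.

```python
import numpy as np

def sparse_reward(achieved_goal, goal, tol=0.05):
    """Hypothetical sparse reward: 0 if the goal is reached, -1 otherwise."""
    dist = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(goal))
    return 0.0 if dist < tol else -1.0

def her_relabel(transition, new_goal):
    """Swap the original goal for `new_goal` (e.g. the coordinates actually
    achieved later in the episode) and recompute the reward accordingly."""
    state, action, _, next_state, achieved = transition
    return (state, action, sparse_reward(achieved, new_goal), next_state, new_goal)
```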

3.2 Human-Robot Collaboration

In the area of HRC, tasks are often formulated as belonging to a fixed role distribution between human and robot. Either an equal role sharing can be utilized, or either part can act as a leader and the other as follower. One of the most common approaches is to let the human act as leader [9].

In the case of the human collaborator acting as the leader, several different approaches have been suggested, with varying degrees of complexity. Wojtara et al. [39] proposed a simple approach to jointly placing an object: the human selects a trajectory and the robot follows while only compensating for the load. Kosuge et al. [17] proposed using impedance control and tested this with one degree of freedom (DOF). The same idea was tested by Duchaine and Gosselin [5] with two and three degrees of freedom. However, when using impedance control on systems with four DOF or higher, it becomes hard for the robot to understand whether the user's intended action is translational or rotational.

Other, more complex methods which try to anticipate the future movement of human collaborators have also been suggested. Maeda et al. [24] proposed using the minimum jerk model to predict the human movement; however, this also requires the final position to be known prior to starting the task. Another proactive approach that requires prior knowledge was suggested by Gribovskaya et al. [11]. Their approach uses human demonstrations gathered by collecting data from the robot collaborator when controlled by a human. By training a dynamic forward model using the collected data, the robot could then use the model to interpolate future movement. Similar methods have also been used to train robots to be able to share the leader and follower roles. Bussy et al. [4] provided the robot with predefined trajectories combined with impedance control such that it can both follow a successful path and adapt to the collaborator.

A mixture of the leader-follower approach has also been suggested, which lets the collaborators hold both roles simultaneously. Agravante et al. [1] did this by dividing the DOF between the robot and the human. They tackled the task of balancing a ball at a set position on a table while moving the table between points A and B. The division of DOF was constant and determined by which DOF was used for the two sub-tasks of moving and balancing. The robot used an impedance control method when acting as a follower for the moving task, while using a PD controller for the balancing task, in which it took the role of leader.

Evrard and Kheddar [7] proposed a method using flexible role assignment. Instead of permanently assigning the role of leader to the human collaborator, the leader role can be switched to the robot and back. By endowing the robot with two control methods, one for leading and one for following, it can switch between them to optimize performance. In practice this means that when acting as the leader the robot minimizes the error distance to a predefined path, and as the follower it tries to minimize the forces at the robot-object contact point.

The methods covered above all rely on hand-crafted control policies. They can be complicated to formulate and are associated with several drawbacks. The impedance methods require well-calibrated sensor measurements [9]. Equal role sharing algorithms require predefined motion trajectories [9], and most of the methods mentioned also require parameter tuning by experts. Some work has been done to counter these problems and find control policies automatically. Ghadirzadeh et al. [9] proposed framing the task as an RL problem, giving the robot an objective function similar to that of the human collaborator, which results in equal role sharing. This results in collaborators which both try to reach the same goal and behave as if the other part will select actions that lead to the objective being reached. The method utilizes two elements, a forward model and an action value function. These two components are represented by Gaussian Processes which are trained using Q-Learning. Using this approach the authors successfully trained a policy which can collaborate with a human without high-level modeling of human actions. The authors attribute this to the fact that the problem can be grounded in the raw sensorimotor observation space of the robot.

The work described above has been successful at solving the tasks at hand; however, these tasks are often limited in their applicability to real-world situations. One major drawback is that most of the work within the area has been limited to problems of at most three DOF. While robots with three DOF might be useful in certain situations, this does limit the applicability of the research in regard to the goal of making human-robot collaboration useful in everyday scenarios. Three DOF might not be sufficient for many unstructured real-life problems in which human-robot collaboration might be practical. Whitsell and Artemiadis [36] explored the possibility of controlling a robot with six DOF to collaboratively place an object together with a human. The authors suggested a method in which a flexible leader-follower relation is established for each DOF. During usage the robot continuously evaluates whether the collaborator wants to act as leader or follower. The authors evaluate two different methods for swapping leadership: one based on force thresholds in each DOF and one RL approach, in which the robot is rewarded for switching when the human intended it to and penalized otherwise. The RL method outperformed the threshold approach, and the robot was able to successfully solve the placing task 100% of the time.

Chapter 4

Method

In this chapter a detailed description of the problem is presented, as well as the methods used to tackle it. Further, descriptions of the experiments performed are presented.

Figure 4.1: A visualization of the environment.


4.1 Problem

As described in the introductory chapter, a toy problem is used as a way of testing RL methods on human-robot collaboration situations. Due to the numerous challenges involved in implementing RL on physical robots, the experiments are carried out in a computer simulation.

The problem is formulated as a collaborative balancing task. A human and a robot both hold on to a table without legs, with a ball placed on top. Together they are tasked with moving the table in such a way that the ball follows a predefined path. This task is an adaptation of similar tasks in several recent studies.

The problem is implemented in a physics engine called MuJoCo [34]. The environment can be seen in figure 4.1. It contains two robots, one representing a human and the other a robot, two horizontal control bars which the robots hold on to, and the small ball meant to be controlled. Each control bar is connected by two threads to the board, similar to how a marionette doll is connected.

Since the experiments are carried out in a simulator, direct human interaction is not possible. Beyond this, the large number of time steps required to train most RL algorithms makes human control, for example via keyboard, infeasible. Instead, human movement is simulated by a Proportional Derivative controller, with the individual variation between human movement patterns emulated by varying the control parameters.

Goal

The task of jointly balancing a ball along a predefined path is here manifested by following an imagined goal position on the plane. This goal is meant to mimic the objective of the human collaborator. The imagined goal position is updated each time step and follows a circular path around the center of the board.

The goal is for the ball to constantly be as close to the desired position as possible. The deviation is measured using Euclidean distance and is meant to be minimized. Based on this distance, a reward function is later formalized in equation 4.5. The reward each time step is normalized such that a distance of zero results in a reward of 1.
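The exact reward function (equation 4.5) is defined later in the thesis and is not reproduced here. Purely as a placeholder, a distance-based reward that equals 1 at zero deviation and decays with the Euclidean distance could be sketched as follows; the exponential form and the scale parameter are hypothetical choices, not the thesis's reward.

```python
import numpy as np

def distance_reward(ball_xy, goal_xy, scale=5.0):
    """Hypothetical distance-based reward: 1 at zero deviation,
    decaying towards 0 as the ball moves away from the goal position."""
    distance = np.linalg.norm(np.asarray(ball_xy) - np.asarray(goal_xy))
    return float(np.exp(-scale * distance))
```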

4.2 Implementation

4.2.1 Physics Engine

The problem is simulated in the physics engine MuJoCo [34]. The physics engine is built with model-based control in mind. Further, MuJoCo is also the standard physics engine of OpenAI's RL toolkit Gym for developing and comparing RL algorithms [3]. Gym contains a wrapper for the MuJoCo framework and provides an API for using other programming languages than C++, which MuJoCo is built upon. Further, by standardizing the environment interactions, such that the same algorithm can be used for any other environment (not counting switching between discrete and continuous domains), it has become one of the most used RL toolkits. Due to these facts, Gym combined with MuJoCo was chosen as the simulator in which the experiments are carried out.

4.2.2 Robot Environment

The robots used in the simulator are of the Fetch type [38]. An open source model of the robot was built in the MuJoCo physics engine for OpenAI's Gym Robotic Fetch environments. The model includes CAD models of all parts, a correct representation of all actuators and joints, as well as an inverse kinematics solver. The model is used in a stationary mode and only the seven-DOF arm is active. However, the action space that is controlled is only six DOF, these being the xyz-coordinates of the end effector as well as the rotation around its pitch, yaw and roll axes.

The Fetch robot is controlled by providing changes in coordinates and angles, which are executed if they comply with the constraints posed by the robot and are physically possible in the simulation.

Type            Number   Dimension
Robot           2        See appendix
Ball            1        Radius 0.025 m
Board           1        0.6 x 0.6 x 0.02 m
Control Bar     2        0.02 x 0.6 x 0.02 m
Control String  4        0.2 m

Table 4.1: Physical objects in the simulation environment and their dimensions.

The simulation environment described in Section 4.1 contains a number of physical objects with dimensions listed in Table 4.1. Beyond the physical objects, a number of sensors are mounted on the system and provide information about the environment. The modalities from these sensors are listed in Table 8.1 along with any other available data.

4.2.3 Human Movement Simulation

As mentioned in Section 4.1, a Proportional Derivative (PD) controller is used to simulate human movement. A PD controller is used because of its simplicity, its lack of required system knowledge, and the possibility of emulating individual human movement patterns by varying the parameter values.

The PD controller is a feedback control loop mechanism which continuously calculates the error value e(t). In this application the error is the Euclidean distance on the table between the desired position and the actual position of the ball. Since the ball is intended to be in contact with the board at all times, the z coordinate of the ball in the board coordinate frame is constant. This leaves only the x and y axes for the ball to vary over; thus, only two DOF need to be controlled. These two DOF can be controlled by varying the angle around the roll axis and the z coordinate. Although the system can be controlled in six dimensions, only two are needed to reach sufficiently high accuracy on the task. Agravante et al. [1] performed a similar balancing task and found that two DOF were sufficient to accurately maneuver a ball to a desired position.

The PD controller accounts for the error in the x and y directions separately; each dimension is thus controlled independently. The error e_x is controlled by varying the z coordinate and e_y by the roll angle.

e_x(t) = x_{reference}(t) - x_{position}(t)    (4.1)

e_y(t) = y_{reference}(t) - y_{position}(t)    (4.2)

u_x(t) = K_{p,x} e_x(t) + K_{d,x} \dot{e}_x(t)    (4.3)

u_y(t) = K_{p,y} e_y(t) + K_{d,y} \dot{e}_y(t)    (4.4)

Figure 4.2: PID Controller Overview

The system relies only on the proportional and derivative parts of the PID controller. This is done by setting K_{i,x} = K_{i,y} = 0. Further, setting K_{d,x} = K_{d,y} and K_{p,x} = K_{p,y} was empirically tested and found to result in the desired behaviour when simulating human motion. The numerical values used for K_d and K_p are varied between the experiments.
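As a minimal sketch of how the emulated human could be driven, the code below implements the control law of Equations 4.1-4.4 with the integral gains set to zero. The class name, time-step handling and finite-difference derivative are illustrative assumptions, not the exact implementation used in the experiments.

import numpy as np

class PDController:
    """PD controller emulating the human collaborator (Equations 4.1-4.4, K_i = 0)."""

    def __init__(self, kp, kd, dt):
        self.kp, self.kd, self.dt = kp, kd, dt
        self.prev_error = np.zeros(2)                       # previous (e_x, e_y)

    def control(self, ball_xy, goal_xy):
        """Return (u_x, u_y); u_x drives the z coordinate, u_y the roll angle."""
        error = np.asarray(goal_xy) - np.asarray(ball_xy)   # (e_x, e_y)
        d_error = (error - self.prev_error) / self.dt       # finite-difference approximation of the derivative
        self.prev_error = error
        return self.kp * error + self.kd * d_error          # independent PD law per axis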

4.2.4 Reinforcement Learning Algorithm

While a PD controller was used to simulate the human collaborator, the robot was controlled using the Proximal Policy Optimization (PPO) algorithm. PPO is currently a state-of-the-art algorithm for RL problems and is used as the go-to algorithm for OpenAI's Gym environments [29]. The fact that it embodies the stability of TRPO while being easy to implement and requiring less time to learn good policies makes it a good choice. Other algorithms such as DDPG and TRPO were tested, but performed significantly worse, and all experiments have therefore been performed using PPO. The PPO implementation used is an adaptation of OpenAI's baselines code [6].

The use of Prioritized Experience Replay (PER) and Hindsight Experience Replay (HER), as described in Section 3.1.6, has shown promise for robotic control [2], especially in sparse-reward settings. However, both PER and HER require off-policy algorithms and are therefore not currently possible to use in combination with PPO [28]. Further, since PPO has been shown to outperform DDPG combined with HER [28], neither PER nor HER has been included.

4.2.5 Observation Space

The observation space of the agent varies between experiments, and an exact list of which sensor modalities are available in each experiment can be found in the respective experiment section. A list of all available sensor modalities can be found in Table 8.1. These include force, torque and velocity sensors as well as coordinates of relevant bodies.

The data and sensor values available from the MuJoCo simulator are vast. However, to make the available data realizable on a real robot, only sensor values and coordinate readings known to be feasible in other similar research problems on physical systems have been used. For example, although the coordinates of the ball on the table are explicitly calculated in the simulator, ball coordinates can also be obtained using basic image recognition systems, as done by, for example, Agravante et al. [1].

Several of the papers mentioned in the related work section tackled problems with only visual data as input; this is not the case for this thesis. The choice of not including visual data is based on several factors. First of all, visualization of the environment is significantly more time consuming and requires more computational resources for each time step. Second, the DNNs used have to be adapted and expanded to handle visual data, which is both more computationally expensive and harder to train, adding another level of complexity that is not necessary to answer the research question. Lastly, the information extracted from visual data is often hand crafted and easily found using simple image analysis methods. While Levine et al. [20] found that training the perception layer in combination with the rest of the system was slightly better than training them separately, the advantage was not significant, and other projects such as [8] and [10] require sophisticated pre-training, which decreases generalizability.

The methods of Finn et al. [8] and Ghadirzadeh et al. [10] relied on spatial auto-encoders; while successful, their approaches were not used in this thesis because of the need for pretraining and the lack of generalizability. Instead, since simpler imaging methods are powerful enough for the task of tracking an object, visual data was disregarded and coordinates were collected straight from the simulator.

4.2.6 Design Choices

DRL for Robotic Control

The numerous approaches to designing Robotic Control Policies described in the related work section all have their own perks and drawbacks. Broadly speaking, the most significant difference between them is how and to what extent human intervention is used. Regardless of whether humans are involved by providing example path trajectories, designing feedback control loop systems or overseeing the progress of a physical robot's learning, human involvement has a significant effect on the performance and feasibility of the method.

While the methods that rely heavily on human intervention often outperformed those that did not, such reliance is also limiting in other senses. As stated previously, the goal is to use as little human intervention as possible when training the algorithm. Therefore, a number of choices were made.

While implementation on physical robots is the end goal, training on physical systems has several drawbacks, as mentioned previously. Therefore, training in simulation might be preferred, following the success of Tobin et al. [33] and James et al. [15], who showed that using well-designed models and Domain Randomization, a policy learned in a simulator can be transferred straight to a physical robot. Further, to fully utilize the advantage of being able to perform millions of time steps in simulation during a much shorter wall-clock period, human involvement and interaction in the simulator is not used, since it would result in a loss of the time-steps-per-second gains associated with the simulator.

Altogether, all human involvement, specific pretraining or guiding by a human was avoided, except for the PD controller's emulation of human behaviour. Instead, the algorithm relies completely on learning from raw data using the PPO algorithm and two feed-forward neural networks, one representing the value function V(s_t) and the other the policy π(a_t | s_t). Both networks used three fully connected hidden layers with tanh activation functions, each layer consisting of 64 neurons.
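A minimal sketch of these two function approximators is given below: a policy network and a value network, each with three fully connected hidden layers of 64 tanh units. The framework (PyTorch), the interpretation of the policy output as an action-distribution mean, and the placeholder input and output sizes are assumptions made for illustration; the baselines implementation used in the experiments may differ in these details.

import torch.nn as nn

def mlp(in_dim, out_dim):
    """Three fully connected hidden layers of 64 units with tanh activations."""
    return nn.Sequential(
        nn.Linear(in_dim, 64), nn.Tanh(),
        nn.Linear(64, 64), nn.Tanh(),
        nn.Linear(64, 64), nn.Tanh(),
        nn.Linear(64, out_dim),
    )

obs_dim, act_dim = 40, 6             # placeholder sizes; see Tables 4.2 and 8.1
policy_net = mlp(obs_dim, act_dim)   # outputs the mean of the action distribution, pi(a_t | s_t)
value_net = mlp(obs_dim, 1)          # outputs the state-value estimate, V(s_t)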

Human Robotic Collaboration

The work within human-robot collaboration covered in Section 3.2 can roughly be categorized by the leader-follower division of work and by whether the control algorithm is exclusively reactive or also proactive. In these experiments, the choice of division of work and of reactive vs. proactive control is not made explicitly.

Since the experiment is designed to rely only on raw data and feedback in the form of a numerical reward, no choices regarding how the robot behaves or what role it takes are specified. Instead, the idea is for the policy to learn what behaviour best maximizes reward and thus performs well at the task. This might mean that the policy is reactive or proactive, but this is not specified before learning. However, by measuring, for example, mutual information in the resulting movement patterns, conclusions regarding reactivity can potentially be drawn after a policy has been learned.

4.3 Experimental Setup

Several experiments are carried out in order to investigate the research questions. These will be presented in the following section.

4.3.1 Common Details

The simulation is done using two agents: one robot controlled by RL and a collaborator controlled by a PD controller simulating the behaviour of a human. These will hereinafter be referred to as the agent and the collaborator, respectively. Further, robot refers to the physical representation of both agent and collaborator.

The experiments are carried out in trials, each consisting of a maximum of 400 time steps. Each trial terminates after 400 time steps or if the ball touches the railing of the table. Each trial begins with the robots in their initial positions and the ball placed randomly within a distance of 0.1 m from the center of the board. An imagined goal position for the ball is placed at the point on the circle of radius 0.13 m closest to the initial ball position. A random speed value is selected within the range of allowed ball speeds, and the goal position is then moved along the circle at that speed.
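A hedged sketch of this trial setup is given below: the ball starts within 0.1 m of the board center, the goal is initialized on a circle of radius 0.13 m at the point closest to the ball, and the goal is then advanced along the circle at the sampled speed. The function names, the uniform-over-disc sampling, the simulation time step and the interpretation of the speed as an arc-length speed are assumptions made for illustration.

import numpy as np

GOAL_RADIUS = 0.13   # m, radius of the circular goal path
START_RADIUS = 0.10  # m, maximum distance of the random ball start from the board center
DT = 0.02            # s, assumed simulation time step

def reset_trial(rng, speed_range=(-1.5, 1.5)):
    """Sample a start configuration: ball position, goal angle and signed goal speed."""
    r = START_RADIUS * np.sqrt(rng.uniform())           # uniform over a disc (assumed distribution)
    theta = rng.uniform(0.0, 2.0 * np.pi)
    ball = np.array([r * np.cos(theta), r * np.sin(theta)])
    goal_angle = np.arctan2(ball[1], ball[0])           # closest point on the goal circle
    speed = rng.uniform(*speed_range)                   # negative: clockwise, positive: counter-clockwise
    return ball, goal_angle, speed

def step_goal(goal_angle, speed):
    """Advance the goal along the circle and return its new xy position and angle."""
    goal_angle += speed * DT / GOAL_RADIUS              # arc-length speed -> angular increment
    goal_xy = GOAL_RADIUS * np.array([np.cos(goal_angle), np.sin(goal_angle)])
    return goal_xy, goal_angle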

The reward is evaluated in each time step with a maximum value of 1. The exact reward each time step is calculated using equation 4.5 where d(t) is the Euclidean distance between the desired ball position and the current ball position measured in meters.

r(t) = \frac{1}{1 + (10 \cdot d(t))^2}    (4.5)
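A direct translation of Equation 4.5 into code is shown below; it is a plain restatement of the formula, with the distance given in meters.

def reward(distance_m):
    """Reward of Equation 4.5: equals 1 when the ball is exactly at the goal position."""
    return 1.0 / (1.0 + (10.0 * distance_m) ** 2)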

Previous work within the area has mostly focused on action spaces of four DOF or fewer. To test whether it is possible to use more than four, up to six DOF are tested. Further, to assess how learning is affected by the number of DOF, all experiments are carried out with two to six DOF, unless stated otherwise. The dimensions available for control depending on the number of DOF can be found in Table 4.2.

DOF   X   Y   Z   Pitch   Yaw   Roll
2     -   -   x   -       -     x
3     -   x   x   -       -     x
4     x   x   x   -       -     x
5     x   x   x   x       -     x
6     x   x   x   x       x     x

Table 4.2: For each number of DOF, different dimensions are controllable. An x means that the dimension is controllable for an agent acting with the corresponding number of DOF.
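One way to realize Table 4.2 in code is to mask the full six-dimensional action [x, y, z, pitch, yaw, roll] according to the number of DOF, as sketched below. The mask entries are read directly from the table; the masking mechanism itself is an illustrative assumption about how the restriction could be implemented.

import numpy as np

# Controllable dimensions per number of DOF, in the order [x, y, z, pitch, yaw, roll] (Table 4.2).
DOF_MASKS = {
    2: [0, 0, 1, 0, 0, 1],
    3: [0, 1, 1, 0, 0, 1],
    4: [1, 1, 1, 0, 0, 1],
    5: [1, 1, 1, 1, 0, 1],
    6: [1, 1, 1, 1, 1, 1],
}

def mask_action(action, dof):
    """Zero out the dimensions the agent is not allowed to control."""
    return np.asarray(action, dtype=float) * np.array(DOF_MASKS[dof], dtype=float)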

The performance of the learned policies is evaluated over a number of different PD-controller parameter values, P ∈ [0.1, 1.25] and D ∈ [0.1, 1.1], and desired ball speeds v ∈ [-1.5, 1.5], to simulate how each policy generalizes to other human behaviours, ranging from skilled to unskilled.

Two different observation spaces are used. Both have access to all data regarding the ball and table, such as positions and velocities. The difference between the two is the amount of sensory data available to the agent. One observation space contains information from both robots and is used in the experiment named Balancing with More Information. The other observation space is used in the rest of the experiments and only contains sensory data from the robot the agent is operating through. The available data can be found in Table 8.1.

4.3.2 Collaborative Balancing with Human Partners of Varying Skill Level

To evaluate whether DRL can be used as a way of learning a Robotic Control Policy, an agent is trained with another robot controlled by a PD controller emulating the collaborating human, which we call the collaborator. As mentioned previously, different behaviours and performances can be achieved by varying the P and D parameters. A separate RL policy is learned by the robot for each of the conditions. These experiments are all performed using agents with two to six DOF.

One Collaborator

To evaluate whether an agent can learn anything useful at all, a single collaborative partner is used. The agent only interacts with a collaborator with P = 1.0 and D = 0.2 and a set desired ball speed of v = 1.0. Thus, this experiment essentially evaluates whether the algorithm can learn to collaborate with a single human.

Good Collaborators

To evaluate whether the agent can learn to collaborate with a number of different simulated humans, a range of parameter values and desired ball speeds is used: P ∈ [0.905, 1.135], D ∈ [0.15, 0.25] and a speed v ∈ [-1.50, 1.50]. The negative range is used to allow the collaborator to roll the ball both clockwise and counter-clockwise. The parameter values used are all within a range which performs well.

All Collaborators

To evaluate how the performance of the algorithm is affected by the range of collaborators it interacts with while training, the full range of test parameters is also used when training: P ∈ [0.1, 1.25], D ∈ [0.1, 1.1] and a speed range of v ∈ [-1.50, 1.50]. This essentially evaluates how well it can learn to collaborate with a large number of collaborators, as well as how it learns when working with both good and bad collaborators.

Bad Collaborator

To evaluate how the performance of the algorithm is affected by training with collaborators that cannot solve the task themselves, parameter values which perform poorly are used: P ∈ [0.1, 0.330], D ∈ [0.15, 0.25] and a speed range of v ∈ [-1.50, 1.50]. This essentially evaluates how well it can learn to collaborate with unskilled humans and improve the performance of a bad partner.

Balancing without Collaborator

By performing the task without any collaborator, the agent's ability to learn the task on its own can be evaluated. This can be used to study what impact the human has on the agent's learning and end performance. Therefore, the robot emulating the human is set to remain still, while the agent solves the task independently. This is also tested with two to six DOF.

4.3.3 Balancing with DRL Collaborator

To further test the agent's ability to tackle the task, the ability to balance the ball while controlling both robots is tested. Each robot arm operates with five DOF and a single agent controls both robot arms. The agent is given the sensor data from both robots.

4.3.4 Balancing with More Information

To test the impact of providing the robot with more data, the robot is provided with data from the human collaborator. This means all information about its position, velocity, etc. The PD values used are P = 1.0 and D = 0.2 to simulate a good collaborator, and the arm is set to five DOF.

Chapter 5

Experiments

The experimental findings section will contain a presentation of the results obtained from the experiments described above, combined with an analysis of these results.

5.1 Results

As stated above, the performance of the learned policies is evaluated over a number of parameter values, P ∈ [0.1, 1.25], D ∈ [0.1, 1.1] and v = 1.0, to see how they generalize to other human behaviours, ranging from skilled to unskilled. The performance is evaluated for ten different P and D values, selected such that the minimum and maximum values of the respective intervals are included together with eight evenly spaced values in between. Testing each combination of P and D values thus results in 100 evaluations. For each combination of P and D values, ten trials are performed, each with a unique starting position. However, these starting positions are not varied between parameter combinations.
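A hedged sketch of this evaluation grid is shown below: ten values each for P and D (the interval endpoints plus eight evenly spaced values in between), with ten trials per combination over a shared set of starting positions. The helper run_trial is hypothetical and stands in for running one simulated episode and returning its cumulative reward.

import numpy as np

P_VALUES = np.linspace(0.1, 1.25, 10)   # interval endpoints plus eight evenly spaced values
D_VALUES = np.linspace(0.1, 1.1, 10)

def evaluate(policy, run_trial, n_trials=10, seed=0):
    """Average return of a policy for every (P, D) combination in the evaluation grid."""
    rng = np.random.default_rng(seed)
    starts = [rng.uniform(-0.1, 0.1, size=2) for _ in range(n_trials)]  # placeholder starts, shared across combinations
    results = {}
    for p in P_VALUES:
        for d in D_VALUES:
            returns = [run_trial(policy, p, d, start) for start in starts]
            results[(p, d)] = float(np.mean(returns))   # average return per (P, D) cell
    return results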

5.1.1 Analysis

The experiments in this section include a number of interesting results. In the following sections, a number of their implications will be discussed.


Figure 5.1: Distribution plot of the reward density over the reward spectrum. The green distribution represents the performance of the PD controller performing the task by itself; each rug (marking at the bottom of each subplot) represents a data point with a unique set of P and D values. The red distribution is based on the performance of the PPO policy performing the task individually without any collaborator. Each subplot represents the number of DOF the PPO algorithm controls.

Figure 5.2: The average reward achieved over the number of time steps. The PPO policy is trained without a PD collaborator. Each line represents the number of DOF the agent controls.

5.1.2 Performance of Human Collaborator Acting Alone

The performance of the PD controller emulating a human is used as a baseline. As seen in the distribution plots, faintly in Figure 5.1 and more clearly in Figure 5.3, the performance of the emulated human is in the range R ∈ [80, 320]. The wide range of rewards can be attributed to the variety of P and D values used to mimic differences in performance between humans.

5.1.3 Performance of Robot Collaborator Acting Alone

To assess whether the problem at hand could be tackled with a DRL algorithm at all, the complexity of the problem was reduced by keeping the human collaborator static at its initial position. Thus, the situation mirrors that of the human collaborator acting alone. The results of the experiments can be seen in Figure 5.1. For each dimension, the range of results is clearly more concentrated than in the case of the human acting alone. This is expected, since the variety of human behaviours introduced in the PD controller is not present in the PPO policy.

The reward achieved by the PPO policy acting alone varies greatly between the numbers of DOF the agent is allowed to control. However, by studying the graph in Figure 5.2 one can see that, regardless of the number of DOF, the agent's average reward increases with the number of time steps. This indicates that PPO is capable of learning a Robotic Control Policy which improves with experience.

Figure 5.3: See Figure 5.1's caption for common details. The red distribution is based on the performance of the PD controller collaborating with the PPO. These PPO policies were learned using P = 1.0, D = 0.2 and v = 1.0.

Figure 5.4: The average reward achieved over the number of time steps. The PPO policy is trained with a PD collaborator with P = 1.0, D = 0.2 and v = 1.0 and evaluated over the same parameters as well as a standardized start position. Each line represents the number of DOF the agent controls.

The average performance of each agent seems to increase with the number of DOF it controls, except for six DOF. However, the six-DOF agent does not seem to have plateaued; thus, it is still possible that the average reward would increase if it were allowed to train longer. Except for the two- and three-DOF agents, the average reward is higher than that of the average emulated human acting alone. Both the four- and five-DOF agents' policies perform on par with or even better than the best emulated human acting alone, reaching rewards in the range r ∈ [330, 335]. These are encouraging results, indicating that a DRL policy can outperform established hand-crafted control methods.

Figure 5.5: The average return of the collaborators over the P and D parameter values (left). The average relative return of the collaborators compared to the performance of the PD controller acting alone using the corresponding P and D values (right). The individual heatmaps represent the number of DOF the agent can control, from 2 DOF in the top row to 6 DOF in the bottom row. The allowed ranges of the PD collaborator's parameter settings are represented by the small green square, being P = 1.0, D = 0.2 and v = 1.0.

Figure 5.6: See Figure 5.1's caption for common details. The red distribution is based on the performance of the PD controller collaborating with the PPO. Each subplot represents the number of DOF the PPO algorithm controls. These PPO policies were learned using P ∈ [0.905, 1.135], D ∈ [0.15, 0.25] and v ∈ [-1.50, 1.50].

Figure 5.7: See Figure 5.4's caption for common details. The PPO policy is trained with a PD collaborator with P ∈ [0.905, 1.135], D ∈ [0.15, 0.25] and v ∈ [-1.50, 1.50] and evaluated over the same parameters as well as a standardized start position.

5.1.4 Performance of Human-Robot Collaborator

The main goal of the thesis is to evaluate whether DRL algorithms can be used to train a Robotic Control Policy capable of collaborating with a human. Following the results indicating that the robot was able to balance the ball alone relying on the PPO algorithm, experiments evaluating its applicability in the collaborative setting were carried out. Several experiments were performed to investigate how the emulated human's skill level, the diversity of skill levels and other factors affect the performance of the final policy.

The variety of behaviours that can be mimicked by varying P, D and the goal speed v has a significant impact on the performance of the human-imitating robot. Some combinations produce results close to the maximum achievable, while others end up terminating the episode prematurely with a cumulative reward lower than that achieved by keeping the table still. Consequently, the parameter settings used during training can have a range of effects on learning, ranging from guiding to sabotaging behaviour.

Figure 5.8: See Figure 5.5's caption for common details. The average return of the collaborators over the P and D parameter values (left). The average relative return of the collaborators compared to the performance of the PD controller acting alone using the corresponding P and D values (right). The allowed ranges of the PD collaborator's parameter settings are represented by the small green square, being P ∈ [0.905, 1.135], D ∈ [0.15, 0.25] and v ∈ [-1.50, 1.50].

A Single Good Collaborator

The results seen in Figures 5.3, 5.4 and 5.5 are generated by training the agent with a single good collaborator with the same goal velocity at all times, v = 1.0. The single good performer reduces the complexity of the problem, since the agent only has to learn to collaborate with a single collaborator, whose behaviour is more predictable than that of a wide range of partners.

As seen in Figure 5.4, the agent is able to learn how to improve the cumulative reward; however, learning is very sensitive and the empirically evaluated cumulative reward oscillates over time. Thus, the time step at which the policy is evaluated might have a significant effect on the evaluation results seen in Figures 5.3 and 5.5. However, the general trend of higher DOF outperforming lower DOF seems to be consistent, although oscillating.

The policies evaluated after 1,000,000 time steps exhibit a wide variety of performances. The two- and three-DOF agents perform worse than the human acting alone, while the four-DOF agent concentrates the performance around the average, decreasing the best scores while also limiting the very poor results. The five- and six-DOF agents have successfully pushed the performance of all P and D combinations above the average of the baseline. Further, the six-DOF agent also outperforms both the maximum score achieved in the baseline and the performance of the single robot acting alone. This indicates that human-robot collaboration can be used to outperform individual performance.

Another interesting phenomenon can be seen in Figure 5.5 (right). The figure displays the relative reward achieved compared to the baseline. A white square indicates a score equal to that of the baseline, a blue square an improvement and a red square a decrease in score. The six-DOF agent performs on par with the baseline or better for each combination of P and D. This suggests that, even though it was only trained with one skilled collaborator, it is able to improve any other human's score although the behaviour was previously unknown to the agent.

Multiple Good Collaborators

To evaluate how the agent learns in collaboration with a number of collaborators who all exhibit good results but with a slightly wider range of behaviours, a larger range of P, D and v was used. While a slightly wider range of P and D was used in this experiment, the major difference is the range of goal velocities used while training.

The results obtained from this experiment can be seen in Figures 5.6, 5.7 and 5.8. The distribution plot seen in Figure 5.6 exhibits a similar trend as in Figure 5.3, although slightly more focused around the mean of R = 200 (except for the six-DOF agent). Further, this concentration of values can also be detected in Figure 5.8 (right): the heat plot indicates that the collaboration has increased the performance of poorly performing collaborators, while decreasing that of the best-performing ones.

The agent in control of six DOF stands out from the others, as it both achieves the highest maximum reward, outperforming the maximum reward achieved by the baseline, and has the widest range of rewards, R ∈ [130, 335]. Although the minimum score r = 130 is much lower than that of the five-DOF agent (r = 180), it is actually the only agent that improves, or at least does not decrease, the performance of every P and D combination, as seen in Figure 5.8 (right). This indicates that the collaboration helps the overall score of the human collaborator regardless of skill level.

Figure 5.9: See Figure 5.1's caption for common details. The red distribution is based on the performance of the PD controller collaborating with the PPO. Each subplot represents the number of DOF the PPO algorithm controls. These PPO policies were learned using P ∈ [0.1, 1.25], D ∈ [0.1, 1.1] and v ∈ [-1.50, 1.50].

Figure 5.10: See Figure 5.4's caption for common details. The PPO policy is trained with a PD collaborator with P ∈ [0.1, 1.25], D ∈ [0.1, 1.1] and v ∈ [-1.50, 1.50] and evaluated over the same parameters as well as a standardized start position.

Figure 5.11: See Figure 5.4's caption for common details. The PPO policy is trained with a PD collaborator with P ∈ [0.1, 0.330], D ∈ [0.15, 0.25] and v ∈ [-1.50, 1.50] and evaluated over the same parameters as well as a standardized start position.

Figure 5.12: See Figure 5.5's caption for common details. The average return of the collaborators over the P and D parameter values (left). The average relative return of the collaborators compared to the performance of the PD controller acting alone using the corresponding P and D values (right). The allowed ranges of the PD collaborator's parameter settings are represented by the green square, being P ∈ [0.1, 1.25], D ∈ [0.1, 1.1] and v ∈ [-1.50, 1.50].

Figure 5.13: See Figure 5.1's caption for common details. The red distribution is based on the performance of the PD controller collaborating with the PPO. Each subplot represents the number of DOF the PPO algorithm controls. These PPO policies were learned using P ∈ [0.1, 0.330], D ∈ [0.15, 0.25] and v ∈ [-1.50, 1.50].

Figure 5.14: See Figure 5.5's caption for common details. The average return of the collaborators over the P and D parameter values (left). The average relative return of the collaborators compared to the performance of the PD controller acting alone using the corresponding P and D values (right). The allowed ranges of the PD collaborator's parameter settings are represented by the green square, being P ∈ [0.1, 0.330], D ∈ [0.15, 0.25] and v ∈ [-1.50, 1.50].

All Collaborators

To evaluate how the performance of the agents is impacted when training with the same parameter settings as when testing, the full range of P, D and v was used. The results obtained from this experiment can be seen in Figures 5.9, 5.10 and 5.12. Contrary to the two previously presented experiments, these results do not show as clear a trend between reward and DOF.

However, another interesting phenomenon arises: all of the distributions exhibit some degree of bimodality, as seen in Figure 5.9. This is particularly prevalent for agents controlling three, four and six DOF. This tendency can also be seen in Figure 5.12 and seems to be related to the D parameter value. Especially in the case of six DOF, two zones seem to have formed, one in D ∈ [0.1, 0.5] and another in D ∈ [0.6, 1.0]. This divides the spectrum of parameter values into one well-performing and one poorly performing zone. It suggests that the agent is able to compensate for a poorly performing collaborator to some extent, and even enhance their collaborative performance over a large range of collaborators. However, when the collaborator performs too poorly, the policy cannot improve the performance to the same extent. Worth noting is that the six-DOF policy, in collaboration with any collaborator, seems to approximately maintain or improve the reward achieved by the collaborator alone.

Another factor that should be accounted for is that the range of parameters over which the agents were trained in this experiment is much larger than in the previous experiments. Thus, only a few learning samples are gathered for each combination of parameters.

Bad Collaborator

To evaluate the agent's ability to learn when training only with poorly performing collaborators, a range of P and D values in which the collaborators perform poorly is used. The results obtained from this experiment can be seen in Figures 5.13, 5.11 and 5.14.

The results of this experiment clearly show that a poorly chosen training partner has a negative effect on the end performance. The performance of some of the most poorly performing collaborators is improved, as seen in Figure 5.14 (right), in the region that overlaps with the range of training parameters used. However, this is not unique to the agents trained with poorly performing collaborators; rather, it is a trend that can be seen in the other experiments as well, see Figures 5.5 (right), 5.8 (right) and 5.12 (right).

5.1.5 Performance of Robot-Robot Collaborator

The size of the action space the agent controls and operates in can have a significant effect on performance; generally, a large action space has proven challenging for a number of reasons. To test the PPO algorithm's ability to learn in a larger action space, a single PPO-controlled agent is put in charge of both robot arms working to solve the task. Each arm operates with five DOF, resulting in a ten-dimensional action space.

Not only was this experiment carried out in a larger action space; the observation space was also expanded to include all data available from both robots, contrary to the previous experiments in which the observation space did not include any data from the collaborator.

The results from this experiment can be seen in Figures 5.15 and 5.16. The distribution plot can be compared to that of the DRL agent balancing alone in Figure 5.1. One can see that the two robots working together, both controlled by the PPO-based agent, perform worse than the agent acting alone with four or five DOF. While the added number of DOF offers more physical control for the agent, it also adds another level of complexity: the action space is doubled compared to the one controlled by the five-DOF agent operating alone.

Although the agent performs worse than the five-DOF agent acting alone, one can clearly see in Figure 5.16 that the agent is able to learn to improve the cumulative reward. The average reward is clearly above the average achieved by the baseline.

Figure 5.15: See Figure 5.1's caption for common details. The red distribution is based on the performance of two robots collaborating, both being controlled by a single PPO-trained policy network.

Figure 5.16: The average reward achieved over the number of time steps. The PPO policy is trained without a PD collaborator; instead, the agent controls both robots.

5.1.6 Performance of Human-Robot Collaborator with More Information

To evaluate how the resulting policy is affected by the size of the observation space and the type of data available to the agent, the observation space was expanded to include the data from both robots. The expanded observation is hypothesized to improve performance by providing more relevant information, so that more relevant data is available for the agent to learn from and base its policy on. However, the expanded observation might also prove a challenge, since the larger observation space is more complex to learn from: the policy needs to learn a larger number of parameters and which data is relevant.

Figure 5.17: See Figure 5.1's caption for common details. The red distribution is based on the performance of the PD controller collaborating with the PPO. The observation space is larger in this experimental setting than in the experiments presented above. This PPO policy was learned using P ∈ [0.905, 1.135], D ∈ [0.15, 0.25] and v ∈ [-1.50, 1.50].

The results obtained from this experiment can be found in Figures 5.17 and 5.18. The experimental results show that the agent is able to learn; however, no major improvement of the reward is achieved. Several other experiments with varying amounts of added or removed data in the observation space were also performed. These experiments yielded no interesting information and have therefore been omitted.

Figure 5.18: The average reward achieved over the number of time steps. The observation space is larger in this experimental setting than in the experiment presented above. This PPO policy was learned using P ∈ [0.905, 1.135], D ∈ [0.15, 0.25] and v ∈ [-1.50, 1.50].

Chapter 6

Discussion

In this chapter, the results will be discussed, both in general and in regard to the research questions.

6.1 Conclusions

A number of conclusions can be drawn based on the experimental results.

• A DRL algorithm can be used to learn a robotic control policy which performs the complex balancing task.

• A DRL algorithm can be used to learn a policy that productively collaborates with an emulated human.

• A DRL-based policy can be used to productively collaborate with previously unknown collaborators, while only being trained with one collaborator.

The results suggest that this balancing problem can be handled by a DRL agent alone, collaboratively with an emulated human, as well as together with partners exhibiting previously unseen behaviour.

The first conclusion is drawn based on the results seen in Figure 5.1. The distribution plot shows that the five-DOF agent performs at a level equal to or better than the best of the emulated humans.


The second and third conclusions are drawn based on the results seen in Figures 5.8 (right) and 5.12 (right). Both experiments are based on agents operating with six DOF. The agents are trained with a very small range of collaborators, illustrated by the small green boxes, which encompass the range of parameter values used when training. The plots depict the relative difference in reward for each parameter combination, compared to the baseline. In both these plots the color range is between light and dark blue, indicating that the score is improved compared to the baseline for all parameter combinations, even the combinations that are not in the range used when training. These results indicate that the agents can learn to productively collaborate with an emulated human as well as generalize this ability to previously unknown collaborators.

6.1.1 Research Questions

The goal of this thesis is to explore the possibility of using DRL to learn a useful Robotic Control Policy which can collaboratively solve a task with a human. Based on the results, the conclusion that DRL can be used for the task has been reached. However, this conclusion is based on experiments which include a number of assumptions. Depending on the validity of these assumptions, the conclusions drawn might also be applicable to the case of a real physical robot.

The validity of the most central assumptions will be analyzed below. These have been identified as the reliability of the simulation model, the resemblance of the simulated human movement to actual human movement and to what extent the observational data available in the simulator can be realized on a physical robot.

Authenticity of Simulation

All agents in the simulator operate on the Fetch robot, which is an existing physical robot. The same MuJoCo model of the robot has previously been used to successfully transfer a policy learned in the simulator to an identical physical system [2]. This suggests that this policy also has a high likelihood of being transferable to a physical system, and that the simulation is realistic enough for the task.

Resemblance between PD Controller and Human Movement

Since these experiments are carried out in simulation, human involvement has not been used; instead, a PD controller has been used to simulate human movement. The validity of this assumption can be questioned. One might argue that no conclusion about the accuracy of this assumption can be made unless it is tested on a physical robot. However, PD controllers are based on feedback control loops which emulate human control strategies. This indicates that while the PD controller might not perfectly emulate a human collaborator, the similarity between control strategies might be enough.

Sensor Data

All the data in the observation space was chosen such that it would be possible to attain on a physical system. The sensor data used is primarily related to the robot's movement or that of the ball and table. The data related to the robot can be directly accessed from the Fetch robot with high accuracy. The data related to the table and ball, however, requires other methods to be accessible. Although this is not as straightforward as for the robot-related data, methods used in other similar experiments should be able to provide the data, although it might be noisier and include delays that need to be predicted. This is not certain and has to be tested to conclusively tell whether this is the case.

6.1.2 Unanswered Questions

One goal of this thesis was for the DRL agent's interactions to be perceived as natural by the human collaborator. However, what can be considered natural might vary from human to human and is hard to measure quantitatively, especially since no human has actively interacted with the agents. Nevertheless, a component of natural HRC is goal-oriented cognition, essentially inducing a feeling in the human collaborator that the robot is working towards the same goal. This has previously been done by encoding the goal of the human in the reward function when framing the RL problem. These ideas were also used in this thesis to induce a partially natural feel, by basing the reward on the distance error relative to the human's goal.

The last goal was to evaluate the impact of different sensor modalities on the performance of the agents. Several experiments were carried out to evaluate this, however without success, and they were therefore omitted from this report. The only experiments testing the impact of modalities that produced meaningful results were those seen in Figures 5.17 and 5.18. However, these results were inconclusive, and although a direct comparison can be made, more thorough experiments need to be carried out before conclusions can be drawn. This will be discussed in the following section.

6.1.3 Limitations and Improvements

Several of the experiments exhibit signs of certain trends, for example how one parameter setting seems to outperform another or how the number of DOF affects the expected cumulative reward. However, while the experiments seem to show that these connections exist, the experiments need to be repeated several times so that the results can be shown to be statistically significant.

Beyond the need to repeat the experiments, a number of improvements could be made to perfect the performance of the agent. Among these are the design and hyper-parameters chosen for the DNNs used as function approximators, especially performing thorough hyper-parameter searches, adaptive learning and early stopping, such that the best possible function approximators are used for each experiment and evaluated when performing at their peak.

Another area in which the experiments could be improved is by introducing the possibility of evaluating how natural the movement pattern learned by the robot is. Although the best way to do this is by asking humans who have actively collaborated with an agent, other methods, such as comparing the movement pattern to models of human movement like the minimum-jerk model, could provide a way of quantifying the naturalness of an interaction.

Chapter 7

Conclusion

In this work the possibility of using DRL for human robotic collaboration was evaluated. A collaborative task was formulated in which a human and a robot were jointly tasked with balancing a ball on top of a table along a predefined path. The experiments were carried out in simulation, in which the human was represented by a robot and its movement pattern emulated using a PD controller. The robot was trained over a million time steps using the PPO algorithm. The training was carried out both with and without collaborators of varying skill level.

The results indicate that the policies learned with the PPO algorithm can perform the task collaboratively better than any of the collaborators can alone. Further, the results indicate that agents trained with only one collaborator can generalize and improve the performance of a wide range of collaborators with varying movement patterns.

While the results are promising for the application of DRL to Robotic Control Policies for Human Robotic Collaboration problems, a number of assumptions are made. The validity of these assumptions limits the conclusions that can be drawn regarding the use of the policies on physical systems.

Bibliography

[1] Don Joven Agravante et al. “Collaborative human-humanoid carrying using vision and haptic sensing”. In: Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014, pp. 607–612.
[2] Marcin Andrychowicz et al. “Hindsight Experience Replay”. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA. 2017, pp. 5055–5065. URL: http://papers.nips.cc/paper/7090-hindsight-experience-replay.
[3] Greg Brockman et al. “OpenAI Gym”. In: CoRR abs/1606.01540 (2016). arXiv: 1606.01540. URL: http://arxiv.org/abs/1606.01540.
[4] Antoine Bussy et al. “Proactive behavior of a humanoid robot in a haptic transportation task with a human partner”. In: RO-MAN, 2012 IEEE. IEEE, 2012, pp. 962–967.
[5] Alexandre Campeau-Lecours, Martin J-D Otis, and Clément Gosselin. “Modeling of physical human–robot interaction: Admittance controllers applied to intelligent assist devices with large payload”. In: International Journal of Advanced Robotic Systems 13.5 (2016), p. 1729881416658167. DOI: 10.1177/1729881416658167.
[6] Prafulla Dhariwal et al. OpenAI Baselines. https://github.com/openai/baselines. 2017.
[7] P. Evrard and A. Kheddar. “Homotopy switching model for dyad haptic interaction in physical collaborative tasks”. In: World Haptics 2009 - Third Joint EuroHaptics Conference and Symposium on Haptic Interfaces for Virtual Environment and Teleoperator Systems. Mar. 2009, pp. 45–50. DOI: 10.1109/WHC.2009.4810879.


[8] Chelsea Finn et al. “Deep spatial autoencoders for visuomotor learning”. In: 2016 IEEE International Conference on Robotics and Automation, ICRA 2016, Stockholm, Sweden, May 16-21, 2016. 2016, pp. 512–519. DOI: 10.1109/ICRA.2016.7487173.
[9] Ali Ghadirzadeh et al. “A sensorimotor reinforcement learning framework for physical Human-Robot Interaction”. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2016, Daejeon, South Korea, October 9-14, 2016. 2016, pp. 2682–2688. DOI: 10.1109/IROS.2016.7759417.
[10] Ali Ghadirzadeh et al. “Deep predictive policy training using reinforcement learning”. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2017, Vancouver, BC, Canada, September 24-28, 2017. 2017, pp. 2351–2358. DOI: 10.1109/IROS.2017.8206046.
[11] Elena Gribovskaya, Abderrahmane Kheddar, and Aude Billard. “Motion learning and adaptive impedance for robot control during physical interaction with humans”. In: IEEE International Conference on Robotics and Automation, ICRA 2011, Shanghai, China, 9-13 May 2011. 2011, pp. 4326–4332. DOI: 10.1109/ICRA.2011.5980070.
[12] Shixiang Gu et al. “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates”. In: Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 3389–3396.
[13] Herke van Hoof et al. “Stable reinforcement learning with autoencoders for tactile and visual data”. In: Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on. IEEE, 2016, pp. 3928–3934.
[14] Dan Horgan et al. “Distributed Prioritized Experience Replay”. In: International Conference on Learning Representations. 2018. URL: https://openreview.net/forum?id=H1Dy---0Z.

[15] Stephen James, Andrew J. Davison, and Edward Johns. “Transferring End-to-End Visuomotor Control from Simulation to Real World for a Multi-Stage Task”. In: 1st Annual Conference on Robot Learning, CoRL 2017, Mountain View, California, USA, November 13-15, 2017, Proceedings. 2017, pp. 334–343. URL: http://proceedings.mlr.press/v78/james17a.html.
[16] Jens Kober and Jan R. Peters. “Policy Search for Motor Primitives in Robotics”. In: Advances in Neural Information Processing Systems 21. Ed. by D. Koller et al. Curran Associates, Inc., 2009, pp. 849–856. URL: http://papers.nips.cc/paper/3545-policy-search-for-motor-primitives-in-robotics.pdf.
[17] K. Kosuge, H. Yoshida, and T. Fukuda. “Dynamic control for robot-human collaboration”. In: Robot and Human Communication, 1993. Proceedings., 2nd IEEE International Workshop on. IEEE, 1993, pp. 398–401.
[18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks”. In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.
[19] Jörg Krüger, Terje K. Lien, and Alexander Verl. “Cooperation of human and machines in assembly lines”. In: CIRP Annals 58.2 (2009), pp. 628–646.
[20] Sergey Levine and Vladlen Koltun. “Guided policy search”. In: International Conference on Machine Learning. 2013, pp. 1–9.
[21] Sergey Levine et al. “End-to-End Training of Deep Visuomotor Policies”. In: CoRR abs/1504.00702 (2015). arXiv: 1504.00702. URL: http://arxiv.org/abs/1504.00702.
[22] Sergey Levine et al. “Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection”. In: CoRR abs/1603.02199 (2016). arXiv: 1603.02199. URL: http://arxiv.org/abs/1603.02199.
[23] Timothy P. Lillicrap et al. “Continuous control with deep reinforcement learning”. In: CoRR abs/1509.02971 (2015). arXiv: 1509.02971. URL: http://arxiv.org/abs/1509.02971.

[24] Yusuke Maeda, Takayuki Hara, and Tamio Arai. “Human-robot cooperative manipulation with motion estimation”. In: Intelligent Robots and Systems, 2001. Proceedings. 2001 IEEE/RSJ International Conference on. Vol. 4. IEEE, 2001, pp. 2240–2245.
[25] Volodymyr Mnih et al. “Asynchronous methods for deep reinforcement learning”. In: International Conference on Machine Learning. 2016, pp. 1928–1937.
[26] Volodymyr Mnih et al. “Human-level control through deep reinforcement learning”. In: Nature 518.7540 (2015), p. 529.
[27] Ashvin Nair et al. “Overcoming Exploration in Reinforcement Learning with Demonstrations”. In: CoRR abs/1709.10089 (2017). arXiv: 1709.10089. URL: http://arxiv.org/abs/1709.10089.
[28] Matthias Plappert et al. “Multi-Goal Reinforcement Learning: Challenging Robotics Environments and Request for Research”. In: CoRR abs/1802.09464 (2018). arXiv: 1802.09464. URL: http://arxiv.org/abs/1802.09464.
[29] John Schulman et al. “Proximal Policy Optimization Algorithms”. In: CoRR abs/1707.06347 (2017). arXiv: 1707.06347. URL: http://arxiv.org/abs/1707.06347.
[30] John Schulman et al. “Trust Region Policy Optimization”. In: CoRR abs/1502.05477 (2015). arXiv: 1502.05477. URL: http://arxiv.org/abs/1502.05477.
[31] David Silver et al. “Deterministic policy gradient algorithms”. In: ICML. 2014.
[32] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. 1st. Cambridge, MA, USA: MIT Press, 1998. ISBN: 0262193981.
[33] Josh Tobin et al. “Domain randomization for transferring deep neural networks from simulation to the real world”. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2017, Vancouver, BC, Canada, September 24-28, 2017. 2017, pp. 23–30. DOI: 10.1109/IROS.2017.8202133.

[34] E. Todorov, T. Erez, and Y. Tassa. “MuJoCo: A physics engine for model-based control”. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. Oct. 2012, pp. 5026–5033. DOI: 10.1109/IROS.2012.6386109.
[35] Astrid Weiss, Daniela Wurhofer, and Manfred Tscheligi. ““I love this dog”—children's emotional attachment to the robotic dog AIBO”. In: International Journal of Social Robotics 1.3 (2009), pp. 243–248.
[36] Bryan Whitsell and Panagiotis Artemiadis. “Physical human–robot interaction (pHRI) in 6 DOF with asymmetric cooperation”. In: IEEE Access 5 (2017), pp. 10834–10845.
[37] Ronald J. Williams and Jing Peng. “Function optimization using connectionist reinforcement learning algorithms”. In: Connection Science 3.3 (1991), pp. 241–268.
[38] Melonee Wise et al. “Fetch and Freight: Standard platforms for service robot applications”. In: Workshop on Autonomous Mobile Service Robots. 2016.
[39] Tytus Wojtara et al. “Human-robot collaboration in precise positioning of a three-dimensional object”. In: Automatica 45.2 (2009), pp. 333–342. DOI: 10.1016/j.automatica.2008.08.021.
[40] Fangyi Zhang et al. “Towards Vision-Based Deep Reinforcement Learning for Robotic Motion Control”. In: CoRR abs/1511.03791 (2015). arXiv: 1511.03791. URL: http://arxiv.org/abs/1511.03791.

Chapter 8

Appendix


Name                      | Site                                       | Type                  | Unit  | Description
Goal Position             | NA                                         | 2-D Position          | m     | Desired position relative to the board's coordinate system
Ball Position             | Center of Ball                             | 2-D Position          | m     | Position relative to the board's coordinate system
Ball Velocity             | Center of Ball                             | 2-D Velocity          | m/s   | The ball's velocity relative to the board's coordinate system
Ball Distance             | Center of Ball                             | 1-D Distance          | m     | Distance to the center of the board
Board Position            | Center of Board                            | 3-D Position          | m     | Relative to the world coordinates
Board Velocity            | Center of Board                            | 3-D Velocity          | m/s   | The board's velocity relative to the world coordinates
Board Rotation            | Center of Board                            | 3-D Rotation          | rad   | Rotation of the board in world coordinates
Board Rotational Velocity | Center of Board                            | 3-D Angular Velocity  | rad/s | Angular velocity relative to world coordinates
Joint Rotation            | Robot/Human Joint Positions                | 15-D Rotation         | rad   | Rotation of each joint relative to its initial rotation
Joint Rotation Velocity   | Robot/Human Joint Positions                | 15-D Angular Velocity | rad/s | Angular velocity at each joint
Grip Position             | Center of Robot/Human Gripper              | 3-D Position          | m     | Relative to world coordinates
Grip Velocity             | Center of Robot/Human Gripper              | 3-D Velocity          | m/s   | Relative to world coordinates
Force Sensor Human        | Right hand side of human robot control bar | 3-D Force             | N     | Force in each xyz-direction
Force Sensor Robot        | Right hand side of robot control bar       | 3-D Force             | N     | Force in each xyz-direction
Torque Sensor Human       | Right hand side of human robot control bar | 3-D Torque            | Nm    | Torque at sensor position in local coordinates
Torque Sensor Robot       | Right hand side of robot control bar       | 3-D Torque            | Nm    | Torque at sensor position in local coordinates

Table 8.1: Environment Sensor and Data Points

TRITA-EECS-EX-2019:85

www.kth.se