UPTEC F 18008, Degree Project 30 credits (Examensarbete 30 hp), April 2018

A Deep Framework where Agents Learn a Basic form of Social Movement

Erik Ekstedt

Abstract: A Deep Reinforcement Learning Framework where Agents Learn a Basic Form of Social Movement

For social robots to move and behave appropriately in dynamic and complex social contexts they need to be flexible in their movement behaviors. The natural complexity of social interaction makes this a difficult property to encode programmatically. Instead of programming these algorithms by hand it could be preferable to have the system learn these behaviors. In this project a framework is created in which an agent, through deep reinforcement learning, can learn how to mimic poses, here defined as the most basic case of social movements. The framework aimed to be as agent agnostic as possible and suitable for both real-life robots and virtual agents through an approach called "dancer in the mirror". The framework utilized a learning algorithm called PPO and trained agents, as a proof of concept, both in a virtual environment for the humanoid robot Pepper and for virtual agents in a physics simulation environment. The framework was meant to be a simple starting point that could be extended to incorporate more and more complex tasks. This project shows that the framework is functional for agents learning to mimic poses in a simplified environment.

Handledare (Supervisor): Alex Yuan Gao
Ämnesgranskare (Subject reader): Ginevra Castellano
Examinator (Examiner): Tomas Nyberg
ISSN: 1401-5757, UPTEC F 18008

Popular Science Summary (Populärvetenskaplig sammanfattning)

Humanity is becoming increasingly dependent on technology, and development is moving faster than ever before. A decade ago the first smartphone was introduced, and platforms such as Facebook and Youtube appeared and changed society forever. Given how quickly technology evolves, it is entirely possible that we will soon live in a society where social robots feel as natural as our smartphones do today: robots that can help us with everything from companionship to healthcare, emergency services and education.

When people interact and communicate in everyday life, that is, when we meet in person, we make extensive use of gestures and movements. We move in different ways depending on which social circle we are in or what kind of social situation we are part of. Watching someone greet their best friends at a party is very different from watching someone leave a funeral. We use our body language to clarify what we mean, and we can judge other people's moods by analyzing their posture and the way they carry themselves. If social robots are to be a natural part of society and interact and communicate with us humans, it would be advantageous if they had similar abilities. Social robots should be able to move in a natural way that adds something to the social interaction and makes people feel calm and safe. Their behavior should change depending on how others in the social context behave. Social situations are dynamic in nature, which makes it difficult to program in advance the exact knowledge required to move in a way that humans find convincing. Instead of deciding how a robot should behave and programming in different types of movements, it would be better if the robot learned this by itself.

In recent years deep learning, a subfield of machine learning that uses neural networks, has shown great progress in many different areas. AI is a pop-cultural concept and gets a lot of attention in the media. It can be news about anything from self-driving cars and personal assistants to cancer-diagnosing systems, and in most of these cases deep learning and neural networks are the underlying technology. Neural networks have existed since the 1940s, but it is only in recent years that they have become mainstream. It is only today, when enough computing power is available to enough people, that these neural networks have been able to deliver the results we now see are possible. These types of programs are now standard in everything from sound and image recognition to translating text between different languages. This is also the technology behind the programs that are now better than humans at games such as Go, Atari and chess. These programs have learned to play these games through a technique called reinforcement learning. This technique is about learning behavior in a way similar to how animals and humans learn.

Within reinforcement learning, terms such as agent, environment and reward are used. An agent interacts with its environment, where different actions give different rewards depending on how good the action was. The agent then tries a large number of different actions and, after a certain amount of training, learns what is best to do and what should be avoided. This is general, and the behaviors the agent learns depend on the environment, the reward and the learning algorithm. Different environments with different reward systems give rise to agents that are good at different things.

In this project an environment with an associated reward system is created in order for an agent to learn to mimic another agent's pose. Mimicking another agent's pose is assumed in this project to be the most elementary form of social movement. The plan is then to build on this and introduce more and more complex tasks. In addition to the environment, a recent optimization algorithm, abbreviated PPO, was used to optimize the neural networks created to solve the task. In this implementation it is important that the environment is general enough to train both entirely fictional virtual characters and real robots such as the humanoid robot Pepper from Softbank Robotics. The project implemented one environment based on the program Choregraphe, in which Pepper can be controlled, and one environment based on the non-profit company OpenAI's Roboschool environments, built on the physics simulation program Bullet. What the different environments have in common is the way agents in the environment learn the act of mimicking another agent's pose. Once the environments were functional, some smaller experiments were carried out to see whether the algorithm, the environment, the reward system and the neural networks could be shown to handle the task of mimicking another agent's pose. The results from these smaller experiments show that it is possible to mimic poses in this way, in a simplified environment, but that more work is needed to make the environments more complex and relevant for realistic situations.

TABLE OF CONTENTS

1 Introduction
  1.1 Setup
  1.2 Dancer in the Mirror Approach
  1.3 Research Questions

2 Background
  2.1 Machine Learning
  2.2 Artificial Neural Networks
  2.3 Activation Functions
    2.3.1 Sigmoidal Activation Function
    2.3.2 ReLu
  2.4 Backpropagation
    2.4.1 Stochastic Gradient Descent
    2.4.2 Adam
  2.5 Architectures
    2.5.1 Convolutional Neural Network
    2.5.2 Recurrent Neural Networks
    2.5.3 Hyperparameters
  2.6 Reinforcement Learning
    2.6.1 Value iteration
    2.6.2 Policy Optimization
    2.6.3 Actor-Critic Methods
    2.6.4 Proximal Policy Optimization
    2.6.5 Exploration vs Exploitation
  2.7 Pepper
    2.7.1 Choregraphe
  2.8 OpenAI's Gym
    2.8.1 Roboschool

3 Method
  3.1 Learning Algorithm
  3.2 Pepper Environment
  3.3 Custom Roboschool Environment
  3.4 Reward Function
  3.5 Networks
    3.5.1 Modular Approach
    3.5.2 Semi Modular Approach
    3.5.3 Combined Approach
  3.6 Experiment
  3.7 Custom Reacher Experiments
  3.8 Custom Humanoid Experiments
  3.9 Pepper Experiments

4 Results
  4.1 Reward Function
  4.2 Experiment
    4.2.1 Reacher Environment
    4.2.2 Humanoid Environment
  4.3 Pose Evaluation
  4.4 Pepper
  4.5 Code

5 Discussion and Future Work
  5.1 Custom Roboschool Environment
  5.2 Pepper
  5.3 Project
  5.4 Future Work

6 Conclusion

1 Introduction

This project aims to construct a framework for training agents to learn a basic form of social movement through end-to-end deep reinforcement learning. In human social interactions individuals convey a lot of information through the movements of different body parts. We perform many detailed movements in the facial area and in the use of our arms, hands and overall posing. There is a wide variety of movements humans use when we engage in social interactions, and they range from fully conscious and explicit in their meaning all the way to movements that we perform unconsciously and are not aware of. We use the movement information of others to infer the type of social interaction we are in, as well as the emotional state and intentions of the people we socialize with. Social movements are highly context dependent, and the context changes over time. In other words, the contexts are dynamic and require that an agent is able to adapt its behavior based on cues in the social environment. Because of these nuances of movement in human social interaction, it is very hard to write programs that could simulate similar behavior for any virtual or actual robot.

In order to program these sorts of movements into machines, the types of movements humans use have to be defined, but also when those movements are used and in what context. This makes the problem highly complex, and it could seem quite intractable to solve. Instead of trying to define everything in advance, it would be preferable to have the robot learn these movements by itself. In recent years deep learning, a subset of machine learning, has shown great proficiency in a wide range of tasks. In 2012 AlexNet [1] won the ImageNet competition [2, 3] and deep learning algorithms have dominated the field ever since. These algorithms have been shown to be able to play Atari games [4], to achieve superhuman performance in Go and chess [5, 6], and to improve the state of the art in fields like machine translation [7] and others [8], to name a few. Deep learning algorithms are general in nature and learn by training on large amounts of data. The behaviors learned depend both on the type of data and on how the learning problem is stated. This project states the learning as a reinforcement learning problem, where an algorithm learns by implementing actions in a trial and error fashion. An agent implements different actions inside an environment, and this environment simulates the consequences of the action and returns a reward. Based on this reward the algorithm is updated and the process is repeated.

There are many available environments that focus on games, such as ALE [9], pygame [10], Starcraft [11] and Deeplab [12], but also environments such as Psychlab [13], House3d [14] and dialog management [15], which are specific to certain fields of research. A combination of the kinds of environments above is found in OpenAI's Gym [16], which is an attempt to make these kinds of environments more accessible to practitioners. In the future, when learning algorithms become better and more efficient, it will be the learning frameworks that define what behaviors systems may acquire. This project aims to implement a basic starting point for an environment with a focus on social movements.

This project creates a basic framework consisting of custom made environments with associated rewards, an implementation of a reinforcement learning algorithm and some basic neural network architectures, with the goal of learning social movements. The framework has a general design to make it as agent agnostic as possible through an approach referred to as dancer in the mirror. The term agent agnostic is used throughout the thesis; for a framework to have this property, any agent should be able to learn the specific task in the same way. The specific traits of any agent should not affect the training at large. The main consideration is that the learning should depend only on the agent in question, without humans or labeled data in the loop. The framework consists of environments for both the humanoid robot Pepper [17], using Choregraphe [18], and a physics simulation environment based on OpenAI's Roboschool [19]. This project shows that the dancer in the mirror approach is functional as a basic framework for social movements by having a simplified agent learn to mimic poses. The project also indicates that this setup could work for more complex agents, such as Pepper and a virtual humanoid torso, given more time and effort for training.

1.1 Setup

Social movements are complex in general, and the problem is therefore simplified by defining some minimum requirements as a basis. Coordination is needed in order to move in a controlled way to any target state, meaning that if an agent is to raise its arm, it is, by virtue of coordination, able to do so. The other ability required for social movements is a notion of understanding or perception. In all social contexts it is very important to be able to perceive what the other agents are doing, so as to extract meaning in order to implement appropriate responses. Language is very important in order to act accordingly in any interaction where communication is essential, but for social movements there are many contexts in which explicit communication may not be present. In this project the language aspect of social interactions is omitted and understanding is purely based on visual perception. Thus the two basic abilities an agent utilizes in this project are coordination and vision.

Agents can be of many shapes and forms, and it is preferable if the learning framework is as agent agnostic as possible. To achieve this, little or no prior knowledge about the specific agents, concerning movement or vision, should be assumed. The learning loop should not depend on other actors such as humans or on labeled data. Because of the complexity of the problem and the many variables that combine to make up social movements, the design choices of the framework need to be carefully considered. By starting very simple and then incrementing the complexity of the learning framework, some problems might be avoided. In human interactions it is common to mimic poses and gestures [20], and the ability to mimic a pose could be seen as a first step in learning social movements. In this project the ability to mimic others, mainly the ability to reach the same pose as a target agent, is the first form of social movement to learn. In order to learn this ability for the general case, a first step is arguably to learn how to achieve poses based on self training, without any other agents in the loop, and this is the goal of the dancer in the mirror approach. This may then be made more complex by instead targeting a sequence of poses (movements), combining several sequences, adding other actors, and so on.

1.2 Dancer in the Mirror Approach

In this project the setup is analogous to that of a dancer practicing in front of a mirror, see Figure 1. The dancer initiates movements in order to reach a certain pose, observes how this is externally perceived and, with a target pose in mind, modifies the movement until the desired pose has been learned. By simultaneously observing the external perception of the poses and the movements performed, the dancer learns how certain internal actions, the muscle control, correlate with certain external observations. After a sufficient amount of training in front of the mirror, the dancer generalizes this ability to be able to mimic other dancers. The strength of this approach is that the setup is the same for any type of agent, virtual as well as real, and is not dependent on any specific abilities of the agent. The idea is that this setup may be implemented virtually in a simulation environment as well as in real life for actual physical robots. The setup is not dependent on agent specific details of movement or appearance, but could work across a wide range of agents.

To create a dancer in the mirror setup, an agent needs the ability to move, the ability to perceive its external state, and access to a target pose. A pose is the configuration an agent is in; it is explicitly defined by the internal joint state but may also be inferred from an external observation. All agents that can move have access to their internal joint states, and in this context this refers to the virtual muscles of the agent, the actuators or motors. This means that movements are performed in the space spanned by these values. To reach a pose is defined as, given an internal state representation of a target, performing actions in the internal state space and, to some degree of precision, reaching the configuration specified by the target pose. This is the ability of coordination. The ability to translate an external observation of another agent's pose into the corresponding internal joint state representation is referred to as understanding. Given an observation of another agent, the understanding ability translates the external observation into an internal state representation, and the coordination ability implements suitable movements in order to reach that pose. Successfully combining these two abilities to reach the target pose is what this project defines as pose mimicking.

The training is implemented by defining a set of target poses, initializing the agent, implementing actions and receiving rewards. After a certain amount of time the target may be switched and the process repeated. During training all of the data is known, since the agent trains with itself, but during testing the agent mimics a target pose without having access to the target's internal state.

Figure 1: A dancer practicing in front of a mirror. A dancer implements movements, perceives an external observation of itself and, with a target pose in mind, executes movements to reach the desired pose. Photo credit: Mait Jüriado, Ballerina, via photopin (license).

1.3 Research Questions

This project aims to create suitable environments in which agents can train, to choose a learning algorithm along with neural network architectures, and then to investigate whether the dancer in the mirror approach is valid as a basic learning framework for social movements.

• Can the dancer in the mirror approach train agents to learn coordination and mimic poses?

• Can the humanoid robot Pepper learn to mimic poses in this way?

• And can this be generalized to mimic dynamic motions and more complex behaviors?

First, the background material for the relevant concepts and algorithms is summarized. Then the method section goes through what was done and which experiments were run, the results of which are shown in the results section. Thereafter follows a discussion of the results, lessons learned, problems that arose and future work.

2 Background

In this section different concepts, algorithms and programs relevant to this project are explained. The section starts with machine learning in general and then specifically machine learning based on neural networks, how neural networks are built and how they are trained. This is followed by a summary of reinforcement learning, what types of learning algorithms there are, and then a more detailed explanation of the specific algorithm used in this project. The background section ends with summaries of the major software used, namely Choregraphe and OpenAI's Gym.

2.1 Machine Learning

Machine learning is an area of computer science that refers to algorithms which learn to approximate a function from data. This is in contrast to "regular" algorithms, which are defined in a static way to process data as instructed by the programmer. Machine learning is generally clustered into three different but overlapping fields of study called supervised learning, unsupervised learning and reinforcement learning. Supervised learning is for labeled data, where every data point has an associated label; the goal is then to find an algorithm which correctly maps the data points to the correct labels. Common supervised learning problems are classification and regression. Unsupervised learning is the field of finding patterns or features in data with no associated labels; common techniques include clustering algorithms. The amount of data available is rapidly increasing and unlabeled data is the most common type, therefore unsupervised learning is a very interesting field for the future. The final part of machine learning is referred to as reinforcement learning and gets its inspiration from learning behaviors observed in animals and humans, where a trial and error approach with associated rewards is used in order to construct algorithms. A common type of parametric function used in all fields of machine learning is the neural network, and this project focuses exclusively on this type of machine learning.

Deep learning has received a lot of attention recently, and the name refers to neural network algorithms that utilize many layers. These types of algorithms have proved very efficient in supervised learning problems with large sets of data, and many credit today's increasing attention to these algorithms to the ImageNet competition [2, 3] in 2012, where Krizhevsky et al. won with their deep neural network AlexNet [1].

2.2 Artificial Neural Networks

Artificial neural networks (ANNs) are functions built up of interconnected neurons. Today's popular ANNs are constructed from a certain type of artificial neuron that receives a vector input x, the sum of which is fed through a non-linearity called an activation function ϕ, to produce a scalar output y. For an N-dimensional input vector we have

y = \varphi\left( \sum_{i=1}^{N} x_i \right).    (1)

All neurons, except the ones in the input layer, receive data from other neurons in the network, and each connection between two specific neurons has a particular weight associated with it. A network which only takes input and feeds it through the network in a straightforward manner is called a feed forward network. This is in contrast to recurrent network structures, which are explained later. The simplest kind of feed forward network is the multi layer perceptron (MLP). The term perceptron originates from one of the first artificial neural network algorithms, which carried the same name. The perceptron is defined the same way as in equation 1, with a Heaviside step function as the non-linearity ϕ. In an MLP network, all neurons in one layer are connected to all the neurons in the subsequent layer; this connection scheme is often referred to as "fully connected". There are weights

between specific connections in the network, and these are commonly arranged in a matrix and multiplied by the vector output of one layer to produce the inputs to the next. Let W denote the matrix containing the weights of a layer, X be the input vector, b a vector of bias values, and ϕ again the activation function used by the neurons; then the vector output Y of a layer is defined as

Y = \varphi(WX + b).    (2)

This output is then used as the input to the next layer until the final output is reached. In Figure 2 a schematic diagram of a network with two layers is shown. Here the arrows between the neurons are the weights and the circles are the neurons. The goal is to update the weight parameters so that the network converges towards the desired function.
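As a small illustration of equation 2, the sketch below computes the output of one fully connected layer in NumPy. The layer sizes, the random input and the choice of tanh as activation function are assumptions made only for this example.

```python
import numpy as np

def dense_forward(x, W, b, activation=np.tanh):
    """Compute Y = phi(W X + b) for one fully connected layer (equation 2)."""
    return activation(W @ x + b)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)        # input vector X
W = rng.standard_normal((3, 4))   # weight matrix of a layer with 3 neurons
b = np.zeros(3)                   # bias vector
y = dense_forward(x, W, b)        # output fed to the next layer
```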

Figure 2: A schematic representation of an artificial neural network. The circles are neurons and the arrows are weighted connections. [image source [21]]

2.3 Activation Functions

The non-linearities in a neural network are required in order to model complex functions. Several linear layers connected together in the way described above are always mathematically equivalent to one linear layer. The first activation function, used in the perceptron algorithm from the late 1950s, was the Heaviside or threshold step function. Given any input, the neuron would be off unless the sum of its inputs was above a certain threshold, in which case the neuron would turn on. When neural networks became popular again in the 1980s, common activation functions were the logistic function and the hyperbolic tangent function.

2.3.1 Sigmoidal Activation Function

Sigmoid functions refer to functions with an s-shaped form and include functions such as the hyperbolic tangent and the logistic function. However, in machine learning the logistic function shown below is commonly referred to as the sigmoid activation. This function is defined as

A_{sigmoid}(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x}}{e^{x} + 1},    (3)

which has a range of (0, 1), ∀x ∈ ℝ, and the graph of the function is shown in Figure 3a. Another popular activation function is the hyperbolic tangent which, like the logistic function, also has a sigmoidal shape. The hyperbolic tangent is defined by

A_{tanh}(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}},    (4)

with a range of (−1, 1), ∀x ∈ ℝ, and is shown in Figure 3b. Both the logistic and tanh functions are well suited for classification because of their steep gradient around zero. This property makes it easy for the output of the neuron to move away from zero to either side, making its behavior more distinct. They are both bounded; their ranges are finite and "small", meaning that the signals in the neural network will not blow up and become very large. However, the two functions have a very flat slope towards large negative and positive values, so if the signal is far from zero the gradient information becomes small. This problem is called the vanishing gradient problem and can make training and convergence slow. The two functions differ only slightly in their behavior, mostly in the fact that tanh can output negative values and that its gradient is a little larger, but apart from this they behave very similarly.

2.3.2 ReLu

In order to avoid the vanishing gradient problem, and inspired by biology [22], the rectified linear unit (ReLu) works differently. The ReLu activation function is a max operation that is linear for inputs larger than 0 and 0 elsewhere,

A_{ReLu}(x) = \max(0, x).    (5)

The graph of the function is shown in Figure 3c. This activation function has an infinite range and therefore makes it possible for the signal in the neural network to become large. However, because of the zero output for negative inputs, the signal in the network is sparse, meaning that not all neurons will feed the signal forward. In practice the ReLu activation has become very popular for convolutional neural networks.
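The three activation functions in equations 3-5 can be written directly as NumPy one-liners; the sketch below is only meant to make the definitions concrete.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # equation 3, range (0, 1)

def tanh(x):
    return np.tanh(x)                 # equation 4, range (-1, 1)

def relu(x):
    return np.maximum(0.0, x)         # equation 5, zero for negative inputs
```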

Figure 3: Activation functions. (a) The sigmoid (logistic) function. (b) The hyperbolic tangent function (tanh). (c) The rectified linear unit (ReLu) activation.

2.4 Backpropagation

Artificial neural networks are parametric functions with an artificial neural computational unit at their core. These neurons are connected to one another in specific ways defined by the type of architecture being implemented. Regardless of which architecture is used, all neural networks need to be trained in order to approximate the desired function. Training refers to updating the parameters of a network in order to optimize a specific function, commonly referred to as a loss, cost or objective function. The most common approach is to optimize the weights by slightly perturbing each weight in the direction of the negative gradient of the loss function with respect to the specific weight being updated. The algorithm for updating a neural network through gradient information is called backpropagation, referring to the way the error gradient "flows" backwards from the final output layers of the network to the initial input layers. It is hard to say exactly who invented the algorithm, because the general idea of using gradients to optimize a parametric function has been known in mathematics for a long time. For details regarding backpropagation or neural networks in general, the interested reader is referred to this overview [23]. The basic optimization approach used together with backpropagation is the stochastic gradient descent algorithm.

2.4.1 Stochastic Gradient Descent

Neural networks are commonly optimized by gradient-based optimization algorithms. There are many different varieties, but they are all based on the standard gradient descent algorithm referred to as stochastic gradient descent (SGD). In SGD the network trains on mini-batches, subsets of the total dataset, meaning that only a sample of the true gradient is computed, hence the stochastic gradient. Let a neural network be parameterized by the vector θ and f(θ) be the objective function being optimized. The gradient of the objective function with respect to θ is then given by ∇_θ f(θ), and the update rule for SGD is

\theta_i^{t+1} = \theta_i^{t} - \alpha \nabla_{\theta_i} f(\theta), \quad \forall \theta_i \in \theta,    (6)

where α ∈ ℝ is the learning rate and θ_i^t refers to the i:th parameter being updated at time t.
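A minimal sketch of the update rule in equation 6, assuming the gradient has already been computed (for example by backpropagation); the numbers are placeholders.

```python
import numpy as np

def sgd_step(theta, grad, lr=0.01):
    # theta^{t+1} = theta^t - alpha * grad, applied elementwise (equation 6).
    return theta - lr * grad

theta = np.zeros(3)
grad = np.array([0.5, -1.0, 0.25])   # stand-in for a mini-batch gradient
theta = sgd_step(theta, grad)
```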

2.4.2 Adam

The basic SGD in section 2.4.1 is the foundation for many different optimization algorithms, and a popular one is Adam [24]. Adam stands for adaptive moment estimation and is more complex than regular SGD but often performs better. The algorithm estimates moving values of the first and second moments of the gradient. These are the parameters m_t and v_t, where t denotes the time step of the stochastic function f_t(θ). The moments are defined by

m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot \nabla_\theta f_t(\theta),    (7)
v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot \nabla_\theta f_t(\theta)^2,    (8)
\hat{m}_t = \frac{m_t}{1 - \beta_1^t},    (9)
\hat{v}_t = \frac{v_t}{1 - \beta_2^t},    (10)

where β_1, β_2 ∈ [0, 1) are the exponential decay rates of the moments and \hat{m}_t and \hat{v}_t are the bias-corrected estimates. The explanation of why the moments need to be corrected is omitted in this summary and the interested reader is referred to the paper [24]. The final update rule for time step t is then defined as

\theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},    (11)

where ε is a small constant for numerical stability. Informally, Adam is efficient because it scales the step taken in each update based on how noisy the gradient signal is, where a noisier signal yields a smaller step size.
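The sketch below walks through one Adam update according to equations 7-11 for a single parameter vector. The gradient g, the time step t and the default hyperparameter values are assumptions made only for the example.

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g              # first-moment estimate, eq. 7
    v = beta2 * v + (1 - beta2) * g ** 2         # second-moment estimate, eq. 8
    m_hat = m / (1 - beta1 ** t)                 # bias correction, eq. 9
    v_hat = v / (1 - beta2 ** t)                 # bias correction, eq. 10
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # update, eq. 11
    return theta, m, v

theta = np.zeros(3)
m, v = np.zeros(3), np.zeros(3)
g = np.array([0.5, -1.0, 0.25])                  # stand-in for a sampled gradient
theta, m, v = adam_step(theta, g, m, v, t=1)
```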

2.5 Architectures

Neural network architectures work by combining the basic computational neuron, see equation 1, in different ways in order to optimize for specific uses. There are three prominent base architectures commonly used in machine learning, the first of which is the fully connected architecture explained in section 2.2. The other two are convolutional and recurrent neural networks.

2.5.1 Convolutional Neural Network

In computer vision, convolutional neural networks have shown great efficiency and performance and are the most common architecture for processing visual inputs. They are inspired by the mammalian visual cortex and differ from the fully connected architecture by only connecting certain neurons in one layer to specific neurons in the next. In two-dimensional spatially correlated data, like images and videos, the information in one part of the data arguably has little or nothing to do with data in other parts, and this property is what convolutional networks exploit. Instead of connecting the output from one neuron to all neurons in the next layer, only a smaller subset of the neurons in the next layer is connected, a semi-connected neural network. Convolutional neural networks take this one step further by assuming that the processing done in one part of the data is also useful in other areas, and use a shared kernel of weights for all connections. Figuratively, one may imagine a kernel of weights traversing the two-dimensional data and taking the dot product between its weights and the subset of neurons it is hovering over. In more rigorous terms, a convolution operation is applied to the data. A schematic overview of this for a kernel size of 3x3 is shown in Figure 4a. The parameters of a convolutional layer are the size of the kernel, the stride and the number of kernels. The stride refers to the number of values the kernel skips: a stride of 1 calculates outputs for all values, a stride of 2 skips every other value, and so on.

2.5.2 Recurrent Neural Networks

Both the fully connected and the convolutional architectures are often implemented in a feed forward manner, meaning that the output only depends on the current input. In other words, these structures have no explicit internal memory apart from the connections between the neurons. For sequential data this can be a problem, because in such data there is a dependence between the data points in a sequence, and by definition this cannot be modeled without any memory. This is what recurrent architectures aim to fix. Instead of only having data propagate forward, some data is stored internally in each neuron and is used in subsequent data passes; a qualitative schematic is shown in Figure 4b. This means that recurrent neural networks (RNNs) have an internal memory property and are thus better at processing sequential data. The output of the network is not only dependent on the current input data but on all data previously processed by the network. This makes recurrent models more difficult to train, because the error gradient must not only traverse from the output error back to the input layers but also back through time, due to the internal memory structure.

Figure 4: Two common neural architectures. (a) The convolutional neural architecture: a layer with a 3 by 3 kernel, a schematic representation showing the output of the convolution for the dark blue value. (b) The recurrent neural architecture: a schematic representation of a recurrent neural network, where the circles are neurons and the arrows are weighted connections; notice the curved arrows pointing back on the neurons, which represent the internal memory of the RNN. [image source [25]]

2.5.3 Hyperparameters

A neural network is defined by the types of basic architectures, explained above, used in the implementation, but also by the size of the network, the number of layers and neurons, and other relevant features. These are all parameter values that are set before training and are commonly referred to as hyperparameters.

2.6 Reinforcement Learning

Reinforcement learning states the machine learning problem in terms of an agent interacting with an environment through actions. After implementing an action, the agent receives information about the state of the environment through an observation and a reward. By repeatedly executing actions and receiving rewards, the agent's goal is to learn how to behave and what policy to follow in order to collect the most reward. A schematic overview is shown in Figure 5a. In practice it is most common to state these problems as a Markov decision process (MDP). Markov decision processes are a mathematical framework that defines how actions can transition an agent from one state to another. They are discrete-time stochastic control processes, meaning that time is modeled as discrete and that the transitions between states in the MDP are stochastic in nature. Let S be a set of states s, A be a set of available actions a, and P(s'|s, a) the transition probabilities of going from one state to another given an action. At time t the environment is in a particular state s_t, an action a_t is implemented, and then with some probability P(s_{t+1}|s_t, a_t) the environment transitions into s_{t+1} and outputs a reward, R(s_{t+1}, s_t, a_t, ...). A small schematic example is shown in Figure 5b. The reward is given by some reward function that can depend on the current state, the implemented action, the subsequent state and more. This reward is often denoted r_t for convenience. In short, an MDP consists of four elements: a set of states S, a set of actions A, a state transition function P(s'|s, a) and a reward function R(·).

Figure 5: Schematic representations of reinforcement learning and Markov Decision Processes. (a) A reinforcement learning diagram: the agent (robot) implements an action in the environment and receives a reward and an observation. (b) A Markov Decision Process: the green circles represent states and the orange represent actions; the arrows represent probabilities and connect actions to the resulting states.
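To make the four MDP elements concrete, the sketch below writes out a tiny two-state, two-action MDP and samples one transition; all state names, action names and probabilities are arbitrary illustrative assumptions.

```python
import numpy as np

states = ["s0", "s1"]
actions = ["stay", "move"]

# P[s][a] is a probability distribution over next states, i.e. P(s'|s, a).
P = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.1, "s1": 0.9}, "move": {"s0": 0.7, "s1": 0.3}},
}

def reward(s, a, s_next):
    # Reward function R(.): reaching s1 is rewarded.
    return 1.0 if s_next == "s1" else 0.0

rng = np.random.default_rng(0)

def sample_step(s, a):
    # Sample s_{t+1} ~ P(.|s_t, a_t) and return it together with the reward r_t.
    next_states = list(P[s][a].keys())
    probs = list(P[s][a].values())
    s_next = str(rng.choice(next_states, p=probs))
    return s_next, reward(s, a, s_next)

s_next, r = sample_step("s0", "move")
```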

Reinforcement learning approaches can in general be divided into two major types, model-based and model-free. In a model-based approach there is a module that models the environment in order to simulate what could happen in the future given a specific action. This model can then be used by the policy to decide which actions to implement in order to achieve the best result. This is in contrast to model-free approaches, which use no explicit modeling of the environment. Both model-based and model-free approaches include three common classes of methods, value-based, policy-based and actor-critic, shown in Figure 6.

Figure 6: Two common reinforcement learning types are model-based and model-free. Both of these may utilize policy-based, value-based and actor-critic approaches.

In reinforcement learning there are some common functions which are defined below. An agent interacts with an environment over discrete time steps. At each time step t, the agent receives a state s_t ∈ S and implements an action a_t ∈ A chosen according to a policy π : S → A in a stochastic manner, a_t ∼ π(a_t|s_t). The agent then receives a scalar reward r_t and the environment transitions to the next state s_{t+1}. This behavior is iterated until the agent reaches a terminal state. The total accumulated return from time t is given by


R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k},    (12)

where γ ∈ (0, 1] is referred to as the discount factor. The value function

V^{\pi}(s) = \mathbb{E}[R_t \mid s_t = s],    (13)

is the expected total return when following the policy π from state s. Similarly, the action value function or Q-function

Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a],    (14)

is the expected return from selecting action a in state s and then following the policy π. The optimal value functions are then given by

Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a),
V^{*}(s) = \max_{\pi} V^{\pi}(s),    (15)

which are the maximum state-action value and value over all possible policies. In practice these functions are not known and are commonly approximated by a function with parameters θ. The goal is to maximize the expected return from each state s_t with respect to the parameters θ.
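As a small illustration, the discounted return in equation 12 can be computed backwards over a finite episode of rewards; the sketch below assumes the episode ends at the last reward.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t (equation 12) for every t of a finite episode of rewards."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

returns = discounted_returns([0.0, 0.0, 1.0])   # toy episode with a final reward
```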

2.6.1 Value iteration

In value-based learning the goal is to approximate the optimal value function and then follow a greedy policy, choosing the action that yields the largest reward. Value iteration methods compute the optimal value function for a given MDP through an iterative process. These methods first initialize an arbitrary random starting function V_0(s), which is then updated until the values for the states converge. It is common to use a temporal difference approach such as Q-learning in order to approximate the value function. Deep Q-learning got attention after Deepmind utilized it to solve several Atari environments [26]. However, other types of algorithms which utilize policy methods or actor-critic methods have been shown to be even more effective [27].

2.6.2 Policy Optimization

Policy optimization approaches work differently from value-based methods and directly try to find the best policy without a value function approximation. To do this it is common in practice to perform gradient ascent on the objective function by estimating the gradient. These approaches are called policy gradient methods. Here the policy π_θ(a|s) is optimized directly. Let

U(\theta) = \mathbb{E}\left[ \sum_{t=0}^{H} \gamma^t r(s_t); \pi_\theta \right] = \sum_{\tau} Pr(\tau; \theta) R(\tau),    (16)

be the total expected return from the environment conditioned on a policy π_θ, for any trajectory τ of length H. The factor Pr(τ; θ) is the probability of a certain trajectory given the parameters θ, and R(τ) is the cumulative reward collected following τ. The objective then becomes to maximize the function in equation 16 with respect to the parameters of the network,

\max_{\theta} U(\theta) = \max_{\theta} \sum_{\tau} Pr(\tau; \theta) R(\tau).    (17)

In order to optimize equation 17, the gradient with respect to θ is needed. The gradient is computed and algebraically restructured in a way which will prove useful:

\nabla_\theta U(\theta) = \nabla_\theta \sum_{\tau} Pr(\tau; \theta) R(\tau)
  = \sum_{\tau} \nabla_\theta Pr(\tau; \theta) R(\tau)
  = \sum_{\tau} \frac{Pr(\tau; \theta)}{Pr(\tau; \theta)} \nabla_\theta Pr(\tau; \theta) R(\tau)
  = \sum_{\tau} Pr(\tau; \theta) \frac{\nabla_\theta Pr(\tau; \theta)}{Pr(\tau; \theta)} R(\tau)
  = \sum_{\tau} Pr(\tau; \theta) \nabla_\theta \log Pr(\tau; \theta) R(\tau).    (18)

However, equation 18 is an analytical expression which depends on all possible trajectories, and for practical implementations these are not known. Instead the gradient is estimated by empirically averaging over m sampled trajectories, and equation 18 then becomes

\nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log Pr(\tau^{(i)}; \theta) R(\tau^{(i)}).    (19)

In order to explicitly state how the gradient estimate relates to the policy, the trajectory probabilities are decomposed. For any trajectory i,

\nabla_\theta \log Pr(\tau^{(i)}; \theta) = \nabla_\theta \log \left[ \prod_{t=0}^{H} P(s_{t+1}^{(i)} \mid s_t^{(i)}, a_t^{(i)}) \cdot \pi_\theta(a_t^{(i)} \mid s_t^{(i)}) \right]
  = \nabla_\theta \left[ \sum_{t=0}^{H} \log P(s_{t+1}^{(i)} \mid s_t^{(i)}, a_t^{(i)}) + \sum_{t=0}^{H} \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)}) \right]    (20)
  = \nabla_\theta \sum_{t=0}^{H} \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)}).

Because the transition probabilities P(s_{t+1}|s_t, a_t) do not depend on θ, that term vanishes and what is left are terms that depend only on the policy π_θ. This is a very useful result and means that information about the environment dynamics is not needed in the policy gradient estimate. Equation 20 is inserted into equation 19 to produce

\hat{g} = \hat{\mathbb{E}}_t \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) R_t \right],    (21)

which is the estimated policy gradient, also called the "vanilla" policy gradient. Algorithms which update policies in the direction of equation 21 are referred to as REINFORCE after Williams [28], who also showed that equation 21 is an unbiased estimate of the true gradient. Instead of using the actual returns, Williams [29] showed that the variance of the REINFORCE estimate can be reduced by subtracting a learned function, called a baseline, from the return, to produce

\hat{g} = \hat{\mathbb{E}}_t \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) (R_t - b_t(s_t)) \right].    (22)

The baseline function can be modeled in many ways to decrease the variance, and one common and efficient approach is to use an estimate of the value function. This results in algorithms which both update the policy and estimate a value function. These approaches are called actor-critic methods.
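A minimal PyTorch sketch of the REINFORCE estimate with a baseline (equation 22), written as a loss whose minimization performs gradient ascent on the objective; the random tensors stand in for one batch of sampled log-probabilities, returns and baseline values and are assumptions for the example.

```python
import torch

def reinforce_loss(log_probs, returns, baselines):
    # Advantage-like term (R_t - b_t(s_t)); the baseline is treated as fixed here.
    advantages = returns - baselines.detach()
    # Minimizing this loss ascends the gradient estimate in equation 22.
    return -(log_probs * advantages).mean()

log_probs = torch.randn(16, requires_grad=True)   # log pi_theta(a_t|s_t)
returns = torch.randn(16)                         # R_t
baselines = torch.randn(16)                       # b_t(s_t)
loss = reinforce_loss(log_probs, returns, baselines)
loss.backward()
```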

2.6.3 Actor-Critic Methods

Actor-critic methods rely on the principles of value-based and policy-based approaches combined. The actor is the policy π_{θ_a}(a|s) and the critic is the value function approximation V^{θ_c}(s). The parameters θ_a and θ_c are different if two separate approximators are used; however, both functions may also be approximated by a single parametric function. The most common way to define actor-critic algorithms is by the gradient estimate

\hat{g} = \hat{\mathbb{E}}_t \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \hat{A}_t \right],    (23)

where \hat{A}_t is the advantage function estimate at time t. This advantage function can be defined in many different ways [30], but a common one is

\hat{A}_t = Q(a_t, s_t) - V(s_t),    (24)

which compares the value of taking a particular action, Q(a_t, s_t), with the average expected value V(s_t). This is useful because if a certain action is associated with a better than average value the advantage is positive, and if it is worse the advantage is negative. From this it follows that actions associated with positive advantages will be encouraged and those associated with negative advantages will be discouraged. The algorithm used in this project is an actor-critic type algorithm called Proximal Policy Optimization.

2.6.4 Proximal Policy Optimization

Proximal policy optimization, PPO [31], is an actor-critic type algorithm that updates the policy by estimating the policy gradient and an advantage function. However, the gradient estimate differs slightly from that of ordinary actor-critic approaches. Instead of optimizing equation 23, it maximizes a surrogate loss. Let

r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)},    (25)

be the ratio between the policy π_θ after updates and an older version π_{θ_old} before said updates. By replacing the logarithm of the policy in equation 23 with the ratio in equation 25, the conservative policy loss is defined as

L^{CPI}(\theta) = \hat{\mathbb{E}}_t \left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \hat{A}_t \right] = \hat{\mathbb{E}}_t \left[ r_t(\theta) \hat{A}_t \right],    (26)

which is the basis of the trust region policy optimization algorithm, TRPO [32]. However, instead of adding a constraint to the optimization problem, as in TRPO, PPO uses a clipped version of the conservative policy loss in equation 26, defined as

L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \; \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right],    (27)

which is the PPO objective. Here ε is referred to as the clip parameter, a hyperparameter set at initialization. In this project the PPO algorithm was implemented with a truncated version of the generalized advantage estimate [30], as defined in the PPO paper [31].
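A sketch of the clipped surrogate objective in equation 27, expressed as a PyTorch loss to be minimized. The function name and the default clip parameter of 0.2 are assumptions made for illustration, not the thesis implementation.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)   # r_t(theta), equation 25
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Equation 27, negated so that minimizing the loss maximizes the objective.
    return -torch.min(unclipped, clipped).mean()
```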

The training is implemented in two steps, the exploration and the actual training. During the exploration phase the policy collects data for a certain number of steps, after which the generalized advantage estimate is calculated, and the policy is then trained for a certain number of epochs. Both the number of epochs and the number of steps taken during exploration are hyperparameters defined at initialization.

2.6.5 Exploration vs Exploitation

The exploration vs exploitation dilemma is a concept in reinforcement learning which concerns how an agent should choose actions during training. Exploration is often done by inserting some randomness into the actions taken, which makes the policy sometimes choose worse actions over better ones. This could yield smaller returns in the short term, but the additional information gathered could help the agent learn better policies in the future. Exploitation, on the other hand, means choosing the actions that yield the highest returns in the short term, often referred to as a greedy strategy. This yields the highest reward based on the current information the policy has gathered about the environment, but could leave many states unexplored which might yield even larger returns. This effectively means that the policy could get stuck in a local optimum. There is no objectively best choice for the ratio between exploration and exploitation, and it remains an open problem in reinforcement learning.

2.7 Pepper

Pepper [17] is a humanoid robot from Softbank Robotics that can interact with people and, according to the manufacturer, perceive emotions and adapt its behavior accordingly. Pepper has a 3D-camera and two RGB-cameras in his facial area and can recognize faces and objects. He also has four directional microphones which can detect the direction of sound sources and analyze the tone of voice and what is being said in order to model users' emotional context. Pepper moves around on three multi-directional wheels and uses 20 motors in order to perform fluid and human-like motions. He has the ability to move his head, back and arms and has a built-in battery which gives Pepper about 12 hours of autonomous interaction.

Figure 7: The humanoid robot Pepper [33] from Softbank Robotics.

In this project access to Pepper's arms is the main focus. Each arm has 6 degrees of freedom, but these are constrained to certain ranges, listed in Table 1 and shown in Figure 8.

2.7.1 Choregraphe

A recommended first step when working with Pepper is the program Choregraphe [18]. In this program users can set up an interaction pipeline, visualize the robot's movements, define new movements, and more.

Table 1: Arm actuators and ranges [34]

Joint            Name, Motion (rotation axis)                Range (degrees)
RShoulderPitch   Right shoulder joint, front and back (Y)    -119.5 to 119.5
RShoulderRoll    Right shoulder joint, right and left (Z)    -89.5 to -0.5
RElbowYaw        Right shoulder joint, twist (X')            -119.5 to 119.5
RElbowRoll       Right elbow joint, (Z')                     0.5 to 89.5
RWristYaw        Right wrist joint, (X')                     -104.5 to 104.5
LShoulderPitch   Left shoulder joint, front and back (Y)     -119.5 to 119.5
LShoulderRoll    Left shoulder joint, right and left (Z)     0.5 to 89.5
LElbowYaw        Left shoulder joint, twist (X')             -119.5 to 119.5
LElbowRoll       Left elbow joint, (Z')                      -89.5 to -0.5
LWristYaw        Left wrist joint, (X')                      -104.5 to 104.5

A running instance of Choregraphe is shown in Figure 9. In order to communicate with the robot from outside Choregraphe, the qi Python API [35] was used. This framework makes it possible to programmatically control Pepper, which was relevant for this project.

2.8 OpenAI's Gym

In the field of reinforcement learning there has been a lack of common environments for researchers to use as a baseline for reinforcement learning algorithms. Small differences in environments can have a large impact on the performance of reinforcement learning algorithms, and this makes it hard to compare algorithms and to reproduce research results. In an attempt to fill this void, OpenAI [36] constructed the Gym framework [16, 37]. Gym contains a collection of partially observable Markov decision processes (POMDPs) in environments categorized as classic control, algorithmic, Atari, board games, and both 2D and 3D robot simulators. Gym is user friendly and has become a popular starting point for experimenting with reinforcement learning as well as for serious research. The Gym framework provides a simple and convenient way to create Gym wrappers for custom environments.

A Gym environment consists of a few specific functions, mainly the step, reset and render functions. The step function handles the transition between time steps in the environment; it takes an action as input and returns the state of the environment, a reward and information about whether the episode is done or not. The reset function is called at the start of training and when the environment resets between episodes. This function commonly only returns a state, which is the initial state of an episode. The last of the three most common Gym functions is the render function, which does exactly what its name implies and renders the episode.
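A schematic custom environment with the three functions described above, following the Gym API of the time (reset returning a state, step returning state, reward, done and info). The toy dynamics, where joint angles are nudged towards a random target pose and the reward is a negative distance, are an illustrative assumption and not the environments built in this project.

```python
import gym
import numpy as np

class ToyPoseEnv(gym.Env):
    def __init__(self, n_joints=2, max_steps=100):
        self.n_joints = n_joints
        self.max_steps = max_steps
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(n_joints,), dtype=np.float32)
        self.observation_space = gym.spaces.Box(-np.pi, np.pi, shape=(n_joints,), dtype=np.float32)

    def reset(self):
        # Initial state of an episode plus a random target pose to mimic.
        self.t = 0
        self.state = np.zeros(self.n_joints, dtype=np.float32)
        self.target = np.random.uniform(-1.0, 1.0, self.n_joints).astype(np.float32)
        return self.state

    def step(self, action):
        # Transition one time step: small joint increments, distance-based reward.
        self.t += 1
        self.state = np.clip(self.state + 0.05 * np.asarray(action, dtype=np.float32),
                             -np.pi, np.pi)
        reward = -float(np.linalg.norm(self.state - self.target))
        done = self.t >= self.max_steps
        return self.state, reward, done, {}

    def render(self, mode="human"):
        print(f"state={self.state}, target={self.target}")
```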

Figure 8: The movement specification of Pepper's arms [34]. (a) Left arm. (b) Right arm.

2.8.1 Roboschool

In the OpenAI Gym framework there are many available simulation environments for reinforcement learning. Some of these are simple, like the "CartPole" environment seen in Figure 10, and some are more complex, like the Atari and robot simulation environments. When Gym was first released, its robot simulation environments all depended on the physics simulation engine Mujoco [38]. However, in order to use the Mujoco engine a license is required. For students one license for one computer can be obtained for free, but in other cases there are associated fees. This made the Gym environments dependent on Mujoco more difficult for practitioners to use and led to a demand for a more easily accessible system. A solution to this was the Roboschool environments shown in Figure 11. These environments do not depend on Mujoco but on Bullet [39], which is free to use.

3 Method

This section describes the work done in the project and how the experiments were set up and implemented. The first section is a description of the implementation of the reinforcement learning algorithm, followed by sections on how the custom environments were created and the design of the reward function and the neural networks. The method section ends with the experiments used to investigate whether the approach was functional.

Figure 9: A view of an open instance of Choregraphe showing the main program to the left and the robot view to the right.

Figure 10: The Gym Cartpole environment where the goal is to balance a pole.

3.1 Learning Algorithm

The first focus of the project was to implement the learning algorithm, Proximal Policy Optimization, described in section 2.6.4. This algorithm was chosen because of its promising performance and efficiency on robotic simulation tasks [31]. The implementation of the PPO algorithm was written in PyTorch [40, 41], which is a framework for machine learning with an emphasis on neural network optimization. The entire algorithm was implemented from scratch in order to have full control and to learn, in practice, about working with neural networks in a reinforcement learning context.

The PPO algorithm requires specific data storage, exploration and training functions. The algorithm is a form of on-policy learning, which means that data sampled by following the current policy is used for training but then discarded in favor of new data sampled by the updated policy. The most relevant parts of the implementation are explained below, but for exact details the interested reader is referred to the PPO paper [31] and the implemented code [42].

Figure 11: An example from the Roboschool environments showing a humanoid, a "hopper" and a "halfcheetah" doing a locomotion task.

First, the ability to store data from several environments run in parallel was implemented. In this project the agents explore multiple instances of an environment at once, and the data from these instances is collected simultaneously after each environment has returned data from a single step. The PPO algorithm works by exploring the environment and collecting data for a fixed number of steps, and then uses these data points to compute relevant values for the training. It is important to keep track of which data points come from which process, because the values needed for training are temporally dependent, meaning that values from one process at one time step need an explicit connection to the values in that process at the next time step.

After the data storage was implemented, the exploration and training functions were created, and these are specific to PPO. The exploration function calls the step function in the environments, handles and transforms the returned data in appropriate ways and iterates until exploration is finished. PyTorch is an automatic differentiation framework, so for the actual training one defines the specific PPO loss function and then the mechanisms of PyTorch run backpropagation and update the network using an optimizer. The chosen optimizer for this project was Adam, explained in section 2.4.2.
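A minimal, self-contained sketch of these update mechanics: a surrogate loss is defined on stored rollout data, PyTorch backpropagates it, and Adam updates the parameters. The tiny network, the random dummy batch and the discrete log-softmax policy are assumptions made only so the example runs; they do not reflect the actual networks of this project.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

states = torch.randn(8, 4)          # dummy batch of stored states
advantages = torch.randn(8, 2)      # dummy advantage estimates
old_log_probs = torch.randn(8, 2)   # log-probabilities under the old policy

# Define the loss (here the clipped objective sketched in section 2.6.4) ...
new_log_probs = torch.log_softmax(policy(states), dim=-1)
ratio = torch.exp(new_log_probs - old_log_probs)
loss = -torch.min(ratio * advantages,
                  torch.clamp(ratio, 0.8, 1.2) * advantages).mean()

# ... and let PyTorch backpropagate and the Adam optimizer update the network.
optimizer.zero_grad()
loss.backward()
optimizer.step()
```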

3.2 Pepper Environment

The purpose of including Pepper in this project was training on a real robot. The specific traits of Pepper were not the focus, and the design of the Pepper implementation aimed to represent a setup for any generic robot. The many abilities intrinsic to Pepper, really his main functionalities, such as emotion recognition, face recognition, object detection and so on, are not exploited; only access to the actuators that control his movements is utilized. The Choregraphe Suite program rendered Pepper and simulated all the actions he implemented. The Pepper environment was written as a Gym wrapper, described briefly in section 2.8, where the step function uses the qi Python API [35] to send actions to Choregraphe and then receive relevant information back. The goal was to define the actions of this environment as the actuator torques applied in the joints. The idea of using direct torques as actions is that it requires no prior movement abilities and fulfills the criterion of being as agent agnostic as possible. The state returned would be the current configuration of the actuator values, an observation would be the pixel rendering of Pepper, and the reward would be a function of the distance between the current state and the target state.

Because of the way Pepper is constructed and programmed, it proved difficult to get control over the actual torques in the actuators. However, an absolute or a relative angle of each joint could be used as an action. In other words, one could define a set of angles and, regardless of the current configuration of Pepper, the intrinsic movement abilities would implement the movements necessary to reach the defined configuration. This would assume prior knowledge of movement and therefore make the training less agent agnostic. To circumvent this, a small incremental angle was used to define an action. This incremental angle needed to be constrained in such a way that the increments were small enough not to rely heavily on prior knowledge, but not so small that exploration in the reinforcement learning setup would be too constrained. The solution was to make each actuator action dependent on a fraction of its total angle range, referred to as the max angle. Another parameter in the qi API defined how fast Pepper's movements were implemented, namely how much torque could be applied during a movement, and was a fraction between 0 and 1. The parameter values are listed in Table 2. A complete action was defined by the output of the policy network, a value between -1 and 1 for each actuator, multiplied by the specific max angle of the individual actuators.
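The sketch below illustrates this action definition: the policy output in [-1, 1] for each actuator is scaled by a max angle derived from the 5% fraction in Table 2, with the joint ranges taken from Table 1. Whether the fraction applies to the full joint range or to its maximum value is an implementation detail not specified here, so applying it to the full range is an assumption made for the example.

```python
import numpy as np

# Full ranges in degrees for the right-arm joints, taken from Table 1.
joint_ranges = {
    "RShoulderPitch": (-119.5, 119.5),
    "RShoulderRoll": (-89.5, -0.5),
    "RElbowYaw": (-119.5, 119.5),
    "RElbowRoll": (0.5, 89.5),
    "RWristYaw": (-104.5, 104.5),
}

MAX_ANGLE_FRACTION = 0.05  # the "Max angle" parameter of 5% in Table 2

def incremental_angles(policy_output):
    """Scale policy outputs in [-1, 1] to small incremental joint angles."""
    increments = {}
    for value, (name, (low, high)) in zip(policy_output, joint_ranges.items()):
        increments[name] = float(value) * MAX_ANGLE_FRACTION * (high - low)
    return increments

deltas = incremental_angles(np.array([0.2, -0.5, 0.1, 0.9, -0.3]))
```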

Movements were simulated in a window in Choregraphe called "robot view", but the pixel values could not be transmitted directly over the API. The solution was to write a script that found the coordinates of the "robot view" window and retrieved the pixels from the screen, i.e. the observations. Because of this the "robot view" window always had to be visible, making the rendering function of the Gym wrapper obsolete; the rendering would take place in Choregraphe. The final thing to define was an initial starting state, and for convenience the already implemented "StandInit" pose, the built-in starting pose of Pepper, was used.

Table 2: Action parameter values

Parameter            Value   Description
Max angle            5%      Fraction of the maximum angle for each joint
Max speed fraction   10%     Actions are careful, not strong

3.3 Custom Roboschool Environment

The custom roboschool environments are built on OpenAI's roboschool [19]. The custom environments created for this project consist of an agent that is static in space with the ability to move certain limbs. The environments are implemented from two directions: the definition of the physical environment simulated by Bullet [39], and the wrapper for Gym. The Bullet part is the actual physics simulation, and the entire environment for the 3D simulation is defined in an XML file. In this file, the information about where the joints and actuators are located and how they connect to each other and to other body parts is defined.

Two custom environments were implemented, namely the custom reacher environment and the custom humanoid environment. The custom reacher environment is a customization of the reacher environment already implemented in roboschool, and the other is an implementation of a more robot-like humanoid. The custom reacher environment consists of one arm moving in a plane with 2 degrees of freedom (DoF). The 2 DoF come from the fact that this arm only has 2 different joints, both of which can only rotate around one axis. There were two versions of this environment: one where two spherical targets were visible and the goal was to get the color coded joints to the corresponding targets. The other version had no explicit objects as targets but a target image and joint state values set in the Gym part of the implementation, outside of the simulation. This custom environment without the explicit targets is shown in Figure 12a. The other environment is a virtual upper torso of a humanoid which has 6 DoF moving in 3D space and is shown in Figure 12b.

Figure 12: Custom environments. (a) An observation from the custom reacher environment. (b) An observation from the custom humanoid environment.

When designing an environment for reinforcement learning it is important to note that every detail might play a part in the final outcome and therefore any decisions should be as thought through as possible. In Bullet there are several physical objects that are attached to each other and can exert friction or dependencies. The reacher environment was created to be the most basic environment possible: there was no friction and the movements were defined in a plane perpendicular to gravity (no forces). The joint movements, in the plane, were unconstrained, the body parts could not interact with each other and no collisions were possible. The humanoid environment was meant to be more human-like and therefore more complex, so in this environment the movements were constrained in order to model how humans can move, based on the humanoids defined in roboschool; collisions were made possible and the limbs could interact.

The actions for the two custom environments were torques, represented by continuous values between -1 and 1, one value for each actuator. The observation was the pixel values of the agent, and the state was the coordinates of the joints, their velocities and the distance vector between the robot parts and the targets. The reward was calculated based on the distance vector between the robot and the targets.
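As a rough illustration of how the Gym wrapper ties these pieces together, a simplified and hypothetical step function could look as follows; the simulation object and its method names are made up for this sketch and are not the project's actual API:

import numpy as np

def mimic_step(sim, action, target_points, frame, episode_length):
    """One environment step: apply torques, then compute state, observation and reward."""
    torques = np.clip(action, -1.0, 1.0)        # continuous torques in [-1, 1]
    sim.apply_torques(torques)                   # hypothetical simulation calls
    sim.advance()

    joints, velocities = sim.joint_states()
    distance_vector = sim.key_point_positions() - target_points

    state = np.concatenate([joints, velocities, distance_vector.ravel()])
    observation = sim.render_pixels()            # external pixel observation of the agent
    reward = -np.linalg.norm(distance_vector)    # negative distance to the targets (section 3.4)
    done = frame + 1 >= episode_length
    return state, observation, reward, done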

3.4 Reward Function

In reinforcement learning the reward function is what signals which actions and/or states are "good" or "bad", and changing the reward function changes the behavior that any agent in the environment learns. In this project the goal is for an agent to move its limbs such that specific key points on the agent align with specified target points. Therefore the reward is defined as a function of these key values, specifically the distance vector between them. However, there are many different ways to define the exact properties of this function, and a smaller qualitative experiment was conducted in order to decide on which specific reward function to use. Three reward functions were tested in this project: the absolute distance reward, the velocity reward, and the latter with associated costs. The absolute reward function R_a is defined by

D(s, s_{target}) = \sqrt{\sum_{i=0}^{n} |s_i - s_{target,i}|^2},    (28)

R_{a,t}(s, s_{target}) = -D_t(s, s_{target}),    (29)

at time t, where the reward is the negative of the distance D(s, s_{target}), which is the 2-norm, or Euclidean distance, between the key parts on the robot and the targets. This reward function states that being far away from the target is worse than being close, and the maximum value, the best reward, is zero. The second reward function is the velocity reward, which includes a time dependency between subsequent states and is defined as

R_{v,t} = D_{t-1} - D_t.    (30)

This function states that it is good to move towards the target position and bad to move away from it. Moving towards the target yields a positive reward, moving away a negative reward, and not moving at all yields zero reward. The final reward function is an extension of the velocity reward function and adds penalties, or costs, to the reward. The velocity cost reward is then defined as

R_{c,t} = R_{v,t} - Costs,    (31)

where Costs is a function that might depend on the specific action taken or be a penalty for reaching a state defined as bad.
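The three reward functions can be summarized in a few lines of code. This is a sketch under the assumption that the distance D is the Euclidean norm of the distance vector and that the cost term is the action-magnitude penalty used in the experiment described next:

import numpy as np

def distance(state, target_state):
    """Euclidean distance between the key points and the targets, eq. (28)."""
    return np.linalg.norm(np.asarray(state) - np.asarray(target_state))

def absolute_reward(state, target_state):
    """Eq. (29): the closer to the target, the higher (less negative) the reward."""
    return -distance(state, target_state)

def velocity_reward(prev_distance, current_distance):
    """Eq. (30): positive when moving towards the target, negative when moving away."""
    return prev_distance - current_distance

def velocity_cost_reward(prev_distance, current_distance, action, cost_scale=0.1):
    """Eq. (31): velocity reward minus a cost, here a penalty on the action magnitude."""
    return velocity_reward(prev_distance, current_distance) - cost_scale * np.abs(action).sum()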

An experiment was conducted on the custom reacher environment with spherical targets to decide which reward function was the better candidate. The cost function for the velocity cost reward was defined as Costs = 0.1 · |action|. The different reward functions were trained using the same network architecture, for the same number of frames, and the policy with the highest test score was chosen to represent each reward function. These experiments were qualitative and not extensively tested, because the framework should function with any of the reward functions, albeit with varying learning efficiency. During the development process it is vital to keep as many things as possible unchanged while adding different features, and instead of picking a reward function at random or spending time on extensive testing, three policies were trained and a reward function was chosen based on the recorded videos.

3.5 Networks

As a basic starting point for experiments, three different network architecture approaches were implemented. The data used in the training can be divided into two major types: the internal joint states and the external observations. These may be processed by completely separate networks, a combination of networks, or by one single network.

The simplest network architecture was the modular approach, where the coordination was independent from the understanding during training time. The coordination network exclusively trains on the internal state spaces: the current state of the agent and the target state. The understanding network trained on random observations, in a supervised learning fashion, with the objective to correctly predict the corresponding internal state given an observation. During test time, where the internal state of the target is not known, the understanding network processes the observation and feeds the predicted target state to the coordination policy.

The semi-modular approach is very similar but expands the input space of the coordination network. In this approach the coordination policy samples actions conditioned on all the information known during training, the observations of the agent and the target as well as the internal states. The basic idea is that the coordination module could use the visual information in the observation to better learn actions. However, the same understanding network as in the modular approach was used during testing to explicitly predict the internal target state.

The combined approach does not utilize a separate network for explicitly predicting the internal target states. This policy samples actions conditioned on the agent's current internal state, its observation and the target observation. This way the training and the test data are the same, and the coordination and understanding abilities are implicitly realized in one single network.

3.5.1 Modular Approach

At time t, let o_t be the color pixel image observation and s_t the internal joint state of an agent. Let the coordination policy π^c_θ be defined as a neural network with parameters θ. The policy maps the agent's state and the target state into an action a_t ∼ π^c_θ(s_t, s_target) and a value v_t = π^c_θ(s_t, s_target), where a_t is sampled from a distribution but v_t is deterministic. Let the understanding module f_{θ_u} be a network with parameters θ_u which maps an observation to a state approximation ŝ_t = f_{θ_u}(o_t). The policy π^c_θ is updated according to the PPO algorithm, while the understanding module minimizes the mean squared error loss between the true internal states and the predictions. During training the agent has access to the complete pose of the target, the target observation o_target and the target state s_target; however, during test time only the external observation o_target is available. Thus, during test time an action is given by a_t ∼ π^c_θ(s_t, ŝ_target).

The understanding module is a convolutional neural network with three convolutional layers with ReLU activations, followed by a fully connected part with two layers: one with a ReLU activation and a linear output layer, shown in Figure 13. The coordination network is a fully connected neural network with two hidden layers and tanh as the activation function. A schematic figure of this network is shown in Figure 14.
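A minimal PyTorch sketch of the two modules is given below, using the hyperparameters from Table 4 (three feature maps of 64, kernel size 5, stride 2, 40x40 RGB observations, hidden size 256); the exact layer sizes of the project's networks may differ:

import torch
import torch.nn as nn

class UnderstandNet(nn.Module):
    """Maps a (3, 40, 40) observation to a predicted internal joint state."""
    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=5, stride=2), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(64 * 2 * 2, hidden), nn.ReLU(),   # 2x2 feature maps for a 40x40 input
            nn.Linear(hidden, state_dim),                # linear output: predicted state
        )

    def forward(self, obs):
        x = self.conv(obs)
        return self.fc(x.flatten(start_dim=1))

class CoordinationNet(nn.Module):
    """Maps (agent state, target state) to an action mean and a value estimate."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.action_mean = nn.Linear(hidden, action_dim)
        self.value = nn.Linear(hidden, 1)

    def forward(self, state, target_state):
        h = self.body(torch.cat([state, target_state], dim=-1))
        return self.action_mean(h), self.value(h)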

Figure 13: A graphic representation of the understanding network. This network consists of three convolutional layers with ReLU activations and two fully connected layers, one with a ReLU activation and one with a linear output.

3.5.2 Semi Modular Approach

In the semi-modular approach the policy π^semi_θ maps the complete target pose (s_target, o_target), the current external observation o_t and the current internal state s_t to an action a_t. During training an action is sampled as a_t ∼ π^semi_θ(o_t, s_t, s_target, o_target). This policy uses all available information in order to compute an action during training, but still needs the understanding module to approximate the internal target state during testing, and then a_t ∼ π^semi_θ(o_t, s_t, ŝ_target, o_target).

Figure 14: A graphic representation of the coordination network. This network consists of a fully connected neural network with two hidden layers and tanh activations.

The semi-modular network is similar to the coordination network with the addition of a pixel embedding, which has the same architecture as the understanding network but without the fully connected layers at the end. The values from the pixel embedding are concatenated with the state values and then fed through two hidden layers with tanh activation functions. The network is shown in Figure 15.
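A sketch of the semi-modular forward pass, reusing the convolutional part of the understanding network above as a pixel embedding; the names and the exact concatenation order are assumptions:

import torch
import torch.nn as nn

class SemiModularNet(nn.Module):
    """Policy conditioned on the internal states and on pixel embeddings of both
    the agent's own observation and the target observation."""
    def __init__(self, conv_embedding, embed_dim, state_dim, action_dim, hidden=256):
        super().__init__()
        self.embed = conv_embedding             # convolutional layers only, no fc head
        in_dim = 2 * embed_dim + 2 * state_dim  # two embeddings + agent and target states
        self.body = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.action_mean = nn.Linear(hidden, action_dim)
        self.value = nn.Linear(hidden, 1)

    def forward(self, obs, state, target_state, target_obs):
        e_obs = self.embed(obs).flatten(start_dim=1)
        e_target = self.embed(target_obs).flatten(start_dim=1)
        h = self.body(torch.cat([e_obs, e_target, state, target_state], dim=-1))
        return self.action_mean(h), self.value(h)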

3.5.3 Combined Approach

In the combined approach the understanding module is omitted and only information available during testing is used for the training. The policy is as in the semi-modular approach except that the internal state of the target is omitted. The combined policy actions are then given by a_t ∼ π^comb_θ(s_t, o_t, o_target). The network is shown in Figure 16.

3.6 Experiment

The experiments are meant to show that the created environments and the dancer in the mirror approach are functional for agents to learn to mimic poses. Being functional in this context means that the algorithm is able to improve and that the resulting behaviors match the expected behaviors. The experiments consist of sample runs in all of the created environments with some variation in the networks applied.

Figure 15: A graphic representation of the semi-modular network. The pixel embedding is the same convolutional network as the understanding network but without the fully connected layers at the end.

In all experiments the actions were sampled from a normal (Gaussian) distribution where the mean was the output of the policy network and the standard deviation was a hyperparameter. The standard deviation was defined as a linearly decreasing function between an initial and an end value. Other hyperparameters include the learning rate, the episode length, the total number of frames and the dimensions of the neural networks. The experiments all used the same hyperparameter values specific to the PPO algorithm, shown in Table 3, along with other hyperparameters presented in Tables 4 and 5.
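A sketch of this action sampling with a linearly decreasing standard deviation, using the Std-start and Std-stop values from Table 4 (e^{-0.6} to e^{-1.7}); the exact form of the schedule in the project code is assumed here to be linear in the number of frames:

import math
import torch
from torch.distributions import Normal

std_start, std_stop = math.exp(-0.6), math.exp(-1.7)
total_frames = 5_000_000

def current_std(frame):
    """Linearly interpolate the exploration noise from std_start to std_stop."""
    frac = min(frame / total_frames, 1.0)
    return std_start + frac * (std_stop - std_start)

def sample_action(policy_mean, frame):
    """policy_mean: output of the policy network, one value in [-1, 1] per actuator."""
    dist = Normal(policy_mean, current_std(frame))
    action = dist.sample()
    return action, dist.log_prob(action).sum(-1)   # the log prob is needed for the PPO ratio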

How well an agent could mimic poses was determined through an evaluation scheme based on a threshold distance and a set duration, along with recorded videos. The best policies learned during training were evaluated by having them reach as many poses as possible in a certain number of frames. A pose was considered reached if the total potential, the sum of the absolute distances between the agent and the target, was below a certain threshold for a certain number of consecutive frames. The target was updated if a pose was accomplished or after trying for a maximum number of frames, the episode length used during training.
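The evaluation scheme can be summarized as follows; the threshold, hold_frames and the environment interface are placeholders for this sketch rather than the values and API used in the project:

def evaluate_poses(env, policy, total_frames, episode_length,
                   threshold=0.1, hold_frames=5):
    """Count how many target poses are held below the potential threshold
    for a number of consecutive frames within a fixed frame budget."""
    reached, attempted, consecutive, frames_on_target = 0, 1, 0, 0
    state, target = env.reset()
    for _ in range(total_frames):
        state, potential = env.step(policy(state, target))
        frames_on_target += 1
        consecutive = consecutive + 1 if potential < threshold else 0
        if consecutive >= hold_frames or frames_on_target >= episode_length:
            reached += int(consecutive >= hold_frames)   # pose accomplished or timed out
            target = env.new_target()
            attempted += 1
            consecutive, frames_on_target = 0, 0
    return reached, attempted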

3.7 Custom Reacher Experiments

For the custom reacher environment the understanding network was trained separately, in a supervised learning fashion, on a randomly generated dataset. The dataset consisted of 200000 training data points and a validation set of 100000 data points. It trained for 300 epochs and the iteration with the lowest validation score was used as the understanding network. The same understanding network was used for the modular and the semi-modular approach. Both the reacher and the humanoid environments tested the performance of policies at certain intervals throughout the training. These tests were defined as the average total reward collected over 5 episodes.

Figure 16: A graphic representation of the combined network. This network is like the semi-modular network but without the target's internal state values.

Table 3: The PPO hyperparameters used in the experiments

Parameter                   Value
Clip                        0.2
Exploration steps           2048
Epochs                      8
τ, value for the GAE [30]   0.95
Epsilon                     10^-8
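The supervised training of the understanding network described at the start of this section could be sketched as below, with the MSE objective and the Adam optimizer; the data loaders and batch handling are assumptions for this illustration:

import copy
import torch
import torch.nn as nn

def train_understand(model, train_loader, val_loader, epochs=300, lr=3e-4):
    """Supervised training: predict the internal joint state from an observation,
    keeping the model with the lowest validation loss."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    best_val, best_model = float("inf"), copy.deepcopy(model)
    for _ in range(epochs):
        model.train()
        for obs, state in train_loader:
            loss = criterion(model(obs), state)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
        model.eval()
        with torch.no_grad():
            val = sum(criterion(model(o), s).item() for o, s in val_loader) / len(val_loader)
        if val < best_val:
            best_val, best_model = val, copy.deepcopy(model)
    return best_model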

In the reacher environment the best policy was used to mimic dynamic targets. Dynamic targets were collected by storing a sequence of connected poses which were used as targets and switched in order every n:th frame. In the recordings the trained agent had n frames to reach a certain target pose, after which that target was switched to the subsequent one.
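In code the dynamic-target evaluation is a small change to the static case, something along these lines (the pose sequence and the environment calls are placeholders):

def mimic_dynamic_targets(env, policy, pose_sequence, n):
    """Switch to the next target pose in the recorded sequence every n:th frame."""
    state, _ = env.reset()
    for frame in range(len(pose_sequence) * n):
        target = pose_sequence[frame // n]     # hold each pose for n frames
        state, _ = env.step(policy(state, target))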

3.8 Custom Humanoid Experiments

A sample run using the coordination part of the modular approach with ground truth target internal states was implemented. The parameters are listed in Tables 4 and 5.

Table 4: Hyperparameters for the reacher, humanoid and Pepper experiments

Parameter          Reacher        Humanoid    Pepper
Frames             5 · 10^6       5 · 10^6    5 · 10^6
Episode length     300            300         300
Learning rate      3 · 10^-4      3 · 10^-4   3 · 10^-4
Std-start          e^-0.6         e^-0.6      e^-0.6
Std-stop           e^-1.7         e^-1.7      e^-1.7
Hidden             256            256         256
Feature maps       [64, 64, 64]   -           -
Kernel size        (5, 5)         -           -
Strides            (2, 2)         -           -
Observation size   (3, 40, 40)    -           -

Table 5: Total number of parameters for each of the networks used

Network                                   Reacher   Humanoid   Pepper
Modular (only the coordination module)    69379     75527      78605
Semi-modular                              265507    -          -
Combine                                   264483    -          -
Understand                                243204    -          -

3.9 Pepper Experiments

The sample trainings of the Pepper environment utilized two different reward functions. The experiments used the coordination network with ground truth internal state targets and did not utilize the understanding module. The first run was the most basic and used the absolute reward function with only one single static target. The second training used the same target but the velocity reward function, and the last training used the velocity reward on random targets; the runs are summarized in Table 6. The hyperparameters are summarized in Tables 3, 4 and 5.

Table 6: Pepper training

Run   Reward function   Target   Learning rate   Hidden   Duration (hours)
0     Absolute          Static   3 · 10^-4       256      6.5
1     Velocity          Static   3 · 10^-4       256      11.5
2     Velocity          Random   3 · 10^-4       256      16

4 Results

In this section the data from the experiments are presented. First the experiments regarding the reward functions are presented, followed by the custom roboschool environment and ending with the Pepper environment.

4.1 Reward Function

The results of the reward function evaluation are presented in Table 7. They are qualitative videos, used to choose one reward function over the others for the development stages of the environment. The videos show the behavior of the different policies on the reacher environment. The reacher agent has two key points, one red and one green, and the goal was to get as close as possible to the target points with the matching colors. These videos indicate that the velocity reward is the most useful reward function for this project.

Table 7: Links to the different reward function videos

Absolute             https://www.youtube.com/watch?v=rMlCqyAw88w
Velocity             https://www.youtube.com/watch?v=qyNpae7VgC4
Velocity and costs   https://www.youtube.com/watch?v=WrA0X3BGA5I

4.2 Experiment

The results are presented as data graphs recorded during training. For smoothness the values in the graphs are running averages over the last 200 episodes. The most important metric is the average reward, and that it is increasing during training. The test score rewards were the average rewards collected over 5 episodes at even intervals during the training. The test scores were used to determine the best policy available from a training run. The goal of the experiments is to maximize the average reward the agent receives. The data presented during training include the average collected reward, the standard deviation of the collected reward, the policy loss, the value loss and the test scores. For readability all graphs are presented for the modular reacher training, but only the average collected reward for all other training runs.

4.2.1 Reacher Environment

For the modular training on the reacher environment all graphs recorded during training are shown. In Figures 17 to 20 all the data, except the average reward, from the modular reacher training are presented. The graphs in Figures 21, 22 and 23 depict the average reward from the sample training runs for the three different network architectures on the reacher environment. The policy with the highest test score during training was saved and later used for the pose evaluation presented in Table 8. Links to videos of these policies are presented in Table 9. Table 10 contains links to videos showing the behavior of the modular approach, the second entry in Table 8, on the reacher environment but with dynamic targets.

Figure 17: The standard deviation of rewards for the modular reacher training. A high average reward along with a small standard deviation is preferable. A small standard deviation indicates that the policy is reliable and acts consistently. A large standard deviation means that the agent collects both relatively high and relatively low average rewards, indicating that the agent is inconsistent.

Figure 18: The policy loss (magnitude) for the reacher modular training. The policy loss is the loss corresponding to the actor in the actor-critic algorithm. A small loss indicates that the policy performs well and does not need to change as much. However, it is difficult to extract much information from the policy loss, but it can be useful when debugging the algorithm.

Figure 19: The value loss for the reacher modular training. This is the loss for the critic in the actor-critic policy. A small value loss indicates that the agent correctly predicts the value corresponding to the visited states.

Figure 20: The test reward from the reacher modular training. Tests were performed at even intervals throughout the training and the best scoring policy was saved for evaluation.

Figure 21: The average reward from the reacher modular training. The goal of the training is to maximize the average reward.

Figure 22: The average reward for the semi-modular reacher training. The goal of the training is to maximize the average reward.

Figure 23: The average reward for the combined reacher training. The goal of the training is to maximize the average reward.

4.2.2 Humanoid Environment

The average collected reward from the humanoid modular sample training is shown in Figure 24. The test policy with the highest score was saved and used in the pose evaluation presented in Table 8. The potential from the target is shown in Figure 25 for an evaluation run of 1200 frames. Graphs of the reward standard deviation, the policy and value losses and the test score have been left out for readability.

Figure 24: The average reward collected during the humanoid modular training. The goal of the training is to maximize the average reward.

4.3 Pose Evaluation

The best policies from each training run were used for a pose evaluation. In a fixed number of frames the goal was to reach as many poses as possible. The results from the pose evaluation are presented in Table 8. Links to videos showing the performance of each policy during the pose evaluation are presented in Table 9. For the reacher environment, videos were recorded as a qualitative evaluation of how well the behaviors that mimic static targets would do on dynamic targets. Links to the results are presented in Table 10. The target is switched every n:th frame; for n = 10 the agent is fast enough to mimic the motions, but for n = 3 and n = 1 the agent cannot mimic the dynamic poses correctly.

Table 8: The pose evaluation results for the different custom roboschool policies trained. The fraction of completed poses versus the total attempted poses.

Approach                                                  Poses reached/total (%)
Reacher
Modular with ground truth target states                   119/123 (96.7)
Modular with understanding target state prediction        115/121 (95.0)
Semi-modular with ground truth target states              94/98 (95.9)
Semi-modular with understanding target state prediction   86/91 (94.5)
Combine                                                   85/95 (89.5)
Humanoid
Modular with ground truth target states                   0/34 (0)

Table 9: Pose mimic evaluation videos. Same policies as in Table 8 but recorded for fewer frames. The videos show the agent to the left and the target poses to the right.

Approach                  URL
Modular + ground truth    https://www.youtube.com/watch?v=OsiIsOq3_5Q
Modular + understand      https://www.youtube.com/watch?v=q4RH8UkrQAI
Semi-Mod + ground truth   https://www.youtube.com/watch?v=HRaYJ-UImik
Semi-Mod + understand     https://www.youtube.com/watch?v=_1Fakj3P8uk
Combine                   https://www.youtube.com/watch?v=qDh0VEyhzQg

Figure 25: Potential from targets for the modular humanoid training. The x-axis represents frames, the y-axis the potential away from a target and the red lines are where the target is changed.

Table 10: Links to videos where an agent to the left mimics dynamic targets to the right. The targets are updated every N:th frame and the videos are sped up such that the target moves in "real time" and the agent N times faster.

N    URL
10   https://www.youtube.com/watch?v=a_Ver7HfRMs&feature=youtu.be
3    https://www.youtube.com/watch?v=YloLFdsIXeA&feature=youtu.be
1    https://www.youtube.com/watch?v=XoDd66nIyVk&feature=youtu.be

4.4 Pepper

In the following section the data from the Pepper training sessions are presented. For readability only the average reward metric is presented here, along with the evaluations. The average reward from the three sample training sessions is found in Figures 26, 27 and 28. The last policies at the end of the sample runs were used to produce the videos listed in Table 11. In these videos the agent is trying to reach the "hurray" pose depicted in the right hand side of the video. Figure 29 shows the corresponding potentials away from the target for the different videos, where the red lines indicate where the episode is reset. The goal is to have a potential that is as small as possible.

Figure 26: The average reward from training on the Pepper environment with the absolute reward function on one target. The training ran for 300k frames and took 6.5h.

Figure 27: The average reward from training on the Pepper environment with the velocity reward function on one target. The training ran for 550k frames and took 11.5h.

Figure 28: The average reward from training on the Pepper environment with the velocity reward function and on random targets. The training ran for 800k frames and took 16h.

Table 11: Links to Youtube videos presenting the behavior of the three Pepper training runs. The videos show the agent to the left and a target pose to the right.

Trained on the relevant target with the absolute reward for 300k frames:
https://www.youtube.com/watch?v=yBP7v3WjtTw&feature=youtu.be

Trained on the relevant target with the velocity reward for 550k frames:
https://www.youtube.com/watch?v=bLgIKDQU0dg&feature=youtu.be

Trained on random targets with the velocity reward for 800k frames:
https://www.youtube.com/watch?v=TTVn1nHdNaY&feature=youtu.be

Figure 29: Graphs of the potential from targets corresponding to the videos in Table 11, where the x-axis represents the number of frames and the y-axis the potential from targets. The agent is set to the initial pose every 300 frames, indicated by the thin red lines. (a) The policy trained on a single target with the absolute reward, evaluated on the "hurray" pose. (b) The policy trained on a single target with the velocity reward, evaluated on the "hurray" pose. (c) The policy trained on random targets with the velocity reward function, evaluated on the "hurray" pose. (d) The policy trained on random targets with the velocity reward function, evaluated on 5 random poses.

4.5 Code

All code that was used in this project is freely available on GitHub at https://github.com/ErikEkstedt/Gestures and the installation instructions are found in the README.

5 Discussion and Future Work

The main goal of this project was to create a basic framework for learning social movements by applying deep reinforcement learning, in an end to end fashion, to agents with as little prior knowledge as possible. The basic idea, named the dancer in the mirror approach, was meant to be plausible as a training setup for actual robots as well as virtual ones. Two environments were created for this purpose. The custom roboschool environment was designed for virtual agents and the Pepper environment was designed as a starting point for implementing the training in an actual robot. In the following sections the results from the experiments for the specific environments are discussed, followed by some general remarks on the project as a whole, ending with future work.

5.1 Custom Roboschool Environment

The custom roboschool implementation took place after much time had been spent on the slow Pepper environment, training without any functional results. Because of this it was important to make sure that every step of implementing the new environment was carefully executed and that the choices made were well considered. The first thing to verify was the parallelization of the simulations, making the training faster and more diverse. The Pepper environment could not be run in parallel and was very slow, making the development process cumbersome. The custom roboschool environment, however, was built to be used in machine learning frameworks and made the training and the debugging faster.

The experiments on the reacher environment show that the average reward was increasing during training for all runs, as seen in Figures 21, 22 and 23. This indicates that the environment is functional, because the algorithm learned behaviors that collect more and more reward. The pose evaluation shows that the trained reacher policies were able to mimic poses with high accuracy, as presented in Table 8. Using the ground truth internal states did not show a large improvement over using the predicted target states, probably due to the low complexity of the reacher environment. The observations were from a single point of view with everything visible and no complex background or other objects present, making the distribution of possible observations quite small. The understanding network could possibly memorize the data and predict the states with a high degree of accuracy instead of actually generalizing.

Having the ability to mimic static targets is a first step, but social movements are dynamic by nature. After successfully mimicking static poses, the next step was to test the performance on movements. Movements are sequences of static poses, and therefore the same tests could be run as in the static case but with continuous targets updated every n:th frame. As seen in the videos in Table 10, the model holds up for updates every 10 frames but breaks down for 3 and 1 frames per update and cannot keep up with the moving targets. For real time mimicking, where the targets are updated every frame, the model cannot follow the targets at all. The policy did not know anything about the dynamics of the target and could therefore not choose actions that mimic the movements fast enough. The state targets used in the pose mimicking do not contain any velocity values and the policy only receives one data point at a time as input. In other words, the agent had no information about the movement of the targets, and it is expected that it would do poorly on the dynamic pose mimicking task.

The humanoid environment was more complex than the reacher and had dependencies which made it vulnerable to more problems during training. One problem was that the joints could get stuck, and if a policy was unlucky enough to learn this behavior it could be almost impossible to learn anything else. The way that the limbs interact with the friction and movement constraints was not fully explored but did make the environment complex. These traits, combined with the fact that the humanoid operated in 3D space under gravity, indicate that the training time should be longer than that of the reacher. The trained humanoid policy did not reach any poses during the pose evaluation. However, the sample training session shows that the average reward was increasing, see Figure 24. The evaluation, presented in Figure 25, shows that this policy does decrease the potential away from targets but has not learned to fully reach them and stop moving. Instead, when close to the target the agent moves around, making the potential oscillate in an irregular fashion. To disincentivize this behavior the reward could add a cost for moving, as in the velocity reward function with costs defined in section 3.4.

There were many constraints on these custom environments which make them unsuitable for training agents to learn social behaviors. The agents were static in space with only the ability to move their limbs. The distribution of possible observations was clearly constrained by only implementing targets from a single point of view. This property alone makes the environment ill-suited for general behaviors, and one can imagine many "shortcut" behaviors that would break down if the view angle was changed ever so slightly.

5.2 Pepper

All the training sessions shown in section 4.4 did improve the average expected reward, as shown in Figures 26, 27 and 28. This increase indicates that the framework is functional and that the policies can learn. The first Pepper session trained on a single target with the absolute reward, and in the video linked in Table 11 it is shown how the policy moves towards the "hurray" pose by raising its arms but has not yet learned to bend the elbows appropriately. After having both arms straight out to the sides, the policy diverges from the target. This policy was trained for the least number of frames and used the most basic reward function, so it is expected that it would perform the worst of the three sample trainings. The second training also trained on a single target but with the velocity reward function and for a longer duration. The potential graph collected during the video indicates that it learned to reach the specific target to some degree. It moved slightly around an average potential close to the target, as shown in Figure 29b. In the training graph in Figure 27, after 400k frames there is a stagnation of the returns, which indicates that the policy found a local optimum and could not increase its performance. If the resulting behavior is the desired one, then more complex tasks should be considered. The training utilized one specific target and therefore the behaviors learned are not general; however, it is valuable to know that the framework is able to learn how to reach at least one pose.

The final training was done on random targets with the velocity reward function and trained for the longest duration. The chosen "hurray" pose was never reached and the policy movements stagnated at a certain potential from the target, as seen in the graph in Figure 29c and in the corresponding video. When evaluating the policy on five random targets, Figure 29d shows that it has learned to move towards the target poses but has trouble moving all the way there, oscillating around a potential close to one for the five random targets, the same behavior seen for the "hurray" pose. The result is still interesting, and it seems plausible that, with more training frames and some experimenting with the neural network architecture, the task of reaching a goal pose is possible.

Despite being trained for fewer frames than the policies trained on the custom roboschool environments, the Pepper training produced policies that, judging from the videos in Table 11, actually could mimic poses to some extent. Comparing the humanoid potential graph in Figure 25 with the Pepper potentials in Figures 29b, 29c and 29d indicates that the behavior learned in the Pepper environment surpasses that of the humanoid environment. This is most likely due to the prior movement abilities of Pepper. This prior ability was taken into consideration when designing the actions, but its influence cannot be entirely dismissed, perhaps not even partially.

5.3 Project

This project was designed to be an exercise in thinking about a high level problem, such as learning social movements, then about what the requirements for such a problem would be, and how to implement a simple starting point as a basis for research. By trying to define as many aspects of the learning as possible, such as the environments, the algorithms, the neural networks and the optimization, one might have a better chance at learning how all the different concepts work together and what is worth digging deeper into. A reason to implement a learning framework is that such frameworks may play an ever increasing role in the field of AI and machine learning in the future. As algorithms become more and more efficient at extracting relevant information and learning, the more important these frameworks will become. For higher abstraction fields of research such as human-robot interaction, these frameworks might become the major factor in the way machines learn different behaviors. Therefore, it may be useful to get an understanding of how such frameworks work and what the important factors are when designing them. The idea in this project was to start small on a problem that I was confident would be possible to solve, in this case static pose mimicking. After a basic framework was functional, one could see how changes made to the system would change the behaviors of the algorithms and what would be possible to learn, and then find ways of incrementally upgrading it. By training on behaviors in this incremental fashion there could come surprises where a behavior thought to be as difficult as another proved to be much harder, thus yielding a signal of where to focus the research.

In this project everything used was custom built and it was hard to estimate how long any practical implementation would take. This was especially true with reinforcement learning being part of the pipeline from the start. In the case of the Pepper environment there were difficulties reaching convergence, which led to uncertainties about the learning framework's functionality. Because of the long training time of the Pepper environment the debugging process was difficult and progress was slow. This finally led to a switch to the roboschool framework, which is more suited for machine learning. This means that time was spent on learning how both the Pepper and the roboschool environments work and how they could be customized in a way suited for this project. However, after the switch the training got faster, the debugging easier and soon the roboschool environments were functional. At the end of the project the Pepper environment was reformatted to the framework created around roboschool, making it functional. Starting with an appropriate environment such as roboschool could have saved time. The problems encountered when implementing the environments in which to train the networks took longer than anticipated and left little time for the actual machine learning experiments, which were meant to be a major focus of the project.

5.4 Future Work

The implementation using Choregraphe as a simulation environment for the reinforcement learning setup was functional, albeit not an efficient approach. Future work for the Pepper part of the framework is to both speed up and parallelize the training. Therefore, looking into other robot simulation platforms such as ROS [43] and Gazebo [44] is encouraged. Being able to train faster and in parallel will become increasingly important when the goal behaviors get more complicated. Longer training sessions could show whether reaching arbitrary static poses in the Pepper environment is possible and can be done reliably. The behavior could be improved by adding costs to the reward function that minimize the amount of force used, which could make the policies stay still in the relevant poses instead of restlessly moving around. If static poses are reliably reached, then dynamic pose mimicking training would be the next step. However, when dealing with robots or systems with prior knowledge of movement it is arguably unnecessary to learn movements in an end to end fashion. The most difficult part of the current task is probably to sufficiently classify another agent's pose. If a translation between the target pose and some internal representation can be made, then it would not matter whether the actual movements were implemented by a learned algorithm or not. Arguably the role of the neural networks should be to learn when, how and why to implement a certain action, while relying on movements designed by control theory and mainstream robotics algorithms. This is especially true for real life robots where expensive components and systems are used.

For the custom roboschool environments, investigation into better reward functions for the humanoid environment should be considered. More and longer training sessions on the humanoid environment should be performed to produce better coordination networks as well as a viable understanding network. If the desired behaviors can be learned reliably, then there are many things to do in order to improve the environment. These improvements could be to extend the environments to incorporate moving agents and targets, along with more complex tasks and their related reward functions.

6 Conclusion

The conclusion of this project is that, in a virtual setting, the dancer in the mirror approach is functional as a way of mimicking poses. The implemented setup did not generalize to multiple viewing angles or to mimicking other kinds of agents with different traits than the agent being trained. Both of these aspects are important factors for any social context. The project shows that the setup is viable, but that in order to implement a learning framework that could be used to train movements for real social contexts a larger project involving multiple parties should be considered. The last part of the conclusion answers the research questions posed in the beginning of the thesis.

Can the dancer in the mirror approach train agents to mimic poses?

The results for the simple reacher environment, seen in Tables 8 and 9, show that it is possible to learn to mimic poses with a dancer in the mirror approach. This environment is, however, simple and the external observations of both the target and the agent are from one single viewing angle, a small subset of the data distribution needed for social mimicking in the wild. From the training graphs in Figures 24 and 28, for the humanoid and Pepper respectively, the average reward is increasing, meaning that the systems do become better.

Can the humanoid robot Pepper learn to mimic poses in the same way?

The videos of Pepper in Table 11 and the potential graphs in Figure 29 show that the dancer in the mirror approach seems sufficient for Pepper to mimic poses. This is the case if trained on one target with the velocity reward, as in the second run, but it also seems plausible that, with more training, it could reach random poses to a sufficient degree. The potential graph in Figure 29 indicates that it has learned to execute movements to get close to the random targets but oscillates at a certain distance away from the target. This should be possible to solve with more training. However, the observations are from a single point of view and are a small subset of the possible virtual observations, not to mention the distribution of observations encountered in the wild.

Can this be generalized to mimic dynamic motions and more complex behaviors?

There was no extensive training done on dynamic targets, but the policy for the modular approach on the reacher environment, trained on static targets, was evaluated on dynamic targets. This evaluation shows that the algorithm is fast enough to mimic movements only if it gets extra time between target frames to reach their configuration, see the videos in Table 10. This algorithm is not sufficient to mimic movements, but the result indicates that, with appropriate training, the dancer in the mirror approach could plausibly be extended to incorporate dynamic targets. More training on the specific task of mimicking dynamic targets needs to be done before any assessment about more complex tasks can be made.

References

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “ImageNet Classification with Deep Convolutional Neural Networks”. In: Advances in Neural Information Processing Systems 25. Ed. by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger. Curran Associates, Inc., 2012, pp. 1097–1105. url: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. “ImageNet: A Large-Scale Hierarchical Image Database”. In: CVPR09. 2009.
[3] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. “ImageNet Large Scale Visual Recognition Challenge”. In: International Journal of Computer Vision (IJCV) 115.3 (2015), pp. 211–252. doi: 10.1007/s11263-015-0816-y.
[4] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. “Playing Atari with Deep Reinforcement Learning”. In: CoRR abs/1312.5602 (2013). arXiv: 1312.5602. url: http://arxiv.org/abs/1312.5602.
[5] David Silver, Aja Huang, Christopher J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529 (2016), pp. 484–503. url: http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html.
[6] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis. “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm”. In: ArXiv e-prints (Dec. 2017). arXiv: 1712.01815 [cs.AI].

[7] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. “Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation”. In: CoRR abs/1609.08144 (2016). arXiv: 1609.08144. url: http://arxiv.org/abs/1609.08144.
[8] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. “WaveNet: A Generative Model for Raw Audio”. In: CoRR abs/1609.03499 (2016). arXiv: 1609.03499. url: http://arxiv.org/abs/1609.03499.
[9] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. “The Arcade Learning Environment: An Evaluation Platform for General Agents”. In: Journal of Artificial Intelligence Research 47 (2013), pp. 253–279.
[10] Norman Tasfi. PyGame Learning Environment. https://github.com/ntasfi/PyGame-Learning-Environment. 2016.
[11] Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, John Quan, Stephen Gaffney, Stig Petersen, Karen Simonyan, Tom Schaul, Hado van Hasselt, David Silver, Timothy P. Lillicrap, Kevin Calderone, Paul Keet, Anthony Brunasso, David Lawrence, Anders Ekermo, Jacob Repp, and Rodney Tsing. “StarCraft II: A New Challenge for Reinforcement Learning”. In: CoRR abs/1708.04782 (2017). arXiv: 1708.04782. url: http://arxiv.org/abs/1708.04782.
[12] Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, Julian Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton, Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, and Stig Petersen. “DeepMind Lab”. In: CoRR abs/1612.03801 (2016). arXiv: 1612.03801. url: http://arxiv.org/abs/1612.03801.
[13] J. Z. Leibo, C. de Masson d'Autume, D. Zoran, D. Amos, C. Beattie, K. Anderson, A. García Castañeda, M. Sanchez, S. Green, A. Gruslys, S. Legg, D. Hassabis, and M. M. Botvinick. “Psychlab: A Psychology Laboratory for Deep Reinforcement Learning Agents”. In: ArXiv e-prints (Jan. 2018). arXiv: 1801.08116 [cs.AI].

[14] Y. Wu, Y. Wu, G. Gkioxari, and Y. Tian. “Building Generalizable Agents with a Realistic and Rich 3D Environment”. In: ArXiv e-prints (Jan. 2018). arXiv: 1801.02209 [cs.LG].
[15] I. Casanueva, P. Budzianowski, P.-H. Su, N. Mrkšić, T.-H. Wen, S. Ultes, L. Rojas-Barahona, S. Young, and M. Gašić. “A Benchmarking Environment for Reinforcement Learning Based Task Oriented Dialogue Management”. In: ArXiv e-prints (Nov. 2017). arXiv: 1711.11023 [stat.ML].
[16] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. 2016. eprint: https://arxiv.org/abs/1606.01540.
[17] Softbank Robotics Europe. Pepper. 2018. url: https://www.ald.softbankrobotics.com/en/robots/pepper.
[18] E. Pot, J. Monceaux, R. Gelin, and B. Maisonnier. “Choregraphe: a graphical tool for humanoid robot programming”. In: RO-MAN 2009 - The 18th IEEE International Symposium on Robot and Human Interactive Communication. 2009, pp. 46–51. doi: 10.1109/ROMAN.2009.5326209.
[19] OpenAI. Roboschool. url: https://blog.openai.com/roboschool/.
[20] Rick van Baaren, Loes Janssen, Tanya L. Chartrand, and Ap Dijksterhuis. “Where is the love? The social aspects of mimicry”. In: Philosophical Transactions of the Royal Society of London B: Biological Sciences 364.1528 (2009), pp. 2381–2389. issn: 0962-8436. doi: 10.1098/rstb.2009.0057. eprint: http://rstb.royalsocietypublishing.org/content/364/1528/2381.full.pdf. url: http://rstb.royalsocietypublishing.org/content/364/1528/2381.
[21] HELLKNOWZ Chrislb. ANN. 2018. url: https://commons.wikimedia.org/wiki/File:MultiLayerNeuralNetworkBigger_english.png.
[22] Richard H. R. Hahnloser, Rahul Sarpeshkar, Misha A. Mahowald, Rodney J. Douglas, and H. Sebastian Seung. “Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit”. In: Nature 405 (2000), 947 EP –. url: http://dx.doi.org/10.1038/35016072.
[23] J. Schmidhuber. “Deep Learning in Neural Networks: An Overview”. In: Neural Networks 61 (2015). Published online 2014; based on TR arXiv:1404.7828 [cs.NE], pp. 85–117. doi: 10.1016/j.neunet.2014.09.003.
[24] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In: CoRR abs/1412.6980 (2014). arXiv: 1412.6980. url: http://arxiv.org/abs/1412.6980.

[25] Chrislb. RNN. 2018. url: https://commons.wikimedia.org/wiki/File:RecurrentLayerNeuralNetwork.png.
[26] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. “Human-level control through deep reinforcement learning”. In: Nature 518 (2015), 529 EP. url: http://dx.doi.org/10.1038/nature14236.
[27] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. “Asynchronous Methods for Deep Reinforcement Learning”. In: CoRR abs/1602.01783 (2016). arXiv: 1602.01783. url: http://arxiv.org/abs/1602.01783.
[28] Ronald J. Williams. “Simple statistical gradient-following algorithms for connectionist reinforcement learning”. In: Machine Learning 8.3 (1992), pp. 229–256. issn: 1573-0565. doi: 10.1007/BF00992696. url: https://doi.org/10.1007/BF00992696.
[29] Ronald J. Williams and Jing Peng. “Function Optimization using Connectionist Reinforcement Learning Algorithms”. In: Connection Science 3.3 (1991), pp. 241–268. doi: 10.1080/09540099108946587. eprint: https://doi.org/10.1080/09540099108946587. url: https://doi.org/10.1080/09540099108946587.
[30] John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. “High-Dimensional Continuous Control Using Generalized Advantage Estimation”. In: CoRR abs/1506.02438 (2015). arXiv: 1506.02438. url: http://arxiv.org/abs/1506.02438.
[31] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. “Proximal Policy Optimization Algorithms”. In: CoRR abs/1707.06347 (2017). arXiv: 1707.06347. url: http://arxiv.org/abs/1707.06347.
[32] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. “Trust Region Policy Optimization”. In: CoRR abs/1502.05477 (2015). arXiv: 1502.05477. url: http://arxiv.org/abs/1502.05477.
[33] Softbank Robotics Europe. Pepper. Feb. 11, 2018. url: https://commons.wikimedia.org/wiki/File:Pepper_the_Robot.jpg.
[34] Softbank Robotics. Softbank Robotics. Feb. 11, 2018. url: http://doc.aldebaran.com/2-5/family/pepper_technical/joints_pep.html.
[35] Softbank Robotics. qi. url: http://doc.aldebaran.com/2-4/dev/libqi/api/python/index.html#py-api-index.

[36] OpenAI. OpenAI. Feb. 5, 2018. url: https://openai.com/.
[37] Greg Brockman and John Schulman. OpenAI Gym Beta. Apr. 27, 2016. url: https://blog.openai.com/openai-gym-beta/.
[38] E. Todorov, T. Erez, and Y. Tassa. “MuJoCo: A physics engine for model-based control”. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. 2012, pp. 5026–5033. doi: 10.1109/IROS.2012.6386109.
[39] E. Coumans. Bullet Physics. url: https://pybullet.org.
[40] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. “Automatic differentiation in PyTorch”. In: (2017).
[41] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. PyTorch. url: https://github.com/pytorch/pytorch.
[42] Erik Ekstedt. Repository for project. url: https://github.com/ErikEkstedt/Gestures.
[43] Morgan Quigley, Ken Conley, Brian P. Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, and Andrew Y. Ng. “ROS: an open-source Robot Operating System”. In: ICRA Workshop on Open Source Software. 2009.
[44] N. Koenig and A. Howard. “Design and use paradigms for Gazebo, an open-source multi-robot simulator”. In: 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566). Vol. 3. 2004, 2149–2154 vol.3. doi: 10.1109/IROS.2004.1389727.
