Reinforcement Learning for Synchronised Swimming

Siddhartha Verma, Guido Novati, Petros Koumoutsakos

FSIM 2017, May 25, 2017

CSElab: Computational Science & Engineering Laboratory
http://www.cse-lab.ethz.ch

Motivation

• Why do fish swim in groups?
  • Social behaviour?
    • Anti-predator benefits
    • Better foraging/reproductive opportunities
    • Better problem-solving capability

• Propulsive advantage?

(Image credit: Artbeats)

• Classical works on schooling: Breder (1965), Weihs (1973, 1975), …
  • Based on simplified inviscid models (e.g., point vortices)
• More recent works consider pre-assigned, fixed formations: Hemelrijk et al. (2015), Daghooghi & Borazjani (2015)

• But: actual formations evolve dynamically (Breder (1965), Weihs (1973), Shaw (1978), Hemelrijk et al. (2015))
• Essential question: is it possible to swim autonomously in a way that is energetically favourable?

Vortices in the wake

• Depending on Re and the swimming kinematics, a double row of vortex rings forms in the wake

• Our goal: exploit the flow generated by these vortices

Numerical methods

• Remeshed vortex methods (2D)
• Solve the vorticity form of the incompressible Navier-Stokes equations:

  ∂ω/∂t + (u·∇)ω = (ω·∇)u + ν∇²ω + λ∇×(χ(u_s − u))

  with advection, diffusion, and penalization terms; the stretching term (ω·∇)u vanishes in 2D.
• Brinkman penalization accounts for the fluid-solid interaction: Angot et al., Numerische Mathematik (1999)
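The penalization term can be illustrated in isolation. This is a toy sketch, not the authors' remeshed-vortex solver; the penalty coefficient, the solid indicator, and the 1-D grid are all invented for illustration:

```python
# Minimal sketch of a Brinkman penalization step:
# u <- u + dt * lam * chi * (u_s - u) relaxes the fluid velocity toward
# the solid velocity wherever the solid indicator chi = 1.
def penalize(u, chi, u_s, lam, dt):
    return [ui + dt * lam * ci + 0.0 if False else ui + dt * lam * ci * (usi - ui)
            for ui, ci, usi in zip(u, chi, u_s)]

# 1-D example: a solid at rest occupying cells 3-4, fluid moving at u = 1.
u   = [1.0] * 8
chi = [1.0 if i in (3, 4) else 0.0 for i in range(8)]
u_s = [0.0] * 8

for _ in range(100):
    u = penalize(u, chi, u_s, lam=100.0, dt=0.01)

# Inside the body, the velocity is driven to the solid velocity (0);
# outside the body (chi = 0) it is untouched.
```

The stiff penalty coefficient lam makes the no-slip condition approximate but grid-friendly; the solver then takes the curl of this penalized velocity to get the vorticity source term shown in the equation above.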

• 2D: Wavelet-based adaptive grid

• Cost-effective compared to uniform grids Rossinelli et al., J. Comput. Phys. (2015)

• 3D: Uniform-grid Finite Volume solver: Rossinelli et al., SC'13 Proc. Int. Conf. High Perf. Comput., Denver, Colorado

Interacting swimmers

• Simple model of fish schooling: a leader and a follower
• Re = L²/(Tν) = 5000

• Trailing fish’s head intercepts positive vorticity: velocity increases

Initial tail-to-head distance = 1 L

Swimming in a wake can be beneficial or detrimental

Trailing fish’s head intercepts negative vorticity: velocity decreases

Initial tail-to-head distance = 1.25 L

The need for control

• Without control, the trailing fish gets kicked out of the leader’s wake

• Manoeuvring through an unpredictable flow field requires:
  • the ability to observe the environment
  • knowledge of how to react appropriately

• The swimmers need to learn how to interact with the environment!

Reinforcement Learning: The basics

• An agent learns the best action through trial-and-error interaction with the environment

• Actions have long-term consequences
• Reward (feedback) is delayed
• Goal: maximize cumulative future reward
• Specify what to do, not how to do it

Notes:
• The learning rate α plays a role analogous to the relaxation factor in SOR: U(s) ← (1 − α) U(s) + α (R(s) + γ U(s'))
• The discount factor γ determines how far into the future we account for rewards, modelling the fact that future rewards are worth less than immediate ones: γ = 0 makes the agent "myopic" (only current rewards count), while γ approaching 1 makes it strive for long-term reward. If γ meets or exceeds 1 and there is no terminal state, the action values may diverge.
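The relaxation-style update noted above, U(s) ← (1 − α) U(s) + α (R(s) + γ U(s′)), can be sketched on a toy problem; the 5-state chain and its rewards below are invented purely for illustration:

```python
# Value update U(s) <- (1 - alpha) U(s) + alpha (R(s) + gamma U(s')) on a
# toy chain: states 0..4, each step moves right, reward -1 per step,
# 0 at the terminal state.
alpha, gamma = 0.5, 0.9
R = [-1.0, -1.0, -1.0, -1.0, 0.0]
U = [0.0] * 5

for _ in range(200):            # repeated sweeps converge to the fixed point
    for s in range(4):          # state 4 is terminal, so U[4] stays 0
        U[s] = (1 - alpha) * U[s] + alpha * (R[s] + gamma * U[s + 1])

# At the fixed point U[s] = R[s] + gamma * U[s+1], e.g. U[3] = -1.0,
# U[2] = -1.9: the discount makes distant costs weigh less.
```

With γ = 0 every state would converge to its immediate reward of −1; with γ = 0.9 the values reflect the whole discounted future, which is exactly the trade-off the discount factor controls.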

Example: Maze solving

• State: the agent’s position (A)
• Actions: go U, D, L, R
• Reward: −1 per step taken; 0 at the terminal state
• The agent receives feedback
• The expected reward is updated in previously visited states
• Now we have a policy

Q^π(s_t, a_t) = E[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … | a_k = π(s_k) ∀ k > t ]
Q^π(s_t, a_t) = E[ r_{t+1} + γ Q^π(s_{t+1}, π(s_{t+1})) ]   (Bellman, 1957)

Playing Games: Deep Reinforcement Learning

• Stable training of the NN towards Q:
  • Sample past transitions: experience replay
    • Breaks correlations in the data
    • Learns from all past policies
  • "Frozen" target Q-network to avoid oscillations

Acting, at each iteration:
• the agent is in a state s
• select action a:
  • greedy: based on max_a Q(s, a, w)
  • explore: random
• observe the new state s' and reward r
• store the tuple {s, a, s', r} in memory

Learning, at each iteration:
• sample a tuple {s, a, s', r} (or a batch)
• update w.r.t. the target computed with the old weights:
  ∂/∂w ( r + γ max_{a'} Q(s', a', w⁻) − Q(s, a, w) )²
• periodically update the fixed weights: w⁻ ← w
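A minimal sketch of this update rule with a frozen target, using a linear Q-function in place of the network; the feature dimensions, random inputs, and learning rate are made up for illustration:

```python
import numpy as np

# TD update against a "frozen" copy of the weights:
# target = r + gamma * max_a' Q(s', a', w_frozen), then an SGD step on w.
rng = np.random.default_rng(0)
n_features, n_actions = 4, 3
w = rng.normal(size=(n_actions, n_features))   # live weights
w_frozen = w.copy()                            # periodically: w_frozen <- w
gamma, lr = 0.9, 0.01

def q(weights, s):
    return weights @ s            # vector of Q(s, a) over all actions

def td_update(w, s, a, r, s2):
    target = r + gamma * q(w_frozen, s2).max()   # bootstrapped, but frozen
    error = target - q(w, s)[a]
    w[a] += lr * error * s                       # SGD step on the squared error
    return error

s, s2 = rng.normal(size=n_features), rng.normal(size=n_features)
e1 = td_update(w, s, a=0, r=1.0, s2=s2)
e2 = td_update(w, s, a=0, r=1.0, s2=s2)
# Because the target stays fixed between copies, repeated updates shrink
# the TD error instead of chasing a moving target.
```

If the target were computed from the live weights w, it would shift after every SGD step; the frozen copy w⁻ is what keeps the regression target stationary between synchronisations.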

V. Mnih et al., "Human-level control through deep reinforcement learning." Nature (2015)

Go!

Google’s AI AlphaGo beat 18-time world champion Lee Sedol in 2016. AlphaGo won 4 out of 5 games in the match

BBC (May 23, 2017): “Google’s AI takes on champion in fresh Go challenge”
• …AlphaGo has won the first of three matches it is playing against the world's number one Go player, Ke Jie
• …Ke Jie said:

• “There were some unexpected moves and I was deeply impressed.”
• “I was quite shocked as there was a move that would never happen in a human-to-human Go match."

American Go Association: there are many more possible go games (10 followed by more than 300 zeroes) than there are subatomic particles in the known universe.

Reinforcement Learning: Actions & States

Actions: turn and modulate velocity by controlling the body deformation
• Increase curvature

• Decrease curvature

States:
• Orientation relative to the leader: Δx, Δy, θ
• Time since the previous tail beat: Δt
• Current shape of the body (manoeuvre)

Reinforcement Learning: Reward

Goal #1: learn to stay behind the leader

• Reward based on vertical displacement: R = 1 − |Δy| / (0.5 L)

• Failure condition (R_end = −1): stray too far, or collide with the leader
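A sketch of this Goal-#1 reward; the functional form R = 1 − |Δy|/(0.5 L) is reconstructed from the slide and should be treated as an assumption, and the numbers below are invented:

```python
# Displacement-based reward: positive near the leader's centreline,
# negative beyond half a body length, terminal penalty on failure.
def displacement_reward(dy, L, failed=False):
    if failed:                  # strayed too far or collided with the leader
        return -1.0
    return 1.0 - abs(dy) / (0.5 * L)

r_center = displacement_reward(dy=0.0, L=1.0)               # on the centreline
r_off    = displacement_reward(dy=0.25, L=1.0)              # a quarter length off
r_fail   = displacement_reward(dy=0.8, L=1.0, failed=True)  # terminal failure
```

The shaping is dense: every step gives a signal proportional to how well the follower holds station, rather than a single sparse reward at the end of the episode.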

Goal #2: learn to maximise swimming efficiency

Reward: efficiency

  R_η = P_thrust / (P_thrust + max(P_def, 0))

with thrust power P_thrust = T·|u_CM| and deformation power P_def = ∫_∂Ω F(x)·u_def(x) dx.

Efficiency-based (Aη) smart fish

• Aη stays right behind the leader
• Logically, unsteady disturbances = trouble
• But the agent seems to have figured out how to use the disturbances to its advantage
• It synchronises the motion of its head with the lateral flow velocity generated by the wake vortices (more detailed explanation to come)
• All energetics metrics reflect an advantage

• High efficiency, low CoT, lower P_def, higher P_thrust
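The efficiency reward R_η above is a direct ratio and can be transcribed in a couple of lines; the power values below are invented for illustration:

```python
# Efficiency-based reward: R_eta = P_thrust / (P_thrust + max(P_def, 0)).
# Negative deformation power (the flow doing work on the body) is clipped
# to zero in the denominator, so the reward saturates at 1.
def efficiency_reward(p_thrust, p_def):
    return p_thrust / (p_thrust + max(p_def, 0.0))

r_costly = efficiency_reward(p_thrust=2.0, p_def=2.0)   # deformation costs power
r_free   = efficiency_reward(p_thrust=2.0, p_def=-1.0)  # flow assists the body
```

The max(·, 0) clipping is what lets the agent profit from wake vortices: moments where the flow deforms the body for free do not reduce the reward.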

What did they learn?

• Relative vertical displacement (Δy/L):
  • AΔy settles close to the centre (Δy ≈ 0)
  • Aη, surprisingly, also settles at Δy ≈ 0
• Relative horizontal displacement (Δx/L):
  • AΔy does not seem extremely worried about this
  • Aη tries to maintain Δx = 2.2 L
• AΔy undergoes a lot more twists and turns than Aη:
  • desperately trying to avoid the penalty when Δy ≠ 0
  • incurs a higher deformation power

• How does Aη’s behaviour evolve with training?
  • First 10,000 transitions
  • Last 10,000 transitions

Swimming performance

• Comparing speed, efficiency, CoT, and power measurements for 4 distinct swimmers: Aη, Sη, AΔy, SΔy
• Note: SΔy replays the actions performed by AΔy, but swims alone (likewise Sη for Aη)
• Aη: best efficiency and best CoT (Cost of Transport)
• Aη & AΔy: comparable velocity
• SΔy: worst efficiency; CoT comparable to AΔy

Time to give Aη a PhD?

• Wake-vortices lift up the wall vorticity along the swimmer’s body

• Lifted vortices give rise to secondary vortices

• Secondary vortices: high-speed region => suction due to low pressure

• This combines with the body-deformation velocity to affect P_def


Energetics over a full tail-beat cycle

• Negative troughs noticeable in the mean P_def (Aη)

• Also, higher P_thrust in the mid-body section

• P_def rarely shows negative excursions in the absence of wake-vortices (Sη)

Reacting to an erratic leader

Note: the reward allotted here has no connection to the relative displacement

[Videos: two fish swimming together in Greece; two fish swimming together in Switzerland]

Surprisingly, Aη reacts intelligently to unfamiliar conditions

• Aη never experienced deviations in the leader’s behaviour during training
• But analogous situations might have been encountered: random actions during training

• Aη can react appropriately to maximise the cumulative reward received

Let’s try to implement Aη’s strategy in 3D

• Identify target coordinates as the maximum points in the following velocity correlation:

• Implement PID controller that adjusts a follower’s undulations to maintain the target position specified
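The PID follower described above might be sketched as follows; the gains, time step, and the toy first-order plant are all invented for illustration (the actual controller adjusts the follower's undulations, not its position directly):

```python
# Illustrative discrete PID controller driving a scalar position toward
# a target, standing in for "maintain the target position specified".
class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_err = 0.0, None

    def step(self, error):
        self.integral += error * self.dt
        deriv = 0.0 if self.prev_err is None else (error - self.prev_err) / self.dt
        self.prev_err = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

# Toy plant: the position responds directly to the control input.
dt = 0.05
pid = PID(kp=2.0, ki=0.5, kd=0.1, dt=dt)
pos, target = 0.0, 1.0
for _ in range(400):
    pos += pid.step(target - pos) * dt

# pos settles near the target; the integral term removes steady-state error.
```

In the study's setting the "position" would be the target coordinates identified from the velocity correlation, and the control output would modulate the follower's undulation, but the feedback structure is the same.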

3D Wake interactions

• Wake interactions benefit the follower:
  • 11.6% increase in efficiency
  • 5.3% reduction in CoT
• The oncoming wake-vortex ring is intercepted:
  • generates a new ‘lifted-vortex’ ring (LR)
  • similar to the 2D case
• The new ring modulates the follower’s efficiency as it proceeds down the body

Summary

• Agents learn how to exploit unsteady fluctuations in the velocity field to their advantage
  • Large energetic savings, without loss in speed
• Wake-vortices: unmistakable impact on the savings (Aη vs. AΔy)
• Implemented the control strategy in 3D simulations:
  • 11% gain in efficiency
  • Physical mechanism similar to 2D

Possible to swim autonomously & energetically favourable? Yesss!

Backup

Simulation Cost (2D)

• Wavelet-based adaptive grid

• https://github.com/cselab/MRAG-I2D (Rossinelli et al., JCP (2015))

• Production runs (Re = 5000):
  • Domain: [0,1] x [0,1]
  • Resolution: 8192 x 8192
  • 1600 points along the fish midline
  • Running with 24 threads (12 hyper-threaded cores, Piz Daint)
  • 10 tail-beat cycles: 27,000 time steps
  • Approx. 96 core hours (about 1 second/step)
• Training simulations (lower resolution):
  • Resolution: 2048 x 2048
  • 10 tail-beat cycles: 36 core hours
  • Learning converges in about 150,000 tail-beats
  • 0.54 million core hours per learning episode

Simulation Cost (3D)

• Uniform-grid Finite Volume solver
  • Rossinelli et al., SC'13 Proc. Int. Conf. High Perf. Comput., Denver, Colorado
• Production runs (Re = 5000):
  • Domain: [0, 1] x [0, 0.5] x [0, 0.25]
  • Resolution: 6144 x 3072 x 768
  • 600 points along the fish midline
  • Running on 128 nodes, 24 threads each (hybrid MPI + OpenMP, Piz Daint)
  • Approx. 37,000 core hours (about 3 seconds/time step)
• Training simulations (lower resolution):
  • Resolution: 2048 x 1024 x 512
  • 10 tail-beat cycles: 21,000 time steps
  • Expected: 1.2 million core hours per learning episode

FSI Algorithm details

One time step of the FSI algorithm (single-fluid model, vorticity form):

• Refinement (ω and/or u), tolerance t_r
• Velocity from vorticity (multipoles): ∇²ψ = −ω, u = ∇×ψ, ∇·u = 0
• Density from the solid indicator functions: ρ = ρ_f + Σ_{i=1}^{N} χ_i (ρ_i − ρ_f)
• fluid→solid: u_i = (1/m_i) ∫_Ω ρ χ_i u dx, θ̇_i = J_i⁻¹ ∫_Ω ρ χ_i (x − x̄_i) × u dx; rigid-body dynamics d(M_s x)/dt = F, d(I_s ω)/dt = τ
• solid→fluid: u_S = u + u_def + u_r, with u_r = θ̇_i × (x − x̄_i)
• Penalization: u^n = (u + λΔt χ u_S) / (1 + λΔt χ), ω = ∇×u
• Stretching (RK2): ∂ω/∂t = (ω·∇)u
• Baroclinicity (explicit Euler): ∂ω/∂t = (∇ρ/ρ) × (∂u/∂t + (u·∇)u − ν∇²u − g)
• Diffusion (RK2 + LTS): ∂ω/∂t = ν∇²ω + λ∇×(χ(u_S − u))
• Advection (particles): ∂ω/∂t + u·∇ω = 0
• Explicit Euler update of the solids: x_i^{n+1} = x_i^n + u_i Δt, θ_i^{n+1} = θ_i^n + θ̇_i Δt
• Compression (ω and/or u), tolerance t_c

A flexible manoeuvring model

- Modified midline kinematics preserves the travelling wave
- Each action prescribes a point of the spline (control points at c, c+¼, c+½, c+¾, c+1), increasing or decreasing the local curvature

Examples

Effect of an action depends on when the action is made

Increasing local curvature

Reducing local curvature

Chain of actions

RL with function approximation

[Network diagram: input (agent state) → memory cells → hidden layers → output layer (cumulative reward)]

Challenges for Reinforcement Learning:
• Continuous, not discrete, states: approximate the cumulative reward with a Neural Network
  • Best action to perform in each state: argmax_action NN(state, action)
• The agent has partial knowledge of the environment: agent state ≠ environment state
  • State information alone is not enough to make optimal decisions
  • Use a network with memory (LSTM):
    • learns to remember information even if it becomes relevant much later
    • creates its own state representation by remembering history

Recurrent Neural Network

[Network diagram: input o_n → LSTM Layer 1 → LSTM Layer 2 → LSTM Layer 3 → outputs q_n^(a1), …, q_n^(a5)]

Reinforcement Learning

• Mathematical framework for teaching agents to make decisions
• The agent takes actions that influence its environment (state)
• The agent’s success is measured by (delayed) rewards
• The agent has to learn a good control policy to choose actions
• Episodic description of the problem:
  • The agent starts in state s0
  • It performs an action according to the policy: a0 = π(s0)

• Environment transitions with unknown distribution {s1, r1} ~ T (s0, a0)

• Agent might receive feedback: scalar reward (r1)

• Perform action according to policy a1 = π(s1) • …

• Goal: maximise the discounted future rewards E[ Σ_{t=0}^{T} γ^t r_t ], with γ ∈ [0, 1]
• Model-free approach: obtain the optimal policy π*(s0) without requiring a model of the environment’s transition probability T(s0, a0)

The Value Function

• The Value Function Q^π(s_t, a_t) is a prediction of the future rewards that can be obtained by:
  • starting in state s_t,
  • performing action a_t,
  • thereafter following the policy π.

  Q^π(s_t, a_t) = E[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … | a_k = π(s_k) ∀ k > t ]

• The value function can be unrolled into the Bellman equation (Bellman, 1957):

  Q^π(s_t, a_t) = E[ r_{t+1} + γ Q^π(s_{t+1}, π(s_{t+1})) ]

• The optimal Q is obtained by the policy which maximises the future rewards:

π*(s0) = argmax_{a0} Q*(s0, a0)

Q*(s_t, a_t) = E[ r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') ]

• Once we have Q*, the agent can make optimal decisions by following π*(s)

Temporal Difference Learning

• For any given policy, the Value function is a fixed point of the Bellman equation:

  Q^π(s_t, a_t) = E[ r_{t+1} + γ Q^π(s_{t+1}, π(s_{t+1})) ]

• Value-based RL methods often rely on Temporal Difference (TD) Learning (Sutton, 1988)
  • Use past experiences with a partially known environment to predict its future behaviour
• Introduce a sequence of approximate (parameterised) value functions Q^k(s, a):

  Q^{k+1}(s_t, a_t) = E[ r_{t+1} + γ max_{a'} Q^k(s_{t+1}, a') ],   lim_{k→∞} Q^k(s, a) = Q*(s, a)

• Q-Learning (Watkins, 1989):
  • Rather than the expectation, update Q^{k+1}(s, a) to minimise the TD residual:

• Update the parameters of the approximate Q by stochastic gradient descent (SGD)

RL with Function Approximation

• Q-Learning converges when using a tabular approximation of Q(s, a)
  • Requires binning of the state/action space (challenging in high-dimensional spaces)
  • The parameters to be updated are the entries in the table for each discrete (s, a)

• Other approaches to approximate the value function include:
  • Radial basis functions (Moody and Darken, 1989)
  • Feed-forward Neural Networks (Tesauro, 1994)
  • Recurrent Neural Networks / LSTM (Bakker, 2000)

• A NN with at least 3 layers can approximate any smooth function with the desired accuracy (Gorban, 1998)

• Given the right weights w, a network Q_w(s, a) can model an arbitrarily complex task
• However, naïve Q-Learning does not in general converge with Neural Networks:
  • The episodic framework leads to correlated samples
  • Even small changes in the Q-function can lead to drastic changes in the policy
  • These changes can lead to oscillations in the TD target

Deep Q-Networks

• The Deep Q-Networks (DQN) algorithm (Mnih et al., 2015) proposed two changes to Q-Learning:
  • To remove the issue of correlation in sequential training data:
    • During training, build a database of the agents’ experiences: {{s0, a0, r1, s1}, {s1, a1, r2, s2}, …, {s_{t−1}, a_{t−1}, r_t, s_t}}
    • Sample from this database for each SGD update
  • To stabilise the TD target, introduce a target Q-network:
    • Keep a copy of ‘frozen’ weights w⁻ of the Q-function
    • Compute the TD residual w.r.t. the target network: L = ( r + γ max_{a'} Q_{w⁻}(s', a') − Q_w(s, a) )²
    • Copy w⁻ from the weights w after some SGD updates

Acting, at each iteration:
• the agent is in a state s
• select action a, based on max_a Q(s, a, w), or a random action
• observe the new state s' and reward r
• store the tuple {s, a, s', r} in the data-set

Learning, at each iteration:
• sample a tuple {s, a, s', r}
• update the Q-network with an SGD step
• periodically update the weights of the target: w⁻ ← w

• One remaining challenge: the state s must
  • contain all relevant information about the environment
  • enable the agent to make optimal decisions
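The experience-replay database described above can be sketched in a few lines; the capacity, batch size, and fake transitions are illustrative choices, not the paper's:

```python
import random
from collections import deque

# Replay buffer: a bounded FIFO of {s, a, r, s'} tuples. Old experiences
# are evicted automatically once the capacity is reached.
random.seed(1)
buffer = deque(maxlen=10_000)

def store(s, a, r, s2):
    buffer.append((s, a, r, s2))

def sample_batch(batch_size):
    # uniform sampling without replacement breaks the temporal
    # correlation between consecutive transitions of an episode
    return random.sample(buffer, batch_size)

for t in range(100):                 # fake sequential episode transitions
    store(s=t, a=t % 4, r=-1.0, s2=t + 1)

batch = sample_batch(32)
```

Each SGD update then uses such a batch instead of the most recent transition, which is the first of the two DQN changes listed above.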

Partially Observable Markov Decision Process (POMDP)

• MDP: the state contains all relevant information needed to make optimal decisions
• The transition to a new state depends only on the current state and action:

  P(s_{t+1} | s_t, a_t) = P(s_{t+1} | (s_t, a_t), (s_{t−1}, a_{t−1}), …)

• POMDP: the agent receives partial information about the state of the environment
  • The agent receives observations (o)
  • The transition to a new state depends on the past history:

  P(o_{t+1} | o_t, a_t) ≠ P(o_{t+1} | (o_t, a_t), (o_{t−1}, a_{t−1}), …)

• Recurrent Neural Networks are designed to learn long-term time dependencies:
  • Credit assignment through time: increase the weight of past information, allowing the network to show the desired behaviour in the present
• An RNN is employed in this work to approximate Q

[Network diagram: observations o_0, o_1, o_2 feed through feed-forward weights w_f and recurrent weights w_r to outputs Q_0, Q_1, Q_2]