Machine Learning for Synchronised Swimming
Siddhartha Verma, Guido Novati, Petros Koumoutsakos
FSIM 2017, May 25, 2017
CSElab Computational Science & Engineering Laboratory, http://www.cse-lab.ethz.ch

Motivation
• Why do fish swim in groups?
  • Social behaviour?
  • Anti-predator benefits
  • Better foraging/reproductive opportunities
  • Better problem-solving capability
• Propulsive advantage?
  (Image credit: Artbeats)
• Classical works on schooling: Breder (1965), Weihs (1973, 1975), …
  • Based on simplified inviscid models (e.g., point vortices)
• More recent simulations consider pre-assigned, fixed formations: Hemelrijk et al. (2015), Daghooghi & Borazjani (2015)
• But: actual formations evolve dynamically (Breder 1965; Weihs 1973; Shaw 1978; Hemelrijk et al. 2015)
• Essential question: is it possible to swim autonomously in a way that is energetically favourable?

Vortices in the wake
• Depending on Re and the swimming kinematics, a double row of vortex rings forms in the wake
• Our goal: exploit the flow generated by these vortices

Numerical methods
• Remeshed vortex methods (2D)
• Solve the vorticity form of the incompressible Navier-Stokes equations:

$$\frac{\partial \omega}{\partial t} + (\mathbf{u} \cdot \nabla)\omega = (\omega \cdot \nabla)\mathbf{u} + \nu \nabla^2 \omega + \lambda \nabla \times \big(\chi(\mathbf{u}_s - \mathbf{u})\big)$$

  with the terms: advection, stretching (identically zero in 2D), diffusion, and penalization
• Brinkman penalization accounts for the fluid-solid interaction
  Angot et al., Numerische Mathematik (1999)
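To make the penalization term concrete, here is a minimal sketch (not the lab's actual solver): inside the body, where the characteristic function χ = 1, the fluid velocity is relaxed towards the solid's velocity u_s; in the fluid, where χ = 0, nothing happens. The grid sizes, λ, Δt, and all names are illustrative assumptions.

```python
import numpy as np

def penalize_velocity(u, v, us, vs, chi, lam, dt):
    """One Brinkman-penalization step (sketch, implicit in the penalty term).

    u, v   : fluid velocity components on the grid
    us, vs : solid-body velocity components (rigid motion + deformation)
    chi    : characteristic function (1 inside the body, 0 in the fluid)
    lam    : penalization parameter (large => near no-slip on the body)
    dt     : time step
    """
    # Implicit Euler for du/dt = lam * chi * (us - u), unconditionally stable:
    u_new = (u + lam * dt * chi * us) / (1.0 + lam * dt * chi)
    v_new = (v + lam * dt * chi * vs) / (1.0 + lam * dt * chi)
    return u_new, v_new

# Tiny usage example: uniform flow past a stationary square "body"
n = 8
u, v = np.ones((n, n)), np.zeros((n, n))
us, vs = np.zeros((n, n)), np.zeros((n, n))
chi = np.zeros((n, n)); chi[3:5, 3:5] = 1.0
u, v = penalize_velocity(u, v, us, vs, chi, lam=1e4, dt=1e-3)
```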
• 2D: Wavelet-based adaptive grid
• Cost-effective compared to uniform grids
  Rossinelli et al., J. Comput. Phys. (2015)
• 3D: Uniform-grid finite volume solver
  Rossinelli et al., SC'13 Proc. Int. Conf. High Perf. Comput., Denver, Colorado

Interacting swimmers

• Simple model of fish schooling: a leader and a follower

$$Re = \frac{L^2}{T\,\nu} = 5000$$

  (L: body length, T: tail-beat period)
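For reference, a back-of-the-envelope reading of this (not stated explicitly on the slide): fixing non-dimensional units L = T = 1, the quoted Reynolds number sets the viscosity,

$$\nu = \frac{L^2}{Re \, T} = \frac{1}{5000} = 2 \times 10^{-4}.$$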
Trailing fish's head intercepts positive vorticity: velocity increases
Initial tail-to-head distance = 1 L
Swimming in a wake can be beneficial or detrimental
Trailing fish’s head intercepts negative vorticity: velocity decreases
Initial tail-to-head distance = 1.25 L

The need for control

• Without control, the trailing fish gets kicked out of the leader's wake
• Manoeuvring through an unpredictable flow field requires:
  • the ability to observe the environment
  • knowledge of how to react appropriately
• The swimmers need to learn how to interact with the environment!

Reinforcement Learning: The basics
• An agent learns the best action through trial-and-error interaction with its environment
• Actions have long-term consequences
• Reward (feedback) is delayed
• Goal: maximize the cumulative future reward
• Specify what to do, not how to do it

> The learning rate (α) plays a role analogous to an SOR relaxation factor: U(s) = (1 − α) U(s) + α (R(s) + γ U(s'))
> The discount factor (γ) determines how far into the future we want to account for; it models the fact that future rewards are worth less than immediate rewards.
> "The discount factor γ determines the importance of future rewards. A factor of 0 will make the agent 'myopic' (or short-sighted) by only considering current rewards, while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge. For γ = 1, without a terminal state, or if the agent never reaches one, all environment histories will be infinitely long, and utilities with additive, undiscounted rewards will generally be infinite. Even with a discount factor only slightly lower than 1, Q-function learning leads to propagation of errors and instabilities when the value function is approximated with an artificial neural network. In that case, it is known that starting with a lower discount factor and increasing it towards its final value yields accelerated learning."
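To make the effect of γ concrete, a small sketch (the reward sequence is made up) comparing discounted returns for a single delayed reward:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^k * r_{k+1} over a finite reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [0, 0, 0, 10]  # one reward, three steps in the future
for gamma in (0.0, 0.5, 0.9, 0.99):
    print(gamma, discounted_return(rewards, gamma))
# gamma = 0.0  -> 0.00  (myopic: the delayed reward is invisible)
# gamma = 0.5  -> 1.25
# gamma = 0.9  -> 7.29  (long-term reward still matters)
# gamma = 0.99 -> 9.70
```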
Example: Maze solving

• State: agent's position (A)
• Actions: go U, D, L, R
• Reward: -1 per step taken, 0 at the terminal state
• Backpropagation: the agent receives feedback, and the expected reward is updated in previously visited states
• Now we have a policy

[Figure: maze grid with the learned state values]

$$Q^{\pi}(s_t, a_t) = E\left[\, r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \dots \;\middle|\; a_k = \pi(s_k)\ \forall k > t \,\right]$$

$$Q^{\pi}(s_t, a_t) = E\left[\, r_{t+1} + \gamma\, Q^{\pi}\big(s_{t+1}, \pi(s_{t+1})\big) \,\right] \qquad \text{Bellman (1957)}$$
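A minimal tabular Q-learning sketch for a maze like the one above. The grid size, start/goal cells, ε, α, and γ are illustrative assumptions; the reward is -1 per step, with the terminal state's value fixed at 0, as on the slide.

```python
import random

# 5x5 grid; the agent must reach the goal cell from the start cell.
N, GOAL = 5, (0, 4)
ACTIONS = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}
Q = {((i, j), a): 0.0 for i in range(N) for j in range(N) for a in ACTIONS}

def step(s, a):
    """Apply action a in state s; moves are clamped to the grid."""
    i, j = s
    di, dj = ACTIONS[a]
    return (min(max(i + di, 0), N - 1), min(max(j + dj, 0), N - 1))

alpha, gamma, eps = 0.5, 0.9, 0.1
for episode in range(2000):
    s = (N - 1, 0)
    while s != GOAL:
        # epsilon-greedy action selection: mostly greedy, sometimes random
        a = (random.choice(list(ACTIONS)) if random.random() < eps
             else max(ACTIONS, key=lambda act: Q[s, act]))
        ns, r = step(s, a), -1.0  # -1 reward per step taken
        # Bellman backup: feedback propagates to previously visited states;
        # Q at the terminal state is never updated, so it stays 0.
        best_next = max(Q[ns, an] for an in ACTIONS)
        Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])
        s = ns

# Greedy policy at the start state after learning
s0 = (N - 1, 0)
print(max(ACTIONS, key=lambda act: Q[s0, act]))
```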
Playing Games: Deep Reinforcement Learning

• Stable algorithm for training a NN to approximate Q
• Sample past transitions: experience replay
  • Breaks correlations in the data
  • Learns from all past policies
• "Frozen" target Q-network to avoid oscillations
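In its simplest form, experience replay is just a bounded memory sampled uniformly at random, which is what breaks the temporal correlations. A sketch (capacity and batch size are arbitrary choices):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity memory of past transitions {s, a, s', r}."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)  # oldest transitions fall out

    def store(self, s, a, s_next, r, done):
        self.memory.append((s, a, s_next, r, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks correlations between consecutive
        # transitions and mixes data gathered under all past policies.
        return random.sample(self.memory, batch_size)
```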
Acting, at each iteration:
• the agent is in a state s
• select action a:
  • greedy: based on max_a Q(s, a, w)
  • explore: random
• observe the new state s' and reward r
• store the tuple {s, a, s', r} in memory

Learning, at each iteration:
• sample a tuple {s, a, s', r} (or a batch)
• update w with respect to the target computed with the old weights:

$$\frac{\partial}{\partial w} \left( r + \gamma \max_{a'} Q(s', a', w^-) - Q(s, a, w) \right)^2$$

• periodically update the fixed weights: w⁻ ← w
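A sketch of this acting/learning loop with a linear Q-function standing in for the deep network, to keep it short; w_minus is the periodically refreshed "frozen" target copy. The feature dimensions, learning rate, and refresh schedule are assumptions, not the paper's setup.

```python
import numpy as np

n_features, n_actions = 8, 4
w = np.zeros((n_actions, n_features))  # online weights
w_minus = w.copy()                     # frozen target weights
gamma, lr, eps = 0.99, 1e-3, 0.1

def q_values(s, weights):
    """Linear stand-in for the Q-network: Q(s, a) = w_a . s."""
    return weights @ s

def act(s):
    # epsilon-greedy: explore at random, otherwise greedy on Q(s, a, w)
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(q_values(s, w)))

def learn(batch):
    """One gradient step towards the frozen target per sampled transition."""
    for s, a, s_next, r, done in batch:
        target = r if done else r + gamma * np.max(q_values(s_next, w_minus))
        td_error = target - q_values(s, w)[a]
        w[a] += lr * td_error * s  # gradient step on the squared TD error

# Periodically (e.g. every few thousand steps): w_minus = w.copy()
```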
V. Mnih et al., "Human-level control through deep reinforcement learning." Nature (2015)

Go!
Google's AI AlphaGo beat 18-time world champion Lee Sedol in 2016, winning 4 of the 5 games in the match.
BBC (May 23, 2017): "Google's AI takes on champion in fresh Go challenge"
• … AlphaGo has won the first of three matches it is playing against the world's number one Go player, Ke Jie
• … Ke Jie said:
  • "There were some unexpected moves and I was deeply impressed."
  • "I was quite shocked as there was a move that would never happen in a human-to-human Go match."

American Go Association: There are many more possible Go games (10 followed by more than 300 zeroes) than there are subatomic particles in the known universe.

Reinforcement Learning: Actions & States
Actions: turn and modulate velocity by controlling body deformation
• Increase curvature
• Decrease curvature
States:
• Orientation relative to the leader: Δx, Δy, θ
• Time since the previous tail beat: Δt
• Current shape of the body (manoeuvre)
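One plausible encoding of this state/action interface, purely for illustration (the field names, units, and the extra "no change" action are assumptions, not the actual implementation):

```python
from dataclasses import dataclass

@dataclass
class SwimmerState:
    dx: float       # streamwise offset from the leader (body lengths)
    dy: float       # lateral offset from the leader (body lengths)
    theta: float    # orientation relative to the leader (radians)
    dt_beat: float  # time since the previous tail beat (tail-beat periods)
    shape_id: int   # index of the current body-deformation manoeuvre

# Discrete actions modulating the body-curvature command
ACTIONS = ("increase_curvature", "decrease_curvature", "no_change")
```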
Reinforcement Learning: Reward

Goal #1: learn to stay behind the leader
• Reward: based on vertical displacement (R < 0)