Machine Learning for Synchronized Swimming

Machine Learning for Synchronised Swimming
Siddhartha Verma, Guido Novati, Petros Koumoutsakos
FSIM 2017, May 25, 2017
CSElab Computational Science & Engineering Laboratory
http://www.cse-lab.ethz.ch

Motivation
• Why do fish swim in groups?
  • Social behaviour? Anti-predator benefits, better foraging and reproductive opportunities, better problem-solving capability
  • A propulsive advantage? (image credit: Artbeats)
• Classical works on schooling (Breder 1965; Weihs 1973, 1975; …) are based on simplified inviscid models, e.g. point vortices
• More recent simulations consider pre-assigned, fixed formations (Hemelrijk et al. 2015; Daghooghi & Borazjani 2015)
• But actual formations evolve dynamically (Breder 1965; Weihs 1973; Shaw 1978)
• Essential question: is it possible to swim autonomously in a way that is energetically favourable?

Vortices in the wake
• Depending on the Reynolds number and the swimming kinematics, a double row of vortex rings forms in the wake
• Our goal: exploit the flow generated by these vortices

Numerical methods
• Remeshed vortex methods (2D): solve the vorticity form of the incompressible Navier-Stokes equations,
  ∂ω/∂t + (u·∇)ω = (ω·∇)u + ν∇²ω + λ∇×(χ(u_s − u)),
  with an advection term, a stretching term (ω·∇)u that vanishes in 2D, a diffusion term, and a penalization term (χ is the characteristic function of the solid and u_s its velocity)
• Brinkman penalization accounts for the fluid-solid interaction (Angot et al., Numerische Mathematik, 1999)
• 2D: wavelet-based adaptive grid, cost-effective compared to uniform grids (Rossinelli et al., J. Comput. Phys., 2015)
• 3D: uniform-grid finite-volume solver (Rossinelli et al., SC'13 Proc. Int. Conf. High Perf. Comput., Denver, Colorado)

Interacting swimmers
• A simple model of fish schooling: a leader and a follower, at Re = L²/(Tν) = 5000 (L: body length, T: tail-beat period)
• Swimming in a wake can be beneficial or detrimental:
  • initial tail-to-head distance = 1 L: the trailing fish's head intercepts positive vorticity and its velocity increases
  • initial tail-to-head distance = 1.25 L: the trailing fish's head intercepts negative vorticity and its velocity decreases

The need for control
• Without control, the trailing fish gets kicked out of the leader's wake
• Manoeuvring through an unpredictable flow field requires:
  • the ability to observe the environment
  • knowledge about how to react appropriately
• The swimmers need to learn how to interact with the environment!

Reinforcement Learning: The basics
• An agent learns the best action through trial-and-error interaction with its environment
• Actions have long-term consequences
• Reward (feedback) is delayed
• Goal: maximize the cumulative future reward
• Specify what to do, not how to do it
• Notes:
  • The learning rate α plays a role analogous to the relaxation factor in SOR: U(s) ← (1 − α) U(s) + α (R(s) + γ U(s'))
  • The discount factor γ determines how far into the future rewards are taken into account; it models the fact that future rewards are worth less than immediate ones. A factor of 0 makes the agent "myopic" (it considers only current rewards), while a factor approaching 1 makes it strive for long-term reward. If the discount factor meets or exceeds 1 the action values may diverge: for γ = 1, without a terminal state (or if the agent never reaches one), all environment histories are infinitely long and the undiscounted, additive utilities are generally infinite. Even with a discount factor only slightly below 1, Q-function learning can propagate errors and become unstable when the value function is approximated with a neural network; starting with a lower discount factor and increasing it towards its final value accelerates learning.

Example: Maze solving
• State: the agent's position (A); Actions: go up, down, left or right; Reward: −1 per step taken, 0 at the terminal state
• The agent receives feedback, and the expected reward is backpropagated to previously visited states
• Now we have a policy
• (Figure: a grid maze with the agent at A and the learned expected return in each cell, ranging from about −15 far from the goal to 0 at the terminal state)
• Q^π(s_t, a_t) = E[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … | a_k = π(s_k) ∀ k > t ]
• Q^π(s_t, a_t) = E[ r_{t+1} + γ Q^π(s_{t+1}, π(s_{t+1})) ]   (Bellman, 1957)
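As an illustration of the maze example, here is a minimal tabular Q-learning sketch. It is not taken from the talk: the grid size, start and goal cells, and the hyper-parameters α, γ and ε are illustrative assumptions; only the reward structure (−1 per step, 0 at the terminal state) follows the slide.

import random
from collections import defaultdict

# Tabular Q-learning on a small grid maze (illustrative sketch).
# Reward: -1 per step, 0 on reaching the goal; an episode ends at the goal.
ROWS, COLS = 4, 5
GOAL = (0, 4)                      # assumed goal cell
ACTIONS = ['U', 'D', 'L', 'R']     # up, down, left, right
MOVES = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}

alpha, gamma, eps = 0.5, 0.9, 0.1  # learning rate, discount factor, exploration rate

def step(state, action):
    """Apply a move, staying inside the grid; return (next_state, reward, done)."""
    r, c = state
    dr, dc = MOVES[action]
    nr = min(max(r + dr, 0), ROWS - 1)
    nc = min(max(c + dc, 0), COLS - 1)
    next_state = (nr, nc)
    if next_state == GOAL:
        return next_state, 0.0, True
    return next_state, -1.0, False

Q = defaultdict(float)             # Q[(state, action)], initialised to 0

for episode in range(2000):
    state = (ROWS - 1, 0)          # start in the bottom-left corner
    done = False
    while not done:
        # epsilon-greedy action selection
        if random.random() < eps:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# Greedy action learned for the start cell
print(max(ACTIONS, key=lambda a: Q[((ROWS - 1, 0), a)]))

After training, acting greedily with respect to Q in each cell reproduces the pattern sketched in the maze figure: an expected return of 0 at the goal and increasingly negative values further away.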
Playing Games: Deep Reinforcement Learning
• A stable algorithm for training a neural network towards Q
• Sample past transitions (experience replay):
  • breaks correlations in the data
  • learns from all past policies
• A "frozen" target Q-network avoids oscillations
• Acting, at each iteration:
  • the agent is in a state s
  • it selects an action a: greedy (based on max_a Q(s, a; w)) or exploratory (random)
  • it observes the new state s' and the reward r
  • it stores the tuple {s, a, s', r} in memory
• Learning, at each iteration:
  • sample a tuple {s, a, s', r} (or a batch) from memory
  • update w with respect to a target computed with the old weights: ∂/∂w ( r + γ max_{a'} Q(s', a'; w⁻) − Q(s, a; w) )²
  • periodically update the fixed weights: w⁻ ← w
• V. Mnih et al., "Human-level control through deep reinforcement learning", Nature (2015)
• (A minimal code sketch of this update appears at the end of the document, after the summary.)

Go!
• Google's AI AlphaGo beat the 18-time world champion Lee Sedol in 2016, winning 4 out of 5 games in the match
• BBC (May 23, 2017): "Google's AI takes on champion in fresh Go challenge"
  • AlphaGo has won the first of three matches it is playing against the world's number one Go player, Ke Jie
  • Ke Jie: "There were some unexpected moves and I was deeply impressed." "I was quite shocked as there was a move that would never happen in a human-to-human Go match."
• American Go Association: there are many more possible Go games (10 followed by more than 300 zeroes) than there are subatomic particles in the known universe

Reinforcement Learning: Actions & States
• Actions: turn and modulate velocity by controlling the body deformation (increase or decrease the curvature)
• States:
  • orientation relative to the leader: Δx, Δy, θ
  • time since the previous tail beat: Δt
  • current shape of the body (manoeuvre)

Reinforcement Learning: Reward
• Goal #1: learn to stay behind the leader. Reward based on the vertical displacement:
  R = 1 − |Δy| / (0.5 L),
  so R > 0 when |Δy| < 0.5 L and R < 0 otherwise
• Failure condition: R_end = −1 if the follower strays too far or collides with the leader
• Goal #2: learn to maximise the swimming efficiency. Reward based on the efficiency:
  R_η = P_thrust / (P_thrust + max(P_def, 0)) = T |u_CM| / ( T |u_CM| + max( ∫_∂Ω F(x)·u_def(x) dx, 0 ) ),
  where the thrust power is T |u_CM| and the deformation power is the surface integral of F·u_def over the body
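A direct transcription of these reward signals into code, as a sketch: the formulas are those on the slides, while the function and argument names are my own and the powers P_thrust and P_def are assumed to be provided by the flow solver.

# Sketch of the reward signals defined above (names are illustrative).

def reward_follow(dy: float, L: float) -> float:
    """Goal #1: stay behind the leader. R = 1 - |dy| / (0.5 * L)."""
    return 1.0 - abs(dy) / (0.5 * L)

def reward_efficiency(P_thrust: float, P_def: float) -> float:
    """Goal #2: maximise swimming efficiency. R_eta = P_thrust / (P_thrust + max(P_def, 0))."""
    return P_thrust / (P_thrust + max(P_def, 0.0))

def reward_terminal() -> float:
    """Failure (straying too far or colliding with the leader): R_end = -1."""
    return -1.0

For example, at |Δy| = 0.25 L the displacement reward is 1 − 0.25/0.5 = 0.5, and it crosses zero at |Δy| = 0.5 L.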
Efficiency-based smart fish (Aη)
• Dumb fish vs. smart fish: Aη stays right behind the leader
• Logically, unsteady disturbances should mean trouble, but the agent appears to have figured out how to use the disturbances to its advantage
• It synchronises the motion of its head with the lateral flow velocity generated by the wake vortices (a more detailed explanation follows)
• All the energetics metrics reflect an advantage: high efficiency, low CoT, lower P_def, higher P_thrust

What did they learn?
• (Figure: time histories of Δy/L and Δx/L for the two trained agents)
• Relative vertical displacement (Δy/L):
  • AΔy settles close to the centreline (Δy ≈ 0)
  • Aη, surprisingly, also settles at Δy ≈ 0
• Relative horizontal displacement (Δx/L):
  • AΔy does not seem particularly concerned about this
  • Aη tries to maintain Δx ≈ 2.2 L
• AΔy undergoes far more twists and turns than Aη:
  • it is desperately trying to avoid the penalty whenever Δy ≠ 0
  • it incurs a higher deformation power
• How does Aη's behaviour evolve with training? Compare the first 10,000 transitions with the last 10,000 transitions

Swimming performance
• Speed, efficiency, CoT (cost of transport) and power measurements are compared for 4 distinct swimmers: Aη, Sη, AΔy, SΔy
  (note: SΔy replays the actions performed by AΔy but swims alone; likewise Sη replays Aη)
• Aη: best efficiency and best CoT
• Aη and AΔy: comparable velocity
• SΔy: worst efficiency, with a CoT comparable to AΔy

Time to give Aη a PhD?
• The wake vortices lift up the wall vorticity along the swimmer's body
• The lifted vortices give rise to secondary vortices
• The secondary vortices create a high-speed region, and hence suction due to low pressure
• This combines with the body-deformation velocity to affect P_def

Energetics over a full tail-beat cycle
• Negative troughs are noticeable in Aη's mean P_def
• Aη also shows a higher P_thrust in the mid-body section
• For Sη, P_def rarely shows negative excursions in the absence of wake vortices

Reacting to an erratic leader
• Note: the reward allotted here has no connection to the relative displacement
• Two fish swimming together in Greece vs. two fish swimming together in the Swiss supercomputer
• Surprisingly, Aη reacts intelligently to unfamiliar conditions:
  • Aη never experienced deviations in the leader's behaviour during training
  • but analogous situations might have been encountered through the random actions taken during training
  • Aη can react appropriately to maximise the cumulative reward received

Let's try to implement Aη's strategy in 3D
• Identify the target coordinates as the maximum points of a velocity correlation
• Implement a PID controller that adjusts the follower's undulations to maintain the specified target position (a minimal sketch of such a controller follows the summary)

3D wake interactions
• The wake interactions benefit the follower: an 11.6% increase in efficiency and a 5.3% reduction in CoT
• The oncoming wake-vortex ring is intercepted and generates a new 'lifted-vortex' ring (LR), similar to the 2D case
• The new ring modulates the follower's efficiency as it proceeds down the body

Summary
• The agents learn how to exploit unsteady fluctuations in the velocity field to swim more efficiently
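The slides do not spell out the PID controller itself, so the following is a minimal sketch under assumed conventions: the tracking error is taken to be the lateral offset between the follower and the target coordinates, and the control output is interpreted as an extra body curvature superimposed on the baseline undulation. The gains, time step and clipping limit are illustrative assumptions.

# Minimal PID controller sketch (illustrative): steer the follower toward a target
# position by modulating an extra curvature added to its undulation.

class PID:
    def __init__(self, kp: float, ki: float, kd: float, dt: float):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error: float) -> float:
        """Return the control output for the current tracking error."""
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


controller = PID(kp=2.0, ki=0.1, kd=0.5, dt=0.01)   # assumed gains and time step

def curvature_command(y_follower: float, y_target: float, max_curv: float = 0.3) -> float:
    """Extra midline curvature commanded at this time step (clipped to +/- max_curv)."""
    error = y_target - y_follower
    cmd = controller.update(error)
    return max(-max_curv, min(max_curv, cmd))

In the talk, the controller plays the same role: it nudges the follower's undulation so that it holds the target position identified from the velocity correlation, mimicking the behaviour learned by the 2D agent.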

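Returning to the deep reinforcement learning update: the sketch below shows the experience-replay buffer, the frozen target weights w⁻, and a gradient step on ( r + γ max_{a'} Q(s', a'; w⁻) − Q(s, a; w) )². It is a generic illustration in the spirit of Mnih et al. (2015), not the network or hyper-parameters used in the talk; to keep the gradient explicit it uses a linear Q-function Q(s, a; w) = w[a]·s instead of a neural network, and all dimensions and constants are assumptions.

import random
import numpy as np

# DQN-style update with experience replay and a frozen target network,
# using a linear Q-function for clarity (illustrative sketch).

STATE_DIM, N_ACTIONS = 4, 2
GAMMA, LR = 0.9, 0.01
BATCH, TARGET_SYNC = 32, 200

w = np.zeros((N_ACTIONS, STATE_DIM))   # online weights
w_frozen = w.copy()                    # target weights w^-
replay = []                            # list of (s, a, r, s_next, done) tuples

def q_values(weights, s):
    return weights @ s                 # vector of Q(s, a) for all actions

def store(s, a, r, s_next, done, capacity=10_000):
    replay.append((s, a, r, s_next, done))
    if len(replay) > capacity:
        replay.pop(0)                  # drop the oldest transition

def learn(step):
    global w_frozen
    if len(replay) < BATCH:
        return
    for s, a, r, s_next, done in random.sample(replay, BATCH):
        # target computed with the frozen weights w^-
        target = r if done else r + GAMMA * np.max(q_values(w_frozen, s_next))
        td_error = target - q_values(w, s)[a]
        # gradient step on (target - Q(s, a; w))^2 for the linear parameterisation
        w[a] += LR * td_error * s
    if step % TARGET_SYNC == 0:
        w_frozen = w.copy()            # periodically update the fixed weights: w^- <- w

# Tiny usage example with random transitions standing in for the swimmer's states.
for t in range(1, 1001):
    s = np.random.randn(STATE_DIM)
    store(s, random.randrange(N_ACTIONS), np.random.rand(), np.random.randn(STATE_DIM), False)
    learn(t)

The two ingredients highlighted on the slide appear directly: sampling random past tuples breaks correlations in the data, and evaluating the target with w⁻ rather than w avoids chasing a moving target.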