Machine Learning for Synchronised Swimming
Siddhartha Verma, Guido Novati, Petros Koumoutsakos
FSIM 2017, May 25, 2017
CSElab, Computational Science & Engineering Laboratory
http://www.cse-lab.ethz.ch

Motivation
• Why do fish swim in groups?
  • Social behaviour?
    • Anti-predator benefits
    • Better foraging/reproductive opportunities
    • Better problem-solving capability
  • Propulsive advantage?
[Image credit: Artbeats]
• Classical works on schooling: Breder (1965), Weihs (1973,1975), …
  • Based on simplified inviscid models (e.g., point vortices)
• More recent simulations: consider pre-assigned and fixed formations
  Hemelrijk et al. (2015), Daghooghi & Borazjani (2015)
• But: actual formations evolve dynamically
• Essential question: Possible to swim autonomously & energetically favourable?
[Figure credits: Breder (1965); Weihs (1973), Shaw (1978); Hemelrijk et al. (2015)]

Vortices in the wake
• Depending on Re and swimming kinematics, double row of vortex rings in the wake
• Our goal: exploit the flow generated by these vortices

Numerical methods
• Remeshed vortex methods (2D)
• Solve vorticity form of incompressible Navier-Stokes
  $\frac{\partial \omega}{\partial t} + u \cdot \nabla \omega = \omega \cdot \nabla u + \nu \nabla^2 \omega + \lambda \nabla \times (\chi (u_s - u))$
  Terms: advection ($u \cdot \nabla \omega$), stretching ($\omega \cdot \nabla u$, 0 in 2D), diffusion ($\nu \nabla^2 \omega$), penalization ($\lambda \nabla \times (\chi (u_s - u))$)
• Brinkman penalization
  • Accounts for fluid-solid interaction
  Angot et al., Numerische Mathematik (1999)
• 2D: Wavelet-based adaptive grid
  • Cost-effective compared to uniform grids
  Rossinelli et al., J. Comput. Phys. (2015)
• 3D: Uniform grid Finite Volume solver
  Rossinelli et al., SC'13 Proc. Int. Conf. High Perf.
Comput., Denver, Colorado

Interacting swimmers
• Simple model of fish schooling: a leader and follower
  $Re = \frac{L^2/T}{\nu} = 5000$
• Initial tail-to-head distance = 1 L
  • Trailing fish's head intercepts positive vorticity: velocity increases
• Initial tail-to-head distance = 1.25 L
  • Trailing fish's head intercepts negative vorticity: velocity decreases
• Swimming in a wake can be beneficial or detrimental
[Figure labels: leading fish, trailing fish]

The need for control
• Without control, trailing fish gets kicked out of leader's wake
• Manoeuvring through unpredictable flow field requires:
  • ability to observe the environment
  • knowledge about reacting appropriately
• The swimmers need to learn how to interact with the environment!

Reinforcement Learning: The basics
• An agent learning the best action, through trial-and-error interaction with environment
• Actions have long term consequences
• Reward (feedback) is delayed
• Goal
  • Maximize cumulative future reward:

> Learning rate (alpha) is the equivalent of SOR: U(s) = (1-a) U(s) + a (R(s) + g U(s'))
> Discount factor (gamma) determines how far into the future we want to account for
Discount factor: This models the fact that future rewards are worth less than immediate rewards.
"The discount factor γ determines the importance of future rewards. A factor of 0 will make the agent "myopic" (or short-sighted) by only considering current rewards, while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge.
For γ = 1, without a terminal state, or if the agent never reaches one, all environment histories will be infinitely long, and utilities with additive, undiscounted rewards will generally be infinite. Even with a discount factor only slightly lower than 1, the Q-function learning leads to propagation of errors and instabilities when the value function is approximated with an artificial neural network. In that case, it is known that starting with a lower discount factor and increasing it towards its final value yields accelerated learning."

• Specify what to do, not how to do it

Example: Maze solving
• State: Agent's position (A)
• Actions: go U, D, L, R
• Reward: -1 per step taken, 0 at terminal state
• Backpropagation:
  • Agent receives feedback
  • Expected reward updated in previously visited states
  • Now we have a policy
[Figure: maze grid annotated with the expected reward of each visited cell]
  $Q^\pi(s_t, a_t) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid a_k = \pi(s_k)\ \forall k > t \right]$
  $= \mathbb{E}\left[ r_{t+1} + \gamma Q^\pi(s_{t+1}, \pi(s_{t+1})) \right]$   Bellman (1957)

Playing Games: Deep Reinforcement Learning
• Stable algorithm for training NN towards Q
• Sample past transitions: experience replay
  • Break correlations in data
  • Learn from all past policies
• "Frozen" target Q-network to avoid oscillations

Acting, at each iteration:
• agent is in a state s
• select action a:
  • greedy: based on max Q(s,a,w)
  • explore: random
• observe new state s' and reward r
• store in memory tuple { s, a, s', r }

Learning, at each iteration:
• sample tuple { s, a, s', r } (or batch)
• update wrt target with old weights:
  $\frac{\partial}{\partial w} \left( r + \gamma \max_{a'} Q(s', a', w^-) - Q(s, a, w) \right)^2$
• Periodically update fixed weights: $w^- \leftarrow w$

V. Mnih et al. "Human-level control through deep reinforcement learning." Nature (2015)

Go!
Google's AI AlphaGo beat 18-time world champion Lee Sedol in 2016.
AlphaGo won 4 out of 5 games in the match
BBC (May 23, 2017): "Google's AI takes on champion in fresh Go challenge"
• … AlphaGo has won the first of three matches it is playing against the world's number one Go player, Ke Jie
• … Ke Jie said:
  • "There were some unexpected moves and I was deeply impressed."
  • "I was quite shocked as there was a move that would never happen in a human-to-human Go match."
American Go Association: There are many more possible go games - 10 followed by more than 300 zeroes - than there are subatomic particles in the known universe.

Reinforcement Learning: Actions & States
Actions: Turn and modulate velocity by controlling body deformation
• Increase curvature
• Decrease curvature
States:
• Orientation relative to leader: Δx, Δy, θ
• Time since previous tail beat: Δt
• Current shape of the body (manoeuvre)

Reinforcement Learning: Reward
Goal #1: learn to stay behind the leader
Reward: vertical displacement
  $R_{\Delta y} = 1 - \frac{|\Delta y|}{0.5L}$
  so R > 0 when |Δy| < 0.5L and R < 0 when |Δy| > 0.5L
• Failure condition: $R_{end} = -1$
  • Stray too far or collide with leader

Goal #2: learn to maximise swimming-efficiency
Reward: efficiency
  $R_\eta = \frac{P_{thrust}}{P_{thrust} + \max(P_{def}, 0)} = \frac{T |u_{CM}|}{T |u_{CM}| + \max\left( \int_{\partial\Omega} F(x) \cdot u_{def}(x)\, dx,\ 0 \right)}$
  Thrust power: $T |u_{CM}|$
  Deformation power: $\int_{\partial\Omega} F(x) \cdot u_{def}(x)\, dx$

Efficiency based (Aη) smart fish
[Video panels: Dumb fish, Smart fish]
• Aη stays right behind the leader
• Logically, unsteady disturbances = trouble
• But the agent seems to have figured out how to use the disturbances to its advantage
  • It synchronises the motion of its head with the lateral flow-velocity generated by wake-vortices. (More detailed explanation to come)
• All energetics metrics reflect an advantage
  • High efficiency, low CoT, lower PDef, higher PThrust

What did they learn?
• Relative vertical displacement (Δy/L)
  • AΔy settles close to the center (Δy≈0)
  • Aη, surprisingly, also settles at Δy≈0
• Relative horizontal displacement (Δx/L)
  • AΔy does not seem extremely worried about this
  • Aη tries to maintain Δx=2.2L
• AΔy undergoes a lot more twists and turns than Aη
  • Desperately trying to avoid penalty when Δy≠0
  • Incurs higher deformation power
[Plots: Δy/L and Δx/L versus t, for t from 0 to 50]
• How does Aη's behaviour evolve with training?
  • First 10,000 transitions
  • Last 10,000 transitions

Swimming performance
• Comparing speed, efficiency, CoT, and power measurements for 4 distinct swimmers: Aη, Sη, AΔy, SΔy
Note: SΔy replays actions performed by AΔy, but swims alone (likewise for Sη + Aη)
• Aη - best efficiency and best CoT (Cost of Transport)
• Aη & AΔy - comparable velocity
• SΔy - worst efficiency, CoT comparable to AΔy

Time to give Aη a PhD?
• Wake-vortices lift up the wall-vorticity along the swimmer's body
• Lifted vortices give rise to secondary vortices
• Secondary vortices - high speed region => suction due to low pressure
• This combines with body deformation velocity to affect Pdef
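
Because the slides here turn on how Pdef enters the efficiency measure, it may help to see the two rewards defined earlier in the deck, R_Δy = 1 - |Δy|/(0.5L) and R_η = Pthrust/(Pthrust + max(Pdef, 0)), written out in Python. This is only an illustrative sketch, not the authors' implementation: the function names, the argument names, and the zero-denominator guard are assumptions introduced here.

```python
# Illustrative sketch of the two reward definitions from the slides.
# Function and argument names are hypothetical; they do not come from the talk.

R_END = -1.0  # failure reward from the slides (stray too far or collide with leader)


def reward_lateral(delta_y: float, body_length: float) -> float:
    """R_dy = 1 - |dy| / (0.5 L): positive near the leader's axis, negative far from it."""
    return 1.0 - abs(delta_y) / (0.5 * body_length)


def reward_efficiency(p_thrust: float, p_def: float) -> float:
    """R_eta = P_thrust / (P_thrust + max(P_def, 0)).

    Negative deformation power is clipped to zero, so it cannot push the
    reward above 1.
    """
    denominator = p_thrust + max(p_def, 0.0)
    if denominator <= 0.0:
        # Assumption made for this sketch only: the slides do not say how a
        # vanishing denominator is handled, so return 0 instead of dividing.
        return 0.0
    return p_thrust / denominator


if __name__ == "__main__":
    print(reward_lateral(0.1, 1.0))      # follower slightly off the leader's axis
    print(reward_efficiency(1.0, 1.0))   # equal thrust and deformation power
    print(reward_efficiency(1.0, -0.3))  # negative P_def is clipped
```

With the clipping in place, a negative excursion of Pdef leaves R_η at its maximum of 1 rather than raising it further.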
Energetics over a full tail-beat cycle
• Aη:
  • Negative troughs noticeable in mean PDef
  • Also, higher PThrust in mid-body section
• Sη:
  • PDef rarely shows negative excursion in the absence of wake-vortices

Reacting to an erratic leader
Note: Reward allotted here has no connection to relative displacement
[Videos: Two fish swimming together in Greece; Two fish swimming together in the Swiss supercomputer]

Surprisingly, Aη reacts intelligently to unfamiliar conditions
• Aη never experienced deviations in the leader's behaviour during training
• But analogous situations might have been encountered: random actions during training
• Aη can react appropriately to maximise the cumulative reward received

Let's try to implement Aη's strategy in 3D
• Identify target coordinates as the maximum points in the following velocity correlation:
• Implement PID controller that adjusts a follower's undulations to maintain the target position specified

3D Wake interactions
• Wake-interactions benefit the follower
  • 11.6% increase in efficiency
  • 5.3% reduction in CoT
• Oncoming wake-vortex ring intercepted
  • Generates a new 'lifted-vortex' ring (LR)
  • Similar to the 2D case
• New ring modulates the follower's efficiency, as it proceeds down the body
[Plot: efficiency η versus t; lifted-vortex rings labelled LR]

Summary
• Agents learn how to exploit unsteady fluctuations in the velocity field to