

Article

Study on Reinforcement Learning-Based Guidance Law

Daseon Hong, Minjeong Kim and Sungsu Park *

Department of Aerospace Engineering, Sejong University, Seoul 05006, Korea; [email protected] (D.H.); [email protected] (M.K.) * Correspondence: [email protected]; Tel.: +82-2-3408-3769

Received: 18 August 2020; Accepted: 17 September 2020; Published: 20 September 2020

Abstract: Reinforcement learning is generating considerable interest as a means of building guidance laws and solving optimization problems that were previously difficult to solve. Since reinforcement learning-based guidance laws often show better robustness than previously optimized algorithms, several studies have been carried out on the subject. This paper presents a new approach to training a guidance law by reinforcement learning and introduces some of its notable characteristics. The resulting missile guidance law shows better robustness to the controller model than proportional navigation guidance. Unlike earlier studies, the neural network in this paper takes the same inputs as proportional navigation guidance, which makes the comparison fair. The proposed guidance law is compared with proportional navigation guidance, which is widely known as a quasi-optimal missile guidance law. Our work aims to find an effective training method for missile guidance through reinforcement learning and to quantify how much better the new method is. With the derived policy, we also examine which law is better and in which circumstances. A novel training methodology is proposed first, and performance comparison results follow.

Keywords: missile guidance law; robustness; linear system; reinforcement learning; control model tolerance

1. Introduction

Proportional navigation guidance (PNG) is known as a quasi-optimal interception guidance law. A simple version of the concept can even be found in nature. The authors of [1] show that the pursuit trajectories of predatory flies can be explained by PNG, in which the rotation rate of the pursuer is modeled as the product of a specific constant and the line-of-sight (LOS) rate. In [2], the authors likewise describe the terminal attack trajectories of peregrine falcons with PNG. The head of a falcon acts like the gimbal of a missile seeker, guiding its body much as a modern missile is guided. This suggests that generations of training and evolution have brought these creatures to fly with quasi-optimal guidance, and it gives us an insight that such natural phenomena and engineered mechanisms can be viewed within the same framework. Reinforcement learning (RL)-based training follows a similar process: each generation of an animal, like an RL agent, acts stochastically, has some mutational elements, follows whatever earns a better reward, and aims for the optimum. Meanwhile, several studies on RL-based missile guidance law have been carried out. Gaudet [3] suspected that an RL-based guidance law may make the guidance logic itself more robust. He proposed an RL framework able to build a guidance law for the homing phase, training a single-layer perceptron (SLP) by a stochastic search method. An RL-based missile guidance law works surprisingly well in a noisy, stochastic world, possibly because the algorithm itself is based on probability. He took PNG and Enhanced PNG as comparison targets and showed that the RL-generated guidance law has

better performance in terms of energy consumption and miss distance than the comparison targets in the stochastic environment. In another paper [4], he proposed a framework for designing a three-dimensional RL-based guidance law for an interceptor trained via the Proximal Policy Optimization (PPO) [5] algorithm, and showed by comparative analysis that the derived guidance law also has better performance and efficiency than the augmented zero-effort-miss policy [6]. However, previous research on RL-based missile guidance law compared policies whose input states were not identical, so the comparison is deemed unfair. Optimization of an RL policy depends on the reward, which is effective when the reward can indicate which policy and which single action is better than the others. The missile guidance environment we are dealing with, however, makes it hard for RL agents to acquire an appropriate reward. For this kind of environment, called a sparse reward environment, some special techniques have been introduced. The simplest way to deal with a sparse reward environment is to give the agent a shaped reward. While general rewards are formulated with a binary state, e.g., success or not, a shaped reward embeds information about the goal, e.g., it is formulated as a function of the goal. The authors of [7] proposed an effective reward formulation methodology that addresses the optimization delay caused by sparse rewards. A sparse reward can hardly show what is better, so it weakens the validity of the reward signal and slows training. Hindsight Experience Replay (HER) is a replay technique that turns a failed experience into an accomplished one. DDPG [8] combined with this technique shows faster and more stable training convergence than the original DDPG training method. HER has improved learning performance with binary rewards by mimicking the way humans learn not only from accomplishments but also from failures. This paper presents a novel training framework for a missile guidance law and compares the trained guidance law with PNG. Additionally, a comparison in an environment with a control system applied will be shown, since the overall performance and efficiency of a missile are affected not only by its guidance logic but by the dynamic system as well. The control system that turns a guidance command into real actuation always has a delay. In missile guidance, such a system can make the missile miss the target and consume much more energy than expected. There is existing research on making missile guidance robust to uncertainties and diverse parameters. In [9], Gurfil presents a guidance law that works well even in an environment with missile parametric uncertainties. He focused on building a universal missile guidance law that covers different systems, i.e., the purpose is to make missiles achieve zero-miss-distance guidance even under various flight conditions, and the result assures zero miss distance under uncertainties. Additionally, due to the simplicity of the controller that the guidance law needs, the proposed guidance law is simple to implement. This paper first defines the environments and engagement scenarios in the problem formulation; overviews of PNG and the training method follow. Finally, simulation results and conclusions are presented.

2. Problem Formulation

2.1. Equations of Motion

The most important region for a homing missile is the inside of the threshold that switches the INS (Inertial Navigation System)-based guidance phase into the homing phase, which is the inner space of the gradation-filled boundary in Figure 1. φ0 is the initial approaching angle, and the initial angular conditions for the simulations are set by using a randomly generated approaching angle as a constraint condition, as follows:

λ0 = φ0 + π, if φ0 < π;  λ0 = φ0 − π, if φ0 ≥ π    (1)

ψ0 = λ0 + L0    (2)

Figure 1. Geometry of a planar engagement scenario.

The heading of the missile is assumed to be parallel to the velocity vector, and the acceleration axis is attached perpendicular to the missile velocity vector. It is assumed that there is no aerodynamic effect nor dynamic coupling. That is, the mass point dynamic model will be applied. The equations of motion are as follows:

R = |R⃗c|,  Vm = |V⃗m|    (3)

L = λ − ψ = arctan( det[V⃗m : R⃗c] / (V⃗m · R⃗c) )    (4)

ψ̇ = am / Vm    (5)

Ω = (Vm / R) sin(λ − ψ)    (6)

where R is the relative distance from the missile to the target, λ is the line-of-sight (LOS) angle, ψ is the heading of the missile, L is the look angle, Vm is the speed of the missile, and am is the acceleration applied to the missile. The acceleration of the missile is assumed to have a reasonable limitation of −10 g to 10 g, where g is the gravitational acceleration. The symbol ‘:’ between two column vectors indicates a concatenation that joins the two vectors into a square matrix.

The Zero-Effort-Miss (ZEM) is the expected miss distance if there is no further maneuver from the current location [10]. This concept will be used to interpolate the minimum miss distance at the terminal state to compensate for the discreteness of the simulation. The ZEM* is the ZEM when the last maneuver command is applied, and we define it as shown in Figure 2.

Figure 2. Diagram of Zero-Effort-Miss (ZEM)* definition.

R⃗d = R⃗f−1 − R⃗f    (7)

ZEM* = |R⃗d| sin( arctan( det[R⃗f−1 : R⃗d] / (R⃗f−1 · R⃗d) ) )    (8)

where R⃗f and R⃗f−1 are R⃗c at the final state and at the immediately preceding state, respectively.
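For concreteness, the following is a minimal numerical sketch of Equations (3)–(8). The function names, the explicit Euler step, the value g = 9.81 m/s², and the use of arctan2 for quadrant safety are our own assumptions, not the authors' simulation code.

```python
import numpy as np

def los_quantities(r_c, v_m):
    """Equations (3), (4) and (6): range, speed, look angle and LOS rate."""
    R = np.linalg.norm(r_c)                               # Equation (3)
    Vm = np.linalg.norm(v_m)
    det = np.linalg.det(np.column_stack((v_m, r_c)))      # det[V_m : R_c]
    L = np.arctan2(det, np.dot(v_m, r_c))                 # Equation (4)
    lam = np.arctan2(r_c[1], r_c[0])                      # LOS angle
    psi = np.arctan2(v_m[1], v_m[0])                      # missile heading
    Omega = (Vm / R) * np.sin(lam - psi)                  # Equation (6)
    return R, Vm, L, Omega

def kinematics_step(pos_m, psi, Vm, a_m, dt, g=9.81):
    """Mass-point update using Equation (5) and the +/-10 g command limit."""
    a_m = float(np.clip(a_m, -10.0 * g, 10.0 * g))
    psi = psi + (a_m / Vm) * dt                           # Equation (5)
    pos_m = pos_m + Vm * dt * np.array([np.cos(psi), np.sin(psi)])
    return pos_m, psi

def zem_star(r_f_prev, r_f):
    """Equations (7) and (8): interpolated miss distance at the final step."""
    r_d = r_f_prev - r_f                                  # Equation (7)
    det = np.linalg.det(np.column_stack((r_f_prev, r_d)))
    ang = np.arctan(det / np.dot(r_f_prev, r_d))
    return abs(np.linalg.norm(r_d) * np.sin(ang))         # Equation (8)
```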

2.2. Engagement Scenario

This paper works with an enemy target ship and a friendly anti-ship missile. A scenario of two-dimensional planar engagement will be discussed. The guidance phase of the scenario is switched by the distance between the target and the missile. The missile launches at an arbitrary spot far away from the target and is guided by Inertial Navigation System (INS)-aided guidance until it crosses the switching threshold. When the missile approaches closer than a specific threshold (16,500 m in this scenario), it switches its guidance phase into a seeker-based homing phase. The conceptual scheme of phase switching is depicted in Figure 3. L0 is the look angle at the threshold.

Figure 3. Planar engagement scenario.

The reason the missile needs its phase switched is that INS-aided guidance is not accurate or precise enough; it serves to place the missile at a position that makes homing guidance available. Thus, the zero-effort-miss (ZEM) can be quite large at the point on the switching threshold, so a wide range of initial conditions for the homing phase is needed. Additionally, the generation criteria for the initial conditions are described in Table 1.

Table 1. Initial conditions of the homing phase.

Rthreshold (m)    L0 (rad)    φ0 (rad)    Vm (m/s)
16,500            [−π, π]     [0, 2π]     [200, 600]

In practice, there is a delay that disturbs the derived action command and prevents the action from occurring immediately. Eventually, there is a gap between the guidance law and the real actuation; the missile consumes more energy and has less accuracy. The scenario must contain this type of hindrance, and simulations for it should be implemented. The details are discussed in Section 2.3.
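A minimal sketch of how one homing-phase initial condition could be drawn from Table 1 and mapped through Equations (1) and (2). The function name and the uniform sampling routine are assumptions for illustration only.

```python
import numpy as np

def sample_initial_condition(R_threshold=16_500.0, rng=None):
    """Draw one homing-phase initial condition per Table 1 and Eqs. (1)-(2)."""
    rng = rng or np.random.default_rng()
    L0 = rng.uniform(-np.pi, np.pi)          # initial look angle [rad]
    phi0 = rng.uniform(0.0, 2.0 * np.pi)     # approaching angle [rad]
    Vm = rng.uniform(200.0, 600.0)           # missile speed [m/s]
    lam0 = phi0 + np.pi if phi0 < np.pi else phi0 - np.pi   # Equation (1)
    psi0 = lam0 + L0                         # Equation (2)
    return R_threshold, lam0, psi0, L0, phi0, Vm
```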

2.3. Controller Model

The controller model can be described as a differential equation of a 1st order system. We simply define it as follows:

τ ȧm + am = ac    (9)

where τ is the time constant, am is the derived actuation, and ac is the acceleration command.
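Equation (9) is a standard first-order lag. A minimal discrete-time sketch is given below; the explicit Euler scheme and the treatment of τ = 0 as an ideal (bypassed) controller are our assumptions, since the paper does not state its integration method.

```python
def controller_step(a_m, a_c, tau, dt):
    """One Euler step of tau*da_m/dt + a_m = a_c (Equation (9)).

    tau = 0 is treated as an ideal (bypassed) controller."""
    if tau == 0.0:
        return a_c
    return a_m + (a_c - a_m) * dt / tau
```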


3. Proportional Navigation Guidance Law

Proportional navigation guidance (PNG) is the most widely used guidance law on homing missiles with an active seeker [2,11,12]. PNG generates an acceleration command by multiplying the LOS rate and velocity with a predesigned gain. The mechanism reduces the look angle and guides the missile to hit the target. The acceleration command by PNG in the planar engagement scenario is as follows:

ac = N Vm Ω    (10)

where N is the navigation constant (a design factor), Vm is the speed of the missile, and Ω is the LOS rate. Figure 4 shows the conceptual diagram of PNG.

Figure 4. Conceptual diagram of proportional navigation guidance.

Meanwhile, Pure PNG, one of the branches of PNG, is adopted in this paper. Pure PNG is used for a missile to hit a relatively slow target such as a ship. The acceleration axis of Pure PNG is attached perpendicular to the missile velocity vector, as shown in Figure 1. Additionally, the velocity term in Equation (10) of Pure PNG is the inertial velocity of the missile. PNG is very useful, and it is basically even optimal for missile guidance. Technically, creating a missile guidance law that outperforms PNG, despite sharing the same engagement scenario that PNG is specialized for, is not possible under ideal circumstances. However, PNG is not optimized for practice. Thus, we will examine whether the RL-based law is able to exceed the performance of PNG when the simulation environment becomes more practical, i.e., when a controller model is applied. Of course, we did not apply the practical model during training but applied it in the simulation part. The overall structure of the PNG missile simulation is shown in Figure 5.
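For reference, Equation (10) with the command limit amounts to a one-line rule. The sketch below is our own; the placement of the clipping after the PNG gain is an assumption.

```python
import numpy as np

def png_command(N, Vm, Omega, g=9.81):
    """Pure PNG acceleration command, Equation (10), clipped to +/-10 g."""
    return float(np.clip(N * Vm * Omega, -10.0 * g, 10.0 * g))
```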

Figure 5. Missile simulation schematic diagram with proportional navigation guidance.

4. Reinforcement Learning

Prior research on RL-based missile guidance was studied under a very limited environment and did not show whether a single network can cover a wide environment and initial conditions that push the seeker to its hardware limit. Additionally, the comparison had been implemented under an unfair condition for the comparison target, in which the compared laws do not take identical states. In this paper, it is demonstrated that the RL-based guidance is able to cover a wide environment under plausibly wide initial conditions and with various system models. In this section, a novel structure and methodology for training are proposed. We adopted a Deep Deterministic Policy Gradient (DDPG)-based algorithm [8]. DDPG is a model-free reinforcement learning algorithm that generates deterministic actions in continuous space, which makes off-policy training available. The NN we used was optimized via gradient descent optimization. Since the environment we are dealing with is vast and can be described as a sparse reward environment, neither conventional binary rewards nor even a simple shaped reward really help to accelerate training. Thus, the algorithm we used in this paper was slightly modified by making every step of an episode receive the final score of the episode, computed from the reward equations, as its reward. This helps the agent to effectively consider the reward of the far future.
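A minimal sketch of the modification described above: every transition of an episode is written to the replay buffer with the episode's single final integrated score instead of a per-step reward. The environment and buffer interfaces are hypothetical, not the authors' code.

```python
def run_episode_and_store(env, actor, replay_buffer, episode_score_fn):
    """Collect one episode, then store it with a single Monte-Carlo-style score."""
    transitions, state, done, info = [], env.reset(), False, None
    while not done:
        action = actor(state)
        next_state, done, info = env.step(action)          # assumed API
        transitions.append((state, action, next_state, done))
        state = next_state
    r_total = episode_score_fn(info)        # Equations (13)-(15) on episode statistics
    for (s, a, s_next, d) in transitions:   # same reward assigned to every step
        replay_buffer.add(s, a, r_total, s_next, d)
```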

Figure 6 is an overview of the structure we designed. The agent feeds Ω and Vm as input and stores a transition dataset, which consists of the state, action, reward, and next state, into a replay buffer.


Figure 6. Structure of the training algorithm.

The developed policy gradient algorithm using an artificial neural network (ANN) needs two separate ANNs: the actor, which estimates the action, and the critic, which estimates the value. Figure 7 shows each structure. DDPG generally calculates its critic loss from the output of the critic network and the reward of a randomly sampled minibatch, and updates the critic and the actor every single step. Q is the action value, and tanh denotes the hyperbolic tangent activation function.
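A sketch of the two networks along the lines of Figure 7. The description after the figure (three hidden layers, tanh activation, ±10 g output clipping, Huber critic loss) is followed where stated; the hidden width, the state/action dimensions, and the use of PyTorch are our assumptions.

```python
import torch
import torch.nn as nn

G = 9.81
HIDDEN = 64   # hidden width is an assumption; the paper does not specify it

class Actor(nn.Module):
    """Three hidden layers; tanh output scaled to the +/-10 g command limit."""
    def __init__(self, state_dim=2, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, action_dim), nn.Tanh())
    def forward(self, s):
        return 10.0 * G * self.net(s)          # clipping step to +/-10 g

class Critic(nn.Module):
    """Q(s, a) estimator with three hidden layers."""
    def __init__(self, state_dim=2, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

# Critic loss uses the Huber function of Equation (11)
huber = nn.SmoothL1Loss()   # equivalent to Equation (11) with delta = 1
```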

Figure 7. Actor-network and critic-network.

Both consist of three hidden layers, and the actor-network has a ±10 g clipping step, as we set. We used the Huber function [13] as a loss function, which shows high robustness to outliers and excellent optimization performance. The equation is as follows:

Lδ(y, f(x)) = (1/n) Σi=1..n { (1/2)(yi − f(xi))²,  if |yi − f(xi)| ≤ δ;  δ|yi − f(xi)| − (1/2)δ²,  otherwise }    (11)

The training proceeds with a designed missile environment. The environment terminates an episode if it encounters one or more of the following termination conditions—Out of Range condition: target information is unavailable because the target is outside the available field of view of the missile seeker; Time Out condition: time exceeds the maximum timestep, which is 400 s. There is also a Hit condition (if R < 5), which judges that the missile hit the target. However, the last condition is not for termination but exists as observable data that decides whether the policy is proper for the mission or not. As an episode ends, each of the transition datasets gets the corresponding reward, which is the final integrated reward of the episode. We estimated the closest distance between the missile and the target to interpolate the closest approach in a discrete environment as follows:

Rimpact = { ZEM*,  if Hit;  |R⃗f|,  otherwise }    (12)

where Hit means the Hit condition is true, R⃗f is R⃗c at the final state, and ZEM* is defined in Equation (8). The reward is designed with the concept that its maximization minimizes the closest distance between the missile and the target and aims to save as much energy as possible. To achieve that, two branches of reward term should be considered. One of them is a range reward term. The range reward term should derive its value from how closely the missile approaches the target. We used the negative log-scaled Rimpact² (Rimpact being the impact distance error of an episode) as a reward so that the reward overcomes the following problem. A simple linear-scale reward may cause inconsistency and high variance of the reward. While early training is proceeding in the environment, the expected bounds of Rimpact² are from 0 to 16,500². This colossal scale discourages the missile from approaching extremely close to the target, since the reward at the terminal stage only slightly changes the variance of the reward even when there are distinguishably good and bad actions. Besides, the log-scaled Rimpact² helps the missile get extremely close to the target by amplifying small values of Rimpact² at impact stages. We express the range reward as rR as follows:

rR = −log(Rimpact² + ε)    (13)

where ε (a tolerance term so that the logarithm is defined) is set to 10⁻⁸. The second branch is an energy reward term. It represents the energy consumption in an episode. The energy term should distinguish what is better regardless of the initial condition. Therefore, it must contain influences from the flight range and the initial look angle. The energy reward term also has a high variance between the early stage and the terminal stage. In the earlier stage, the missile is prone to terminate its episode early because the Out of Range termination condition can be satisfied early when the policy is not well established. In other words, the magnitude of the energy reward term can be too small. Thus, a log-scaled reward was also applied to the energy reward term as follows:

rE = −ln( (∫0^tf am² dt) / (Vm · tf · √(|L0| + 0.5)) )    (14)

where am is the acceleration and tf is the termination time. Therefore, the total reward is a weighted sum of (13) and (14), as follows:

rtotal = ω1 (rR − µrR)/σrR + ω2 (rE − µrE)/σrE    (15)

where ω1 and ω2 are weights for each reward that satisfy ω1 + ω2 = 1, µ and σ are the means and standard deviations used for normalization, respectively, and the subscripts denote the corresponding reward. Table 2 shows the termination conditions and corresponding rewards.

Table 2. Definition of conditions and transition rewards under corresponding conditions.

Condition Name    Conditions       Rewards
Out of range      if |L| ≥ π/2     rtotal = ω1 (rR − µrR)/σrR + ω2 (rE − µrE)/σrE
Time out          if t ≥ tmax      rtotal = −1

One case that satisfies the Time Out condition while the Out of Range condition is not satisfied is a maneuver in which the missile approaches the target indefinitely along a spiral trajectory, which made us aware that the magnitude of the cost was becoming too large. Thus, we considered such maneuvers as outliers, gave them the lowest score, and stored them in the replay buffer. The overall structure of the RL missile simulation is shown in Figure 8.
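A minimal sketch of the episode-level reward of Equations (13)–(15) and Table 2. The normalization statistics are assumed to be maintained elsewhere, and the weights ω1 = ω2 = 0.5 and the natural logarithm in Equation (13) are our assumptions, since the paper does not state them.

```python
import numpy as np

EPS = 1e-8   # tolerance term of Equation (13)

def episode_reward(R_impact, int_am2, t_f, L0, Vm, stats,
                   w1=0.5, w2=0.5, timed_out=False):
    """Episode-level reward per Equations (13)-(15) and Table 2 (w1 + w2 = 1)."""
    if timed_out:                                          # Time Out row of Table 2
        return -1.0
    r_R = -np.log(R_impact ** 2 + EPS)                     # Equation (13)
    r_E = -np.log(int_am2 / (Vm * t_f * np.sqrt(abs(L0) + 0.5)))   # Equation (14)
    r_R = (r_R - stats["mu_R"]) / stats["sigma_R"]         # normalization
    r_E = (r_E - stats["mu_E"]) / stats["sigma_E"]
    return w1 * r_R + w2 * r_E                             # Equation (15)
```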

Figure 8. Missile simulation schematic diagram with reinforcement learning (RL).

Our rewarding method makes the agent train faster than the general DDPG algorithm and makes the training stable. Clearly, our rewarding method works better for the sparse reward and vast environment of our missile, and the learning curves are shown in Figure 9.

Figure 9. Learning curves comparison for rewarding methods.

The solid lines represent the reward trend, and the color-filled areas represent local minima and maxima. Additionally, all data were plotted after applying a low-pass filter for an intuitive plot. After training for 20,000 episodes, we carried out a utility evaluation of each trained weight. Specifically, the training started to converge at about the 70th episode with the training structure we designed. Yet, to find the most optimized weight, we ran the test algorithm with each trained weight by feeding a dataset of 100 uniformly random predefined initial conditions. The dataset was set to range from 30,000 m to 50,000 m, as depicted in Figure 10, because we wanted a weight that covers universal conditions. The ZEM* was not considered in deciding the utility as long as it was less than 5 for all predefined initial conditions. In addition, the weight having the best utility was chosen as the one with the minimum average ∫0^tf am² dt over all predefined initial conditions. In Figure 10, the vectors and green dots represent the initial headings and the launching positions, respectively.
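A sketch of the weight-selection rule just described: among checkpoints whose ZEM* stays below 5 for all 100 predefined initial conditions, pick the one with the minimum average ∫ am² dt. The evaluation callback is hypothetical.

```python
def select_best_weight(checkpoints, evaluate):
    """evaluate(w) -> (list of ZEM*, list of integral of a_m^2 dt) over the
    100 predefined initial conditions of the utility-evaluation dataset."""
    best_w, best_energy = None, float("inf")
    for w in checkpoints:
        zem_list, energy_list = evaluate(w)
        if max(zem_list) >= 5.0:              # must satisfy ZEM* < 5 for all cases
            continue
        avg_energy = sum(energy_list) / len(energy_list)
        if avg_energy < best_energy:
            best_w, best_energy = w, avg_energy
    return best_w
```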

Figure 10. (a) Simulation dataset; (b) legends; (c) utility evaluation dataset.

The flow chart giving an overview of this study is shown in Figure 11.


Figure 11. Flow chart of this study.

5. Simulation

The simulation was implemented under conditions in which the acceleration command either bypasses or passes through the 1st order controller model. The test dataset was distinguished from the utility evaluation dataset and contains 100 uniformly random initial conditions that satisfy the initial conditions of the engagement scenario described in Table 1.

5.1. Bypassing Controller Model

We used the weight with the highest utility, obtained through the simulation with the predefined initial conditions. A comparison of PNG with various navigation constants and the RL-based guidance is as follows. Under the ideal condition, the RL-based guidance law in Table 3 consumes energy similar to that of PNG with a navigation constant of 3–4. Even though the training was implemented with a single range parameter of 16,500 m, the RL-based algorithm works well even if the missile launches far from the trained initial conditions.

Table 3. Energy consumption comparison under ideal state.

             PNG                                     RL-Based
N = 2        N = 3        N = 4        N = 5
3901.99      3142.81      3424.72      3881.95       3206.764


As shown in Table 3, the RL-based missile guidance law has a reasonable and appropriately optimized performance. It is intriguing because the training optimization resembles the algebraic quasi-optimum, just like the predatory flies and peregrine falcons of [1,2]. Additionally, the trajectories of both have similar aspects, as follows.

Figure 12 shows trajectories for all simulation datasets of RL and PNG. The target is at the coordinate (0, 0), and the starting point is the other end of each trajectory spline. The graph plotted on the right shows the trajectories at a magnified scale. As shown in Figure 12, every trajectory pierces the central area of the target and unquestionably satisfies 5 m accuracy. The simulation results here imply that the RL-based missile guidance law is able to replace PNG.


Figure 12. (a) Simulation trajectories; (b) trajectories near target.

5.2. Passing Controller Model

The simulation in this part is implemented with the acceleration command passing through the controller model. As mentioned in Section 2.3, the controller is simply modeled with a 1st order differential equation, i.e., Equation (9). The 1st order controller model is formulated with the time constant τ ∈ {0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.75}. We carried out a simulation with those predefined parameters and graphed the results, including the trend lines. The miss cost is the average ZEM* over the 100 initial conditions, and the energy cost is the average of the integral of the square of the derived acceleration over each initial condition. The simulation results are as follows. Note that the RL-based law is trained without any controller model and has an energy cost equivalent to that of PNG with a navigation constant of 3–4. In terms of the miss cost, the RL-based law seems less robust to the system. Yet, a time constant of 0.75 is quite a harsh condition. Nevertheless, both have a hit rate of 100% for all initial conditions. The RL-based law and PNG both far exceed the requirement even when engaged with a system that we intended to push them to the limit. Therefore, the robustness of ZEM* does not matter. On the other hand, the energy consumption of the RL-based law seems more robust to the system than that of PNG. As shown in Figures 13–15, the energy cost trend line of PNG with respect to the time constant has a larger gradient than that of the RL-based law. We consider a robustness criterion as follows:

η[%] = (∇fPNG − ∇fRL) / (∇fPNG + ∇fRL) × 100    (16)

where f is the linear trend line of each cost for the guidance law denoted by its subscript, and ∇ indicates the gradient of the trend line. With this robustness criterion, the RL-based law shows 64%, 12%, and 12% better robustness than PNG with N = 2, N = 3, and N = 4, respectively. We suspect that this is because the RL is trained with uncertainties and may produce policies that are robust to the system. Contrary to ZEM* robustness, energy robustness matters in the engagement scenario, because less energy consumption implies a larger radius of action of the missile.
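A sketch of how Equation (16) can be evaluated from the simulated cost curves, assuming the trend lines are linear least-squares fits of energy cost against the time constant; the paper shows trend lines but does not state the fitting method.

```python
import numpy as np

def robustness_gain(taus, energy_png, energy_rl):
    """Equation (16): relative gradient difference of linear trend lines [%]."""
    grad_png = np.polyfit(taus, energy_png, 1)[0]   # slope of the PNG trend line
    grad_rl = np.polyfit(taus, energy_rl, 1)[0]     # slope of the RL trend line
    return (grad_png - grad_rl) / (grad_png + grad_rl) * 100.0
```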


Figure 13. Cost trend for various τ (N = 2).

Figure 14. Cost trend for various τ (N = 3).

Figure 15. Cost trend for various τ (N = 4).

5.3. Additional Experiment

We found that it is possible to make a LOS-rate-only-input missile guidance law via RL. We trained it with the same method and network, but without the V input. This is useful under circumstances in which V can no longer be fed due to sensor faults or unknown hindrances. To make a comparison target, we formulated a PNG with the LOS rate as its only input by merging the previous navigation constant and the expected velocity into a single navigation constant, as follows.

ãc = Ñ Ω    (17)

where Ñ is the navigation constant for LOS-rate-only guidance. The performance comparison, run with the simulation dataset of Figure 10a, is summarized in Table 4 and Figure 16.

Table 4. Energy consumption for each guidance law.

PNG N = 3   PNG Ñ = 1200        RL-Based          PNG N = 4   PNG Ñ = 1600
Ω, Vm       Ω only          Ω, Vm      Ω only     Ω, Vm       Ω only
3142.81     3240.3          3213.42    3280.5     3424.72     3429.86
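As a rough consistency check on the merged constant (our inference; the paper does not state it explicitly): with Vm drawn uniformly from [200, 600] m/s, the expected speed is E[Vm] = 400 m/s, so Ñ = N · E[Vm] gives 3 × 400 = 1200 and 4 × 400 = 1600, matching the Ñ values paired with N = 3 and N = 4 in Table 4.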


Figure 16. Energy cost trend for line-of-sight (LOS)-rate-only guidance for various τ.

Every single simulation easily satisfies the hit condition. Additionally, and understandably, they show poorer energy performance than the intact guidance laws do. Meanwhile, even in this scenario, the RL-based algorithm is more robust to the controller model. Using the same robustness criterion described in Equation (16), the LOS-rate-only RL-based law shows 44% and 10% better robustness, respectively, than LOS-rate-only PNG with Ñ = 1200 and Ñ = 1600. The results of the simulation suggest that the RL-based missile guidance law works better than PNG when the actuation delay goes past a certain threshold.

6. Conclusions

The novel rewarding method, performance, and special features of RL-based missile guidance were presented. We showed that a missile can be trained over a wide environment faster with our rewarding method of a Monte Carlo-based shaped reward than with a temporal-difference training methodology. Additionally, a performance comparison was presented. It showed that the energy-wise performance of our RL-based guidance law and PNG with a navigation constant of 3 to 4 were similar when using a system that acts immediately. Yet, when it comes to the delayed 1st order system, the RL-based law shows better energy-wise system robustness than PNG. Moreover, we showed that the RL-based missile guidance algorithm can be formulated without the input state of V. Additionally, this algorithm has better robustness to the controller models than PNG without V. Its significant expandability is that it can replace PNG and retain good robustness to the controller model without designing a specific guidance law for the controller model of the missile itself. This paper showed that the RL-based law not only can easily replace the existing missile guidance law but is also able to work more efficiently in various circumstances. These special features give insights into the RL-based guidance algorithm, which can be expanded in future work.

Author Contributions: Methodology, D.H., S.P.; software and simulation, D.H., M.K. and S.P.; writing—original draft preparation, D.H. and M.K.; writing—review and editing, D.H., M.K. and S.P.; supervision, S.P.; funding acquisition, S.P. All authors have read and agreed to the published version of the manuscript. Funding: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2019R1F1A1060749). Conflicts of Interest: The authors declare no conflict of interest.

References

1. Fabian, S.T.; Sumner, M.E.; Wardill, T.J.; Rossoni, S.; Gonzalez-Bellido, P.T. Interception by two predatory fly species is explained by a proportional navigation feedback controller. J. R. Soc. Interface 2018, 15. [CrossRef] [PubMed]
2. Brighton, C.H.; Thomas, A.L.R.; Taylor, G.K. Terminal attack trajectories of peregrine falcons are described by the proportional navigation guidance law of missiles. Proc. Natl. Acad. Sci. USA 2017, 114, 201714532. [CrossRef] [PubMed]
3. Gaudet, B.; Furfaro, R. Missile Homing-Phase Guidance Law Design Using Reinforcement Learning. In Proceedings of the AIAA Guidance, Navigation, and Control Conference 2012, Minneapolis, MN, USA, 13–16 August 2012. [CrossRef]

4. Gaudet, B.; Furfaro, R.; Linares, R. Reinforcement Learning for Angle-Only Intercept Guidance of Maneuvering Targets. Aerosp. Sci. Technol. 2020, 99, 105746. [CrossRef] 5. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347v2. 6. Zarchan, P. Tactical and Strategic Missile Guidance, 6th ed.; American Institute of Aeronautics and Astronautics, Inc.: Reston, VA, USA, 2013. [CrossRef] 7. Andrychowicz, M.; Wolski, F.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; MacGrew, B.; Tobin, J.; Abbeel, P.; Zaremba, W. Hindsight Experience Replay. arXiv 2018, arXiv:1707.01495v3. 8. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.P.; Heess, N.H.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous Control with Deep Reinforcement Learning. arXiv 2019, arXiv:1509.02971v6. 9. Gurfil, P. Robust Zero Miss Distance Guidance for Missiles with Parametric Uncertainties. In Proceedings of the 2001 American Control Conference, Arlington, VA, USA, 25–27 June 2001. [CrossRef] 10. Li, H.; Li, H.; Cai, Y. Efficient and accurate online estimation algorithm for zero-effort-miss and time-to-go based on data driven method. Chin. J. Aeronaut. 2019, 32, 2311–2323. [CrossRef] 11. Madany, Y.M.; El-Badawy, E.A.; Soliman, A.M. Optimal Proportional Navigation Guidance Using Pseudo Sensor Enhancement Method (PSEM) for Flexible Interceptor Applications. In Proceedings of the 2016 UKSim-AMSS 18th International Conference on Computer Modelling and Simulation (UKSim), Cambridge, UK, 6–8 April 2016. [CrossRef] 12. Raj, K.D.S. Performance Evaluation of Proportional Navigation Guidance for Low-Maneuvering Targets. Int. J. Sci. Eng. Res. 2014, 5, 93–99. 13. Huber, P.J. Robust Estimation of a Location Parameter. Ann. Math. Stat. 1964, 35, 73–101. [CrossRef]

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).