PREPRINT VERSION. PUBLISHED IN IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS - https://ieeexplore.ieee.org/abstract/document/9144488.

arXiv:2003.03327v3 [cs.CE] 1 Jan 2021

Smart Train Operation Algorithms based on Expert Knowledge and Reinforcement Learning

Kaichen Zhou, Student Member, IEEE, Shiji Song, Member, IEEE, Anke Xue, Member, IEEE, Keyou You, Member, IEEE, and Hui Wu, Student Member, IEEE

Abstract—During recent decades, the automatic train operation (ATO) system has been gradually adopted in many subway systems for its low cost and intelligence. This paper proposes two smart train operation algorithms by integrating the expert knowledge with reinforcement learning algorithms. Compared with previous works, the proposed algorithms can realize the control of continuous action for the subway system and optimize the energy efficiency without using an offline optimized speed profile. Firstly, through analyzing the historical data of experienced subway drivers, we extract the expert knowledge rules and build inference methods to guarantee the riding comfort, the punctuality, and the safety of the subway system. Then we develop two algorithms for optimizing the energy efficiency of train operation. One is the smart train operation (STO) algorithm based on deep deterministic policy gradient (STOD) and the other is the smart train operation algorithm based on normalized advantage function (STON). Finally, we verify the performance of the proposed algorithms via some numerical simulations with the real field data collected from the Yizhuang Line of the Beijing Subway, and the results illustrate that the developed smart train operation algorithms are better than expert manual driving in terms of energy efficiency. Moreover, STOD and STON can adapt to different trip times and different resistance conditions.

Index Terms—Smart train operation, subway, expert knowledge, reinforcement learning.

This work is funded by the key project of the National Natural Science Foundation of China under Grant 61936009 and the National Major Research Program under Grant 2018AAA0101604. This work was done when the first author was with the Department of Automation, Tsinghua University, under the supervision of Prof. Shiji Song and Dr. Keyou You.
K. Zhou is with the Department of Computer Science, University of Oxford, Oxford OX1 2JD, UK (e-mail: [email protected]).
S. Song, K. You and H. Wu are with the Department of Automation and BNRist, Tsinghua University, Beijing 100084, China (e-mail: [email protected]; [email protected]).
A. Xue is with the Key Laboratory for IOT and Information Fusion Technology of Zhejiang, Institute of Information and Control, Hangzhou Dianzi University, Hangzhou 310018, China (e-mail: [email protected]).

I. INTRODUCTION

With the deterioration of modern urban traffic problems and energy problems, the urban subway is getting more and more attention due to its advantages on safety, punctuality, and efficiency [1]. Since the first subway started operation from Paddington to Farringdon on January 9th, 1863, more than 150 cities around the world were hosting approximately 160 subway systems until 2015. Meanwhile, with the acceleration of the modernization process and the development of artificial intelligence, the automatic train operation (ATO) system has been used in many places to replace the manual driving of the subway for its low cost and intelligence. The ATO system first generates the target speed curve based on various driving requirements and the railway condition, and then sends the control command to control the train to track the generated target speed curve [2]. Thus, the ATO system has a direct influence on the train's trajectory, and improving its performance has become a major research focus in the field of transportation. In most cases, the research committed to the urban automatic train operation can be divided into two parts: designing the off-line optimized energy-efficient train trajectory, and tracking the real-time speed-distance profile of the train.

During recent years, lots of studies are devoted to designing an off-line optimized train trajectory to improve the energy efficiency. For example, Albrecht et al. used perturbation analysis to show that the energy function is strictly convex with a unique local minimum and that the optimal key switching points of each section are uniquely defined [3]. Besides, the train operation problem also involves many other aspects, such as riding comfort and punctuality. In order to minimize the total travel time of passengers and the energy consumption of train operation, Wang et al. used a new iterative convex programming (ICP) approach to solve the train scheduling problem and obtain the optimal departure times, running times and dwell times [4]. Focusing on the energy saving and the service quality, Yang et al. formulated a two-objective integer programming model with dwell time control and headway time, and found the optimal solution with a genetic algorithm with binary encoding [5]. ShangGuan et al. developed a multi-objective optimization model for searching the optimal speed trajectory under different track characteristics [6]. With the consideration of variable gradients and arbitrary speed limits, Khmelnitsky studied the optimal control problem of train operation to obtain the optimal velocity profile [7]. With the development of artificial intelligence, many intelligent algorithms have been used in the train operation. Açıkbaş et al. used artificial neural networks with a genetic algorithm and integrated simulation-based methodologies to obtain the coasting points of the speed-distance trajectory, in order to obtain minimum energy consumption for a given travel time [8]. Yang et al. constructed a numerical algorithm to reduce the calculation difficulties and seek the approximate optimal coasting control strategies on the railway network [9]. Yin et al. used the reinforcement learning (Q-learning) method to construct an intelligent train operation system that can meet multiple objectives [10].

Moreover, tracking the real-time train speed profile is also a critical research topic. For example, Liu et al. proposed a high-speed railway control system based on the fuzzy control method and designed the expert control system in the Matlab software according to the expert experience and knowledge [11]. Wu et al. implemented a state observer-based adaptive fuzzy controller to approximate the unknown system parameters, and thus the trajectory tracking problem of a series of two-wheeled self-balancing vehicles can be addressed [12]. Gu et al. proposed a new energy-efficient train operation model based on real-time traffic information from the geometric and topographic points of view through a nonlinear programming method [13]. Recently, Li et al. designed a robust sampled-data cruise control scheduling with the form of linear matrix inequality (LMI) and proposed numerical examples that verified the effectiveness of the proposed algorithms to track the trajectory precisely [14].

Despite great achievements in previous studies, there are still some essential problems unsolved, which blocks the development of the ATO system. Firstly, for multiple objectives of train operation, most researchers mentioned above had just taken one or two objectives into account, and there is no comprehensive analysis about designing optimal train operation to meet multiple objectives. Secondly, the modern subway is capable of outputting continuous traction and braking force [15]; however, there are rare researches devoted to designing the control model for continuous action while considering complicated train operation conditions, such as variable speed limits. Thirdly, in the ATO system, the optimized speed profile was designed before the operation of the train, and the train is controlled to track the designed optimized speed profile during the trip time, which largely decreases the flexibility and the robustness of the ATO system. Plus, it is hard to implement complicated mathematic optimal methods to treat the nonlinear train operation problem, thus it is necessary to design a model that can realize train control without considering an offline optimized speed profile. Finally, the real subway operation is faced with many unexpected situations, such as the changed trip time of one subway which influences the timetable of the whole line, and the railway aging which changes the railway resistance condition. Being faced with these problems, the modern subway always transfers from the ATO system to manual driving, which largely decreases the intelligence and efficiency of train operation. From the analysis above, the contributions of the paper can be listed as follows:
• Multiple objectives of train operation are summarized and relevant evaluation indices are formulated. Through analyzing references, we summarize the expert knowledge rules and build the inference methods. They are systematically combined with reinforcement learning algorithms to help the algorithms have better performance.
• We establish STOD and STON based on deep deterministic policy gradient (DDPG) and normalized advantage function (NAF). On the one hand, reinforcement learning can realize model-free control. On the other hand, both DDPG and NAF are able to deal with control tasks of continuous action.
• The effectiveness of STOD and STON is verified by using the field data of the Yizhuang Line of the Beijing Subway (YLBS). The performance of the proposed STOD and STON is compared with the existing intelligent train operation algorithm in [10] and the manual driving, which illustrates that both STOD and STON have better performance than that of ITOR and manual driving. And through conducting different numerical simulations, the flexibility and the robustness of STOD and STON are proved.

The rest of this paper is organized as follows. In Section II, we define the necessary mathematic indices of train movement problems and formulate multiple objectives of train operation into numerical evaluation indices to systematically evaluate the train operation problem. Plus, we state the objectives of this paper. In Section III, the structure of STO is presented. Then we put forward expert knowledge rules and summarize inference methods. Besides, the principles and the algorithms of STOD and STON are explained. In Section IV, the simulation platform is built and three numerical simulations are made based on the real field data of YLBS. Conclusions are summarized in Section V.

II. PROBLEM FORMULATION

In general, the problem of train control can be formulated as an optimal control problem with focus on finding an optimal control strategy for traction and braking force during the trip time. Firstly, we define ∆t as the minimum time interval, and the trip time of the train can be described as follows:

t_{i+1} = t_i + ∆t,  (1)

for 0 ≤ i ≤ n − 1. The total trip time T is defined as

T = t_n − t_0,  (2)

where the initial running time t_0 = 0 s.

A. Control model of train

The motion of the train is determined by the output force, the resistance caused by the gradient of the railway, the resistance to motion, the curve resistance and the resistance caused by the interactive impacts among the vehicles. According to Newton's second law, its movement equation is defined as

M(1 + η)u = F − F_g − F_r − F_c − F_d,  (3)

where M represents the static mass of the train; η is the rotating factor of the train which is defined as η = M_η/M, and M_η is the reduced mass of the train's rotators; u is the acceleration or the deceleration; F is the outputted traction force or braking force of the subway; F_g is the resistance caused by the gradient; F_r is the resistance to motion given by the Davis equation; F_c is the curve resistance and F_d is the interactive impacts among the vehicles. The force diagram of the train is shown in Fig. 1.

Fig. 1: The force diagram of the train.

Their definitions can be described as follows:

F_g = Mg sin(α(s)),  (4)

where α(s) represents the slope angle of the railway at position s.

F_r = d_1 + d_2 v + d_3 v²,  (5)

where v is the velocity; d_1, d_2 and d_3 are vehicle-specific coefficients which are measured by the run-down experiments [16].

F_c = 6.3M/[r(s) − 55],  (6)

where r(s) is the radius of the curve at the position s [17].

F_d = Σ_{i=1}^{k−1} (∆l̈_i Σ_{j=i+1}^{k} m_j),  (7)

where k is the number of vehicles; ∆l̈_i denotes the second derivative of the distance between the center of the ith vehicle and the reference point [18]; m_j is the static mass of the jth vehicle, and the static mass M of the whole train can be described as M = Σ_{j=1}^{k} m_j.

Because there exist nonlinearity and time delay in a train control model, Eq. (8) gives the transfer function of the accelerating and decelerating process [19]:

u = [u_0/(1 + T_d s)] e^{−T_c s},  (8)

where u represents the train's actual acceleration or deceleration; u_0 is the accelerating or decelerating performance gain; T_d and T_c represent the time delay and the time constant of the accelerating or the decelerating process.
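To make the longitudinal model above concrete, the following is a minimal Python sketch of Eqs. (3)-(6) together with one Euler integration step over the interval ∆t of Eq. (1). The Davis coefficients, the rotating-mass factor and all helper names are illustrative assumptions, not parameters reported in the paper, and the inter-vehicle term F_d of Eq. (7) and the actuator lag of Eq. (8) are omitted for simplicity.

import math

# Illustrative constants (assumed placeholders, not identified DKZ32 values)
M = 1.99e5          # static train mass (kg)
ETA = 0.08          # assumed rotating-mass factor eta
G = 9.81
D1, D2, D3 = 2000.0, 50.0, 6.0   # assumed Davis coefficients d1, d2, d3

def gradient_resistance(slope_rad: float) -> float:
    """F_g = M g sin(alpha(s)), Eq. (4)."""
    return M * G * math.sin(slope_rad)

def motion_resistance(v: float) -> float:
    """Davis-type resistance F_r = d1 + d2 v + d3 v^2, Eq. (5)."""
    return D1 + D2 * v + D3 * v * v

def curve_resistance(radius: float) -> float:
    """F_c = 6.3 M / (r(s) - 55), Eq. (6); treated as zero on straight track."""
    return 0.0 if radius <= 0 else 6.3 * M / (radius - 55.0)

def acceleration(F: float, slope_rad: float, v: float, radius: float) -> float:
    """Solve Eq. (3) for u, ignoring the inter-vehicle term F_d."""
    F_net = F - gradient_resistance(slope_rad) - motion_resistance(v) - curve_resistance(radius)
    return F_net / (M * (1.0 + ETA))

def step(s: float, v: float, F: float, slope_rad: float, radius: float, dt: float = 1.0):
    """One Euler step of length dt (Eq. (1)): advance position and speed."""
    u = acceleration(F, slope_rad, v, radius)
    return s + v * dt, max(0.0, v + u * dt)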


B. The indices of model evaluation

In general, the subway control model is evaluated from four aspects, i.e., the safety [20], the punctuality [21], the energy consumption [22] and the passenger comfort [23], which are defined as follows:

• Safety. There may exist several speed limits between two successive subway stations and a general case is presented in Fig. 2, where v_1^limit, v_2^limit, v_3^limit and v_4^limit are four speed limits of different sections between two stations. During the trip time, the velocity of the train must be inferior to the speed limit of the current section to guarantee safety. The safety evaluation index I_s is defined as

I_s = { 1, if v_i ≤ v_i^limit for all i; 0, if v_i > v_i^limit for some i }.  (9)

• Punctuality. Punctuality is a very important factor in train operation. As the time interval between two adjacent subways is very short, the accidental delay problem may influence the timetable of the whole line. We first define the running time error e_t as

e_t = |T_actual − T_planning|,  (10)

where T_actual is the actual running time of the train, and T_planning is the planning trip time of the train. In this paper, if the running time error e_t is superior to 3 s, the subway is not punctual. Thus the definition of the punctuality evaluation index I_t is given as

I_t = { 1, if e_t ≤ 3; 0, if e_t > 3 }.  (11)

Fig. 2: Speed limits.

• Energy efficiency. The energy efficiency is one of the focuses of modern society and the energy consumption makes up a large part of the cost of train operation. These concerns make energy efficiency play a core role in our control model design. According to [24], the equation to calculate the consumed energy E is described as

E = Σ_{i=1}^{n} (M|u_i| v(t_i) ∆t).  (12)

Based on the equation of consumed energy, we define the energy efficiency evaluation index I_e as

I_e = E/M.  (13)

• Riding comfort. The riding comfort is a direct evaluation criterion for train service quality [25], and it guarantees that the instantaneous change of acceleration or deceleration should stay below a certain threshold. We define the jerk, or the rate of change of acceleration, ∆u as

∆u_i = |(u_i − u_{i−1})/∆t|.  (14)

Thus the riding comfort evaluation index I_c can be defined as

I_c = Σ_{i=1}^{n} { 0, if ∆u_i ≤ ∆U′; ∆u_i, if ∆u_i > ∆U′ },  (15)

where ∆U′ is the threshold for the change of acceleration; in this case, ∆U′ = 0.30 m/s³ as proposed in [26].
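As an illustration, the four indices defined in this subsection can be computed directly from a sampled trajectory. The sketch below assumes a uniform time step ∆t and hypothetical argument names; it is not part of the authors' simulation code.

def evaluation_indices(v, u, v_limit, dt, T_planning, M, dU_max=0.30):
    """Compute (I_s, I_t, I_e, I_c) of Eqs. (9)-(15) from per-step samples.

    v, u, v_limit : lists of speed, acceleration and local speed limit per step
    dt            : uniform time interval delta t (s)
    """
    n = len(v)
    T_actual = n * dt
    I_s = int(all(vi <= li for vi, li in zip(v, v_limit)))          # Eq. (9)
    e_t = abs(T_actual - T_planning)                                 # Eq. (10)
    I_t = int(e_t <= 3.0)                                            # Eq. (11)
    E = sum(M * abs(ui) * vi * dt for ui, vi in zip(u, v))           # Eq. (12)
    I_e = E / M                                                      # Eq. (13)
    jerks = [abs(u[i] - u[i - 1]) / dt for i in range(1, n)]         # Eq. (14)
    I_c = sum(j for j in jerks if j > dU_max)                        # Eq. (15)
    return I_s, I_t, I_e, I_c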

C. Problem statement

Two designed STO algorithms, including STOD and STON, are supposed to achieve four purposes. Firstly, the STO algorithms can provide the control strategy for the traction force and the braking force which can meet the basic requirements, including the safety and the punctuality of train operation. Secondly, the STO algorithms can perform properly without considering an offline designed speed profile and can realize the control of the continuous force. Thirdly, the control strategy outputted by the STO algorithms can outperform experienced subway drivers in the aspect of energy efficiency while ensuring good riding comfort. Last, the STO algorithms can adapt to different situations including different trip times and different resistance conditions.

The fact that the existing ATO system has to track the designed offline speed profile while the modern subway can output continuous traction and braking force has motivated our study. Moreover, reinforcement learning has been applied to many fields to deal with the model-free problem [27] and expert knowledge has also been largely used to improve control strategy [24]. Hence, we have put forward two STO algorithms based on the fusion of expert knowledge and reinforcement learning.

III. DESIGN OF INTELLIGENT TRAIN CONTROL MODEL

In this section, we will give a detailed explanation of the STO algorithms, including the structure of STO, the expert knowledge rules, the inference methods, and the principles of STOD and STON.

A. The structure of intelligent control model

The structure of STO is shown in Fig. 3. We can learn that the STO model contains three phases. The first phase is to obtain expert knowledge and to develop inference methods which are essential for building a stable model. The second phase is to integrate the expert knowledge and heuristic inference methods into the reinforcement learning algorithms. The third phase is to train the designed algorithms and get a stable model.

For the real-world application of STO, it usually includes the last two steps. Above all, we will follow the three phases of STO to establish a stable model before putting it into practice. Then, during the trip time of the train equipped with STO, the system will accept the real-time information about its position, its velocity and its running time obtained from onboard sensors, and will send the command about the control of traction or braking force.

B. Expert knowledge and heuristic inference rules

As the train control problem has features of non-linearity and complexity, it is hard to design an ideal model without taking expert knowledge into account. And knowledge-based technology has been successfully applied to solve complicated optimal control problems [28]. According to the previous works [1] [24] [29], we found that those optimal train operation methods always follow certain expert rules, which can be listed as:
• The train operation has three states, including the accelerating, the coasting, and the braking, as shown in Fig. 2. Unless encountering special accidents, the train wouldn't transfer directly from the accelerating state to the braking state and vice versa. The transfer between any other two states is allowable.
• For the sake of protecting the engine and ensuring the riding comfort, the acceleration of the train shouldn't exceed 0.6 m/s² when the subway starts its operation.

Besides, experienced drivers know well when the train should decelerate to guarantee the safety of train operation, and there is no designed speed profile for STO. Thus, we develop the heuristic inference method to ensure that the designed model will work properly. The heuristic inference method is listed as:
• When the velocity limit v_{j+1}^limit of the next section j+1 is less than the velocity limit v_j^limit of the current section j, as shown in Fig. 3, the train may have to rationally brake to guarantee the safety of the train. In other words, the speed of the train should always be inferior to the speed limit. In this case, we define the safe velocity v_i^safe to supervise the speed of the train as

v_i^safe = β √((v_{j+1}^limit)² − 2u_min(s_{j+1}^limit − s_i)),  (16)

where s_i is the current position of the train; s_{j+1}^limit is the starting position of the next section; β is a speed proportional coefficient caused by the time delay and the friction of the railway [1]; and u_min is the minimum deceleration. In this case, u_min = −1 m/s². If the current velocity v_i is superior or equal to v_i^safe, the train should adopt the minimum deceleration immediately.

Fig. 4: Safety velocity.
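A minimal sketch of the supervisory rule in Eq. (16) is given below. The value of β is an assumption for illustration (the paper does not report it), while u_min = −1 m/s² follows the text above; the function names are hypothetical.

import math

def safe_velocity(s_i, s_next_limit, v_next_limit, beta=0.95, u_min=-1.0):
    """Supervisory speed of Eq. (16).

    s_i          : current train position (m)
    s_next_limit : starting position of the next speed-limit section (m)
    v_next_limit : speed limit of the next section (m/s)
    """
    return beta * math.sqrt(v_next_limit ** 2 - 2.0 * u_min * (s_next_limit - s_i))

def supervised_action(u_proposed, v_i, v_safe, u_min=-1.0):
    """If the current speed has reached the safe velocity, brake immediately."""
    return u_min if v_i >= v_safe else u_proposed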


Fig. 3: STO model.
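The two expert rules of Section III-B can be expressed as a simple action filter applied before an action is executed, which is how they enter the STO pipeline of Fig. 3. The sketch below is illustrative only; in particular, the 10 s "starting" window is an assumption, since the paper does not specify how long the starting phase lasts.

def expert_filter(u_proposed, u_prev, t_elapsed, u_start_cap=0.6):
    """Apply the expert rules of Section III-B to a proposed acceleration.

    Rule 1: no direct jump between accelerating and braking; insert a
            coasting step (u = 0) instead.
    Rule 2: limit acceleration to 0.6 m/s^2 while the train is starting
            (the 10 s starting window below is an assumed value).
    """
    u = u_proposed
    if (u_prev > 0 and u < 0) or (u_prev < 0 and u > 0):
        u = 0.0
    if t_elapsed < 10.0 and u > u_start_cap:
        u = u_start_cap
    return u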

C. Continuous action control methods based on reinforcement learning

As the expert knowledge cannot allow the agent to perform better than experienced drivers and it does not have a learning process, we combine the expert knowledge with the reinforcement learning methods. In this way, it can ensure that the outputted control strategy meets the basic requirements, and it can also have the possibility to find an optimal solution. Moreover, as two popular reinforcement learning algorithms for the control of continuous action, DDPG and NAF have their advantages in different fields. To compare their performance, in this work we have designed STOD and STON.

Reinforcement learning can allow agents to automatically take proper actions by maximizing their reward [30]. As a powerful decision-making tool [31], reinforcement learning has been used to deal with optimal control problems in many different fields, such as aerobatic helicopter flight control [32], playing robot soccer games [33], power systems stability control [34] and so on. There are two reasons which drive us to adopt reinforcement learning in the train control task. Firstly, some reinforcement learning algorithms can realize the control of continuous action [35], which can improve the current control strategy for discrete action in the ATO system. Secondly, reinforcement learning pays attention to long-term rewards, while the train's current action also influences its follow-up steps.

1) Markov decision process: Before applying the reinforcement learning algorithm, we formulate our problem into a Markov decision process (MDP), which provides a mathematical framework for decision making. The critical elements of reinforcement learning include its state, action, policy, and reward, which are defined as follows:
• State x. In this case, the position and the velocity, two important train movement factors, make up the state. Thus, it can be described as

x_i = [s_i, v_i],  (17)

where 0 ≤ i ≤ n. And the initial state x_0 is defined as

x_0 = [0, 0].  (18)

• Action a. During the trip time, the acceleration u_i is defined as the action, and the range of acceleration is defined as u_i ∈ [−1, 1] for the subway operation in YLBS. Thus the action a is defined as

a_i = [u_i].  (19)

• Policy π. The policy π denotes the probability of taking an action when dealing with a discrete action task. In this paper, as STO is designed to deal with the continuous action control task, the policy π is the statistics of the probability distribution. It can be expressed as

π(a|x, θ) = N(µ(x, θ), σ(x, θ)),  (20)

where θ is the weight.
• Reward function r(x_i, a_i). This function defines the reward obtained by the train when it takes an action at a certain state. In this case, our reward function is defined by the energy consumption per weight ∆I_e during the time interval ∆t when the train takes the action a_i at the state x_i, and the time error e′_{ti}. The time error e′_{ti} is used to ensure that the agent should arrive at the destination within the planning trip time rather than spending too much time on running at low speed to minimize the energy consumption. The reward function is defined in

r(x_i, a_i) = −λ_1 ∆I_e − λ_2 e′_{ti} − λ_3 ∆I_c − λ_4 D − λ_5 Acc,  (21)

where λ_1, λ_2, λ_3, λ_4 and λ_5 are the coefficients defined to meet different requirements of the system; the time error e′_{ti} at the moment t_i is defined as

e′_{ti} = { 1, if t_i > T_planning; 0, if t_i ≤ T_planning };  (22)

D is used to check whether the train has arrived at the destination and stopped at the correct position. Its expression can be written as

D = { 1, if s_i > S_Destination; 0, if s_i ≤ S_Destination };  (23)

Acc is used to guarantee that the running time is in the expected range and it can be defined as

Acc = D · |t_i − T_planning|.  (24)

Noticing that D equals 1 only when the train has arrived, this equation can be used to calculate the difference between the whole running time and the planning trip time.
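The reward of Eqs. (21)-(24) translates almost directly into code. A minimal sketch, using the weight values reported later in Section IV and hypothetical argument names:

def reward(dI_e, dI_c, t_i, s_i, T_planning, S_destination,
           lambdas=(0.13, 30.0, 10.0, 400.0, 70.0)):
    """Reward of Eq. (21) with the auxiliary terms of Eqs. (22)-(24).

    dI_e, dI_c : incremental energy-per-mass and comfort penalties of this step
    lambdas    : (lambda_1, ..., lambda_5), the weights reported in Section IV
    """
    l1, l2, l3, l4, l5 = lambdas
    e_t = 1.0 if t_i > T_planning else 0.0            # Eq. (22)
    D = 1.0 if s_i > S_destination else 0.0           # Eq. (23)
    acc = D * abs(t_i - T_planning)                   # Eq. (24)
    return -l1 * dI_e - l2 * e_t - l3 * dI_c - l4 * D - l5 * acc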

2) STOD: The STOD algorithm is based on the reinforcement learning algorithm DDPG, which is an actor-critic, model-free algorithm that can deal with continuous action control problems, based on the policy-gradient algorithm [31]. The reinforcement learning setup consists of an agent interacting with an environment E, and we denote the discounted state visitation distribution for a policy π as ρ^π. In DDPG, the critic network is used to estimate the action-value function, while the actor network is used to improve the policy function with the help of the critic network. Besides, we use θ^Q to represent the weight of the action-value function Q(x, a|θ^Q) and use θ^µ to represent the weight of the policy function a = µ(x|θ^µ). The loss function L for the critic network is described in Eq. (25), and θ^Q is updated through minimizing the loss function:

L(θ^Q) = E_{x_i∼ρ^β, a_i∼β, r_i∼E}[(Q(x_i, a_i|θ^Q) − y_i)²],  (25)

where β represents a stochastic behavior policy, and the target value y_i is described as

y_i = r(x_i, a_i) + γQ(x_{i+1}, µ(x_{i+1})|θ^Q),  (26)

where γ describes the discount rate.

The return from a state x_i is defined as the sum of the future discounted rewards

R_i = Σ_{j=i}^{n} γ^{j−i} r(x_j, a_j).  (27)

The goal of the actor network is to maximize the return from the start distribution, J = E_{r_i, x_i∼E, a_i∼π}[R_1]. In traditional Q-learning, the network Q(x, a|θ^Q) is used to calculate the target value y_i and is also updated based on the target value. This method will increase the instability of the Q network, as during the training process the Q network is constantly updated. If we use a constantly changing value as our target value to update the network, the feedback loops between the target and estimated Q-values will destabilize the Q network [36] [37]. To solve this problem, the target network is implemented. In DDPG, there is a target actor network µ′(x|θ^{µ′}) and a target critic network Q′(x, a|θ^{Q′}). Their weights θ^{µ′} and θ^{Q′} are updated by the following equations:

θ^{µ′} ← τθ^µ + (1 − τ)θ^{µ′},  (28)

and

θ^{Q′} ← τθ^Q + (1 − τ)θ^{Q′},  (29)

where τ ≪ 1. It indicates that the weights of the two target networks are updated more slowly than the weights of the actor network and the critic network, which can improve the stability of the learning process. The STOD algorithm is given in Algorithm 1.

Algorithm 1 STOD algorithm
  // Initialize parameters of STOD
  Randomly initialize actor network µ(x|θ^µ) and initialize target actor network θ^{µ′} ← θ^µ
  Randomly initialize critic network Q(x, a|θ^Q) and initialize target critic network θ^{Q′} ← θ^Q
  Initialize replay buffer R ← ∅
  // Execute the networks
  for episode = 1, M do
    Initialize a random process N for action exploration
    Initialize observation state x_0 ← [0, 0]
    for i = 1, N do
      Obtain action a_i = µ(x_i|θ^µ) + N_i
      Verify the obtained action a_i with the expert knowledge and inference method; if action a_i doesn't meet those requirements, adjust the obtained action a_i
      Execute action a_i and observe reward r_i and state x_{i+1} according to the subway motion equation
      Store transition (x_i, a_i, r_i, x_{i+1}) in buffer R
      // Update the weights
      Randomly sample a minibatch of N transitions (x_j, a_j, r_j, x_{j+1}) from buffer R
      Calculate: y_j = r_j + γQ′(x_{j+1}, µ′(x_{j+1}|θ^{µ′})|θ^{Q′})
      Update the critic network by minimizing the loss function: L = (1/N) Σ_j (Q(x_j, a_j|θ^Q) − y_j)²
      Update the actor network through the policy gradient: ∇_{θ^µ}J ≈ (1/N) Σ_j ∇_a Q(x, a|θ^Q)|_{x=x_j, a=µ(x_j)} ∇_{θ^µ} µ(x|θ^µ)|_{x_j}
      Update the target networks:
        θ^{Q′} ← τθ^Q + (1 − τ)θ^{Q′}
        θ^{µ′} ← τθ^µ + (1 − τ)θ^{µ′}
    end for
  end for
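For reference, one critic/actor update with soft target updates (Eqs. (25)-(29)) might look as follows in PyTorch. This is a generic DDPG sketch under assumed network interfaces, not the authors' implementation.

import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_t, critic_t,
                actor_opt, critic_opt, gamma=0.99, tau=1e-3):
    """One critic/actor update with soft target updates, Eqs. (25)-(29).

    batch = (x, a, r, x_next) as float tensors; the network objects are
    assumed to map states (and actions) to values as described in the text.
    """
    x, a, r, x_next = batch

    # Target value y_i computed with the target networks (cf. Algorithm 1)
    with torch.no_grad():
        y = r + gamma * critic_t(x_next, actor_t(x_next))

    # Critic loss, Eq. (25)
    critic_loss = F.mse_loss(critic(x, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: ascend Q(x, mu(x)) via the deterministic policy gradient
    actor_loss = -critic(x, actor(x)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft target updates, Eqs. (28)-(29)
    for t_p, p in zip(actor_t.parameters(), actor.parameters()):
        t_p.data.mul_(1 - tau).add_(tau * p.data)
    for t_p, p in zip(critic_t.parameters(), critic.parameters()):
        t_p.data.mul_(1 - tau).add_(tau * p.data)
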
3) STON: The STON algorithm is based on the reinforcement learning algorithm NAF, which is a reinforcement learning method designed for continuous control tasks and works as an alternative to commonly used policy gradient and actor-critic methods such as DDPG. Plus, it allows users to use the Q-learning method to deal with control tasks of continuous action, thus the STON algorithm is simpler than the STOD algorithm. Q-learning is not suitable for dealing with continuous action tasks, as it should maximize a complex, nonlinear function at each update. The idea behind NAF is to represent the Q-function Q(x_i, a_i) in the way that its maximum, argmax_a Q(x_i, a_i), can be determined easily during the Q-learning update [38]. In NAF, the neural network outputs separately the value function V(x) and the advantage term A(x, a), which are defined as follows:

Q(x, a|θ^Q) = A(x, a|θ^A) + V(x|θ^V),  (30)

and

A(x, a|θ^A) = −(1/2)(a − µ(x|θ^µ))^T P(x|θ^P)(a − µ(x|θ^µ)),  (31)

and P(x|θ^P) is a state-dependent, positive-definite square matrix. With the Cholesky decomposition method, it can be described as

P(x|θ^P) = L(x|θ^P) L(x|θ^P)^T,  (32)

where L(x|θ^P) is a lower-triangular matrix outputted by the neural network. And the network is updated by minimizing its loss function L = (1/N) Σ_i (y_i − Q(x_i, a_i|θ^Q))². In this algorithm, the target network will also be introduced to improve the stability of the learning process. And the STON algorithm is described in Algorithm 2.

Algorithm 2 STON algorithm
  // Initialize parameters of STON
  Randomly initialize normalized Q network Q(x, a|θ^Q) and initialize target Q′ network θ^{Q′} ← θ^Q
  Initialize replay buffer R ← ∅
  // Execute the networks
  for episode = 1, M do
    Initialize a random process N for action exploration
    Initialize observation state x_0 ← [0, 0]
    for i = 1, N do
      Obtain action a_i = µ(x_i|θ^µ) + N_i
      Verify the obtained action a_i with the expert knowledge and inference method; if action a_i doesn't meet those requirements, adjust the obtained action a_i
      Execute action a_i and observe reward r_i and state x_{i+1} according to the subway motion equation
      Store transition (x_i, a_i, r_i, x_{i+1}) in buffer R
      // Update the weights
      for iteration = 1, I do
        Randomly sample a minibatch of m transitions (x_j, a_j, r_j, x_{j+1}) from buffer R
        Calculate: y_j = r_j + γV′(x_{j+1}|θ^{Q′})
        Update the critic network by minimizing the loss function: L = (1/N) Σ_j (y_j − Q(x_j, a_j|θ^Q))²
        Update the target network: θ^{Q′} ← τθ^Q + (1 − τ)θ^{Q′}
      end for
    end for
  end for
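The NAF construction of Eqs. (30)-(32) can be sketched as follows. The head names and tensor shapes are assumptions; practical NAF implementations usually also exponentiate the diagonal of L to keep P strictly positive definite, a detail the paper does not discuss.

import torch

def naf_q_value(mu, V, L_entries, a):
    """Reassemble Q(x, a) from the NAF network heads, Eqs. (30)-(32).

    mu        : (batch, act_dim) action that maximizes Q
    V         : (batch, 1) state value
    L_entries : (batch, act_dim * (act_dim + 1) // 2) lower-triangular entries
    a         : (batch, act_dim) action whose value is evaluated
    """
    batch, act_dim = mu.shape
    # Build the lower-triangular matrix L(x), then P = L L^T (Eq. (32))
    L = torch.zeros(batch, act_dim, act_dim)
    tril = torch.tril_indices(act_dim, act_dim)
    L[:, tril[0], tril[1]] = L_entries
    P = L @ L.transpose(1, 2)
    # Quadratic advantage term, Eq. (31)
    diff = (a - mu).unsqueeze(-1)                       # (batch, act_dim, 1)
    A = -0.5 * (diff.transpose(1, 2) @ P @ diff).squeeze(-1)
    # Q = A + V, Eq. (30)
    return A + V
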

4) ITOR: The ITOR algorithm is a new ATO algorithm proposed in [10], which employs the deep Q-learning algorithm to construct the framework. Due to the limitation of deep Q-learning, ITOR can only realize the discrete action control of the train; its performance will be compared with STOD and STON in the later experiments.

IV. SIMULATIONS

To verify the effectiveness, the flexibility and the robustness of STOD and STON, we have designed three numerical simulation experiments based on real field data collected from YLBS. The YLBS started operation on 30th December 2010 in Beijing. The total length of YLBS is up to 23.3 km and it starts from Songjiazhuang station and ends at Ciqu station as in Fig. 5. The type of train used in YLBS is the DKZ32 EMU, which has six vehicles, and its parameters are given in Table I.

TABLE I: Parameters of DKZ32
Parameter              Value
M (kg)                 1.99 × 10^5
m_i, i = 1, 6 (kg)     3.3 × 10^4
m_i, i = 3 (kg)        2.8 × 10^4

Fig. 5: The YLBS line (station labels recoverable from the figure include Songjiazhuang, Xiaocun, Xiaohongmen and Wenhuayuan).

During the training process of ITOR, STOD, and STON, we employ the Adam optimizer and set the learning rate as 1 × 10^{-4} for training all the networks, except the critic network of STOD whose learning rate is 5 × 10^{-5}. The τ in the target network is set as 1 × 10^{-3}. The discount factor for the value function is 0.99 and the size of the mini-batch for the memory replay is 256. As to the weight coefficients of the reward function, λ_1 = 0.13, λ_2 = 30, λ_3 = 10, λ_4 = 400 and λ_5 = 70.

For ITOR, it has five hidden layers. The first hidden layer has 400 units; the second layer has 300 units; the third layer has 200 units; the fourth layer has 100 units and the last layer has 32 units. Each one of the first four hidden layers is followed by a ReLU activation function and the last hidden layer doesn't have any activation layer. For the parameters of STOD, both its actor network and critic network have five hidden layers. The first layer has 400 units; the second layer has 300 units; the third layer has 200 units; the fourth layer has 100 units and the last layer has 32 units. Each one of the first four hidden layers is followed by a ReLU activation function; the last hidden layer of the actor network is followed by a Tanh activation function; the last hidden layer of the critic network doesn't have any activation layer. The target actor network shares the same structure as the actor network, and the target critic network shares the same structure as the critic network. For STON, it has five hidden layers. The first hidden layer has 400 units; the second layer has 300 units; the third layer has 200 units; the fourth layer has 100 units and the last layer has 32 units. Each one of the first four hidden layers is followed by a Tanh activation function. The target network shares the same structure.
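One plausible reading of the layer description above is the following PyTorch sketch. Whether the 32-unit layer is itself the output head or is followed by a separate output layer is not stated explicitly, so the construction below is an assumption.

import torch.nn as nn

def mlp(in_dim, out_dim, hidden=(400, 300, 200, 100, 32), out_act=None):
    """Five-hidden-layer MLP matching the layer sizes reported above.

    The first four hidden layers use ReLU (Tanh would be used for STON);
    out_act is Tanh for the STOD actor head and None for critic/Q heads.
    """
    layers, last = [], in_dim
    for i, h in enumerate(hidden):
        layers.append(nn.Linear(last, h))
        if i < len(hidden) - 1:          # last hidden layer has no activation
            layers.append(nn.ReLU())
        last = h
    layers.append(nn.Linear(last, out_dim))
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

# Example: an actor that maps the 2-D state [s, v] to one bounded acceleration
actor = mlp(in_dim=2, out_dim=1, out_act=nn.Tanh())
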
In this section, we will present the simulation results of three cases. In case 1, a comparison between the manual driving data of the experienced driver, ITOR, STOD, and STON is derived. In case 2, we test the flexibility of ITOR, STOD, and STON by changing the planning trip time of the same railway section. In case 3, we alter the gradient condition of the railway section to verify the robustness of ITOR, STOD, and STON.

A. Case 1

In the first case, we use the field data collected on the railway section between Rongjing East Street station and Wanyuan Street station in YLBS. The whole length of this section is 1280 m and the planning trip time is 101 s. The speed limit information of this section is shown in Fig. 6 and the gradient condition of this section is described in Fig. 7. In order to find the best manual driving data, we have analyzed 100 groups of up trains and down trains of this section from May 1, 2015, to May 27, 2015.

Fig. 8 shows the training process for ITOR, STOD, and STON. Due to the definition of our reward, the reward always begins with a negative number. From Fig. 8, all three algorithms reach a relatively stable phase after 1050 episodes. However, one thing worth noticing is that, compared with ITOR, STOD and STON always perform better with a higher reward. Even after 1750 episodes, the ITOR algorithm still fluctuates a lot. Thus STOD and STON are more stable than ITOR. Fig. 9 shows the running time of the three algorithms. Within our expectation, we find that ITOR has less running time due to its simple structure and its discrete action space. STOD and STON have longer running times because of their more complicated network structures and their continuous action space.

Fig. 10 presents the speed-distance profiles for the 101 s trip time of the four models. We can learn that the speed profile of the manual driving can be obviously divided into a full accelerating phase, a coasting phase, and a full braking phase. As to the maximum speed, the manual driving speed profile has the maximum speed 18.86 m/s. The speed-distance profile of ITOR has the highest maximum speed compared with the other three profiles. Its profile can be divided into four phases including a full accelerating phase, an accelerating phase, a coasting phase, and a full braking phase. Its maximum speed is 18.98 m/s. The speed-distance profiles of STOD and STON are similar. Both of them have a lower maximum speed than that of manual driving and ITOR, which indicates that they may have lower energy consumption. The maximum speed of STOD is 18.08 m/s, while the maximum speed of STON is 17.93 m/s.

Table II gives the comparison in the aspect of the four evaluation indices and the running time. We can learn that the punctuality evaluation indices of the four frameworks satisfy the requirement of YLBS. As to the safety evaluation indices, all four models meet the requirement. Among the four models, ITOR has the largest energy consumption. Compared with the manual driving, ITOR costs 1.7% more energy than the manual driving; STOD costs 9.4% less energy than the manual driving and STON costs 11.7% less energy than the manual driving. In the aspect of the riding comfort for the four models, the manual driving has the highest I_c, which indicates the worst passenger comfort, while ITOR, STOD, and STON have similar values of the riding comfort evaluation index which are much smaller than that of the manual driving. Among them, STON has the best I_e and I_c with the trip time 101 s.


Fig. 6: Speed limits between Rongjing East Street station and Wanyuan Street station.

Fig. 9: Running time for the learning process of the ITOR, STOD and STON.

Fig. 7: Gradient condition between Rongjing East Street station and Wanyuan Street station.

Fig. 10: Comparison of speed distance profile with 101s trip time.

TABLE II: Comparison of performance with different trip time and gradient condition

Evaluation Indices        t      It   Is   Ie       Ic
Manual Driving (101s)     102s   1    1    586.82   9.05
ITOR Driving (101s)       102s   1    1    597.21   4.60
STOD (101s)               102s   1    1    531.77   4.58
STON (101s)               102s   1    1    518.03   4.56
Manual Driving (95s)      96s    1    1    811.77   14.00
ITOR Driving (95s)        97s    1    1    854.24   8.8
STOD (95s)                96s    1    1    741.01   5.84
STON (95s)                96s    1    1    740.29   5.27
Manual Driving (115s)     116s   1    1    325.05   7.50
ITOR Driving (115s)       116s   1    1    326.58   3.80
STOD (115s)               116s   1    1    344.76   5.80
STON (115s)               115s   1    1    320.06   4.01
ITOR (New Gradient)       102s   1    1    619.11   4.60
STOD (New Gradient)       102s   1    1    568.56   4.41
STON (New Gradient)       103s   1    1    467.62   2.23

Fig. 8: Learning curve of the ITOR, STOD and STON.


Fig. 11: Comparison of speed distance profile with 95s trip time. Fig. 12: Comparison of speed distance profile with 115s trip time. technical accident and the large crowd during the morning and evening can largely influence the trip time of the subway. To more than manual driving; STOD costs 6.1% of energy more than ensure the normal operation of the whole line, the subway needs the manual driving and STON costs 1.5% of energy less than the to change its control strategy. We conducted two simulations with manual driving. The comfort evaluation index of STOD is higher different trip times, including one with 95s planning trip time than that of the other three models, as its acceleration changes and one with 115s planning trip time. In this subsection, we will several times during the accelerating phase. This time, the Ic of compare the performance of manually driving, ITOR, STOD, and ITOR is lower than that of STON while the Ie of STON is lower STON with different planning trip times. than that of ITOR. Fig. 11 presents speed distance profiles for 95s trip time of the Through the analysis above, we can conclude that ITOR, four models. Similar to the case with 101s trip time, the speed- STON, and STOD can produce rational control strategy and distance profile of ITOR has the highest maximum speed, which satisfy all requirements when the planning trip time is changed, indicates that ITOR has the largest energy consumption. The thus the flexibility of ITOR, STOD, and STON is proved. speed profiles of STOD and STON are very similar, as they have the same maximum speed 20.99m/s. Compared with the speed- C. Case 3 distance profile under 101s planning trip time, they have a higher maximum speed and short coasting distance which indicates To test the robustness of STOD and STON, we will change higher energy consumption and worse passenger comfort. the gradient condition in the same railway section of YLBS. Even We can learn from the Table. II that compared with the manual though in most cases, the gradient condition is stable, other factors driving, ITOR costs 5.2% of energy more than manual driving; like the wet weather and the railway aging are able to change STOD costs 8.7% of energy less than the manual driving and the resistance condition of the railway. In this case, through STON costs 8.8% of energy less than the manual driving. In the changing the gradient condition, we can simulate the situation of aspect of the riding comfort for four approaches, manual driving the changing resistance, which can be used to test the robustness has the largest Ic, while ITOR, STOD, and STON have similar Ic of STOD and STON. The new gradient condition is shown in which is much smaller than that of the manually driving method. Fig. 13. Among them, STON has the best Ic and Ie with the trip time Fig. 14 presents the comparison of speed distance profiles with 95s. the new gradient condition. The maximum speed of the ITOR is Fig. 12 presents the comparison of speed distance profiles with 18.76m/s; the maximum speed of the STOD is 18.09m/s and 115s trip time. The maximum speed of the manually driving is the maximum speed of the STON is 16.97m/s. We can learn 14.73m/s; the maximum speed of the ITOR is 14.69m/s; the that ITOR has the highest maximum speed than the other two maximum speed of the STOD is 15.07m/s and the maximum methods. speed of the STON is 14.65m/s. We can learn that the STOD We can learn from the Table. II that all three models satisfy has the highest maximum speed than the other three methods. 
the requirement of punctuality and safety. With the same planning Compared with their speed distance profiles under 95s and 101s trip time, the energy consumption of ITOR and STOD are a little planning trip time, their maximum speeds are much lower than higher than that in Case 1, as new gradient condition increases the that of previous cases which denotes that they have lower average resistance of the section where subway accelerates and decreases speed and lower energy consumption. the resistance of the section where subway decelerates. The speed We can learn from the Table. II that in terms of punctuality profile given by the STON has lower energy consumption, while and safety, all four models satisfy the requirements. As to energy the main reason is that the running time of STON is 103s rather consumption, the STOD has the highest energy consumption. than 101s, which is 2s later than expected. However, according Compared with the manually driving, ITOR costs 0.5% of energy to the definition of punctuality which indicates that time errors

9 PREPRINT VERSION. PUBLISHED IN IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS - https://ieeexplore.ieee.org/abstract/document/9144488.

We can learn from Table II that all three models satisfy the requirement of punctuality and safety. With the same planning trip time, the energy consumption of ITOR and STOD is a little higher than that in Case 1, as the new gradient condition increases the resistance of the section where the subway accelerates and decreases the resistance of the section where the subway decelerates. The speed profile given by STON has lower energy consumption, while the main reason is that the running time of STON is 103 s rather than 101 s, which is 2 s later than expected. However, according to the definition of punctuality, which indicates that time errors inferior to 3 s are allowable, STON still provides a good result. The comfort evaluation indices of ITOR and STOD are similar to those in Case 1, whereas the comfort evaluation index of STON is lower than that in Case 1, as both the accelerating phase and the decelerating phase of the STON speed-distance profile in this case are smoother than those in Case 1. We can learn from the results listed above that ITOR, STOD, and STON are capable of providing satisfactory results even when the resistance condition varies, hence the robustness of ITOR, STOD, and STON is verified.

Fig. 13: New gradient condition between Rongjing East Street station and Wanyuan Street station.

Fig. 14: Comparison of speed distance profile with new gradient condition.

V. CONCLUSION

To build an intelligent train operation model that can deal with the control task for the continuous action of the subway, we propose two algorithms, STOD and STON, which integrate the expert knowledge of experienced drivers with reinforcement learning methods. Firstly, we collect enough driving data of experienced drivers, from whom we extract expert knowledge and build inference methods. Then we integrate the expert knowledge rules and inference methods into the reinforcement learning algorithms DDPG and NAF. Finally, three case studies based on YLBS are used to illustrate the effectiveness, the flexibility and the robustness of STOD and STON. The performance of the proposed STOD and STON is compared with the performance of the existing ATO algorithm. The results show that STOD and STON perform better than the manual driving and the existing ATO algorithm. The two proposed models own certain flexibility and robustness, which allows them to deal with the variability of subway operation tasks. Moreover, STON generally performs better than STOD in the comparisons of the simulation results for the three cases listed above.

Besides the feature of dealing with control problems for continuous action, STOD and STON also make use of the expert knowledge and inference methods, which largely increases the stability of the algorithms. In addition, STOD and STON can meet multiple objectives of train operation and realize model-free train operation control.

However, despite these advantages mentioned above, the proposed models are still improvable. For example, the flexibility of STOD and STON is limited. If a great change in the planning trip time is made, STOD and STON cannot generate a desirable control strategy. Moreover, it is hard to apply these models to high-speed train cases with long distances and complicated speed limits between two successive stations, which decreases the convergence speed of the learning process of the algorithms and also increases the instability of the models. Our future research will focus on these aspects.

VI. ACKNOWLEDGEMENT

The first author really appreciates the support and the company from his colleagues, Peng Jiang and Zeyu Zhao, in the Department of Automation of Tsinghua University.

REFERENCES

[1] J. Yin, D. Chen, and Y. Li, "Smart train operation algorithms based on expert knowledge and ensemble cart for the electric locomotive," Knowledge-Based Systems, vol. 92, pp. 78–91, 2016.
[2] F. Corman and L. Meng, "A review of online dynamic models and algorithms for railway traffic management," IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 3, pp. 1274–1284, 2015.
[3] A. R. Albrecht, P. G. Howlett, P. J. Pudney, and X. Vu, "Energy-efficient train control: from local convexity to global optimization and uniqueness," Automatica, vol. 49, no. 10, pp. 3072–3078, 2013.
[4] Y. Wang, B. Ning, T. Tang, T. J. Van Den Boom, and B. De Schutter, "Efficient real-time train scheduling for urban rail transit systems using iterative convex programming," IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 6, pp. 3337–3352, 2015.
[5] X. Yang, B. Ning, X. Li, and T. Tang, "A two-objective timetable optimization model in subway systems," IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 5, pp. 1913–1921, 2014.
[6] W. ShangGuan, X.-H. Yan, B.-G. Cai, and J. Wang, "Multiobjective optimization for train speed trajectory in CTCS high-speed railway with hybrid evolutionary algorithm," IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 4, pp. 2215–2225, 2015.
[7] E. Khmelnitsky, "On an optimal control problem of train operation," IEEE Transactions on Automatic Control, vol. 45, no. 7, pp. 1257–1266, 2000.
[8] S. Açıkbaş and M. Söylemez, "Coasting point optimisation for mass rail transit lines using artificial neural networks and genetic algorithms," IET Electric Power Applications, vol. 2, no. 3, pp. 172–182, 2008.
[9] L. Yang, K. Li, Z. Gao, and X. Li, "Optimizing trains movement on a railway network," Omega, vol. 40, no. 5, pp. 619–633, 2012.
[10] J. Yin, D. Chen, and L. Li, "Intelligent train operation algorithms for subway by expert system and reinforcement learning," IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 6, pp. 2561–2571, 2014.
[11] W. Liu, J. Han, and X. Lu, "A high speed railway control system based on the fuzzy control method," Expert Systems with Applications, vol. 40, no. 15, pp. 6115–6124, 2013.


[12] T.-S. Wu, M. Karkoub, C.-C. Weng, and W.-S. Yu, "Trajectory tracking for uncertainty time delayed-state self-balancing train vehicles using observer-based adaptive fuzzy control," Information Sciences, vol. 324, pp. 1–22, 2015.
[13] Q. Gu, T. Tang, F. Cao, and Y.-d. Song, "Energy-efficient train operation in urban rail transit using real-time traffic information," IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 3, pp. 1216–1233, 2014.
[14] S. Li, L. Yang, K. Li, and Z. Gao, "Robust sampled-data cruise control scheduling of high speed train," Transportation Research Part C: Emerging Technologies, vol. 46, pp. 274–283, 2014.
[15] R. R. Liu and I. M. Golovitcher, "Energy-efficient operation of rail vehicles," Transportation Research Part A: Policy and Practice, vol. 37, no. 10, pp. 917–932, 2003.
[16] B. P. Rochard and F. Schmid, "A review of methods to measure and calculate train resistances," Proceedings of the Institution of Mechanical Engineers, Part F: Journal of Rail and Rapid Transit, vol. 214, no. 4, pp. 185–199, 2000.
[17] Y. Wang, B. De Schutter, T. J. van den Boom, and B. Ning, "Optimal trajectory planning for trains–a pseudospectral method and a mixed integer linear programming approach," Transportation Research Part C: Emerging Technologies, vol. 29, pp. 97–114, 2013.
[18] Q. Song, Y.-d. Song, T. Tang, and B. Ning, "Computationally inexpensive tracking control of high-speed trains with traction/braking saturation," IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 4, pp. 1116–1125, 2011.
[19] D. Chen, R. Chen, Y. Li, and T. Tang, "Online learning algorithms for train automatic stop control using precise location data of balises," IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 3, pp. 1526–1535, 2013.
[20] W. Zheng and H. Xu, "Modeling and safety analysis of maglev train over-speed protection based on stochastic petri nets," Journal of the Society, vol. 31, no. 4, pp. 59–64, 2009.
[21] N. O. Olsson and H. Haugland, "Influencing factors on train punctuality—results from some norwegian studies," Transport Policy, vol. 11, no. 4, pp. 387–397, 2004.
[22] M. Miyatake and H. Ko, "Optimization of train speed profile for minimum energy consumption," IEEJ Transactions on Electrical and Electronic Engineering, vol. 5, no. 3, pp. 263–269, 2010.
[23] K. Karakasis, D. Skarlatos, and T. Zakinthinos, "A factorial analysis for the determination of an optimal train speed with a desired ride comfort," Applied Acoustics, vol. 66, no. 10, pp. 1121–1134, 2005.
[24] R. Cheng, D. Chen, B. Cheng, and S. Zheng, "Intelligent driving methods based on expert knowledge and online optimization for high-speed trains," Expert Systems with Applications, vol. 87, pp. 228–239, 2017.
[25] J. Powell and R. Palacín, "Passenger stability within moving railway vehicles: limits on maximum longitudinal acceleration," Urban Rail Transit, vol. 1, no. 2, pp. 95–103, 2015.
[26] L. L. Hoberock, "A survey of longitudinal acceleration comfort studies in ground transportation vehicles," Journal of Dynamic Systems, Measurement, and Control, vol. 99, no. 2, pp. 76–84, 1977.
[27] F. Ruelens, S. Iacovella, B. J. Claessens, and R. Belmans, "Learning agent for a heat-pump thermostat with a set-back strategy using model-free reinforcement learning," Energies, vol. 8, no. 8, pp. 8300–8318, 2015.
[28] S. Murrell and R. T. Plant, "A survey of tools for the validation and verification of knowledge-based systems: 1985–1995," Decision Support Systems, vol. 21, no. 4, pp. 307–323, 1997.
[29] N. Zhao, C. Roberts, S. Hillmansen, Z. Tian, P. Weston, and L. Chen, "An integrated metro operation optimization to minimize energy consumption," Transportation Research Part C: Emerging Technologies, vol. 75, pp. 168–182, 2017.
[30] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning. MIT Press, Cambridge, 1998, vol. 135.
[31] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.
[32] P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng, "An application of reinforcement learning to aerobatic helicopter flight," in Advances in Neural Information Processing Systems, 2007, pp. 1–8.
[33] Y. Duan, Q. Liu, and X. Xu, "Application of reinforcement learning in robot soccer," Engineering Applications of Artificial Intelligence, vol. 20, no. 7, pp. 936–950, 2007.
[34] D. Ernst, M. Glavic, and L. Wehenkel, "Power systems stability control: reinforcement learning framework," IEEE Transactions on Power Systems, vol. 19, no. 1, pp. 427–435, 2004.
[35] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[36] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[37] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[38] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, "Continuous deep q-learning with model-based acceleration," in International Conference on Machine Learning, 2016, pp. 2829–2838.

Kaichen Zhou received the Master degree in Machine Learning in the Department of Computing at Imperial College London. He is currently pursuing the Ph.D. degree in Computer Science in the Department of Computer Science at the University of Oxford. His research interests include deep learning and reinforcement learning.

Shiji Song received the Ph.D. degree in Mathematics from the Department of Mathematics, Harbin Institute of Technology, Harbin, China, in 1996. He is currently a Professor with the Department of Automation, Tsinghua University, Beijing, China. His current research interests include system modeling, control and optimization, computational intelligence, and pattern recognition.

Anke Xue received the Ph.D. degree in Automation from the College of Control Science and Engineering, Zhejiang University, Hangzhou, China, in 1997. He is currently a Professor and the President of Hangzhou Dianzi University, Hangzhou. His research interests include robust control theory and applications.

Keyou You received the B.S. degree in Statistical Science from Sun Yat-sen University, Guangzhou, China, in 2007 and the Ph.D. degree in Electrical and Electronic Engineering from Nanyang Technological University (NTU), Singapore, in 2012. He is now an Associate Professor in the Department of Automation, Tsinghua University, Beijing, China. He held visiting positions at Politecnico di Torino, Hong Kong University of Science and Technology, University of Melbourne, etc. His current research interests include networked control systems, parallel networked algorithms, and their applications.


Hui Wu received the B.S. degree in Automation from the Department of Automation, Tsinghua University, Beijing, China, in 2014, where he is currently pursuing the Ph.D. degree in control science and engineering with the Department of Automation, Institute of System Integration, Tsinghua University. His current research interests include reinforcement learning and robot control, especially in continuous control for underwater vehicles.
