4378 IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 65, NO. 5, MAY 2018 Fast Functional Safety Verification for Distributed Automotive Applications During Early Design Phase Guoqi Xie , Member, IEEE, Gang Zeng, Member, IEEE, Yan Liu , Jia Zhou , Renfa Li , Senior Member, IEEE, and Keqin Li , Fellow, IEEE
Abstract—Both response time and reliability are impor- I. INTRODUCTION tant functional safety properties that must be simultane- ously satisfied learning from the automotive functional A. Motivation safety standard ISO 26262. Safety verification pertains to checking if an application meets a safe set of design spec- UTOMOTIVE system is a highly safety-critical indus- ifications and complies with regulations. Introducing veri- A trial electronic system. Many active and passive safety fication in the early design phase not only complies with applications have been developed to enhance safe driving, such the latest automotive functional safety standard but also as antilock braking system, brake-by-wire, and adaptive cruise avoids unnecessary design effort or reduces the design control [1]. In particular, the road vehicles—functional safety burden of the late design optimization phase. This study presents a fast functional safety verification (FFSV) method standard ISO 26262 was officially released in 2011 for adapting for a distributed automotive application during the early de- safety of automotive applications [2]–[4]. Functional safety has sign phase. The first method FFSV1 finds the solution with become the preferential direction of automotive system devel- the minimum response time under the reliability require- opment, and it refers to the absence of unreasonable risk caused ment, and the second method FFSV2 finds the solution by systematic failures and random hardware failures [2]. with the maximum reliability under the response time re- quirement. We combine FFSV1 and FFSV2 to create union Safety usually refers to satisfying the response time re- FFSV (UFFSV), which can obtain acceptance ratios higher quirement (i.e., real-time requirement, timing constraint, and than those of current methods. Experiments on real-life deadline constraint) and reliability requirement (i.e., reliability and synthetic distributed automotive applications show that goal, reliability assurance, and reliability constraint) of an appli- UFFSV can obtain higher acceptance ratios than their exist- cation. Safety verification pertains to checking if an application ing counterparts. meets a safe set of design specifications and complies with Index Terms—Automotive functional safety, ISO 26262, regulations. Automotive industry is cost sensitive to the mass verification. market, and thus, the development cost, hardware cost, and re- Manuscript received April 18, 2017; revised July 28, 2017 and Septem- source cost design optimization for safety-critical distributed ber 16, 2017; accepted October 5, 2017. Date of publication October 12, automotive applications have been studied [5]–[8]. However, 2017; date of current version January 16, 2018. This work was supported the aforementioned works only focused on either satisfying the in part by the National Key Research and Development Plan of China under Grant 2016YFB0200405, in part by the National Natural Science response time or reliability requirement rather than functional Foundation of China under Grant 61702172, Grant 61672217, Grant safety requirement. Response time and reliability requirements 61173036, Grant 61379115, Grant 61402170, Grant 61370097, Grant are nonfunctional requirements in requirements engineering dis- 61502162, and Grant 61502405, in part by the CCF-Venustech Open Research Fund under Grant CCF-VenustechRP2017012, in part by the cipline [9]; however, response time and reliability are important CERNET Innovation Project under Grant NGII20161003, and in part by functional safety properties learning from the ISO 26262 stan- the China Postdoctoral Science Foundation under Grant 2016M592422. dard; their requirements must be simultaneously satisfied for (Corresponding author: Yan Liu.) G. Xie, Y. Liu, J. Zhou, and R. Li are with the College of Com- automotive functional safety [2]. Before cost design optimiza- puter Science and Electronic Engineering, Hunan University, Changsha tion, we should verify the feasibility of design optimization. 410082, China, and also with the Key Laboratory for Embedded and Net- If design optimization is infeasible, then designers can avoid work Computing of Hunan Province, Changsha 410082, China (e-mail: [email protected]; [email protected]; [email protected]; unnecessary design effort. If it is feasible, then designers can [email protected]). reduce the design burden by using verification results as basis G. Zeng is with the Graduate School of Engineering, Nagoya Univer- because verification is part of the design process. Introducing sity, Nagoya 4648603, Japan (e-mail: [email protected]). K. Li is with the College of Computer Science and Electronic Engi- verification in the early design phase not only complies with neering, Hunan University, Changsha 410082, China, and also with the the latest automotive functional safety standard but also avoids Department of Computer Science, State University of New York, New unnecessary design effort or reduces the design burden of the Paltz, NY 12561 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available late design optimization phase. online at http://ieeexplore.ieee.org. A directed acyclic graph (DAG) can be used to represent a dis- Digital Object Identifier 10.1109/TIE.2017.2762621 tributed automotive application with end-to-end computation,
0278-0046 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. XIE et al.: FAST FUNCTIONAL SAFETY VERIfiCATION FOR DISTRIBUTED AUTOMOTIVE APPLICATIONS DURING EARLY DESIGN PHASE 4379
Fig. 1. Pareto curve for a bicriteria between response time and expo- sure [10], [11]. in which the nodes represent tasks and the edges represent the communication messages between tasks [1], [12]. The prob- lem is that response time and reliability may not be satisfied simultaneously in practice because increasing reliability intu- itively increases the response time of a DAG-based distributed Fig. 2. Overview of this study for fast functional safety verification. application [10], [11]. ISO 26262 defines the exposure to rep- resent the relative expected frequency of the operational condi- tions, in which hazardous events may occur and cause hazards independently, the functional safety requirement contain- and injuries [2]. That is, reliability is just the inverse expres- ing reliability requirement and response time requirement sion of exposure. Response time minimization and exposure may not be satisfied, because reliability maximization minimization (i.e., reliability maximization) are conflicting pro- and response time minimization are conflicting. cesses, such that verifying functional safety is a bicriteria optima 2) In the early design phase, we propose two FFSV meth- problem. In Fig. 1, each point x1 –x7 represents a solution of a ods. One method called FFSV1 is to find the solution with bicriteria minimization problem [10], [11]. The points x1 , x2 , the minimum response time under the reliability require- x3 , x4 , and x5 are Pareto optima; the points x1 and x5 are weak ment, and the second method called FFSV2 is to find the optima, whereas the points x2 , x3 , and x4 are strong optima. The solution with the maximum reliability under the response set of all Pareto optima is called Pareto curve. In [10], Girault and time requirement. We combine FFSV1 and FFSV2 to Kalla presented a bicriteria scheduling heuristic (BSH) to gen- form a union FFSV (UFFSV). As long as either verifica- erate an approximate Pareto curve of nondominated solutions, tion method returns true, the verification returns true. among which the designers can verify the functional safety by 3) In the late design phase, if at least one of the two meth- finding the points that satisfy the reliability and response time ods can find a solution, then designers can present the requirements simultaneously. However, the time complexity of cost optimization schemes based on the corresponding BSHisashighasO(|N|×2|U |), where |N| and |U| are the num- solutions. ber of tasks and electronic control units (ECUs), respectively. Currently, a high-end automotive system comprises at least 70 C. Contribution of the Study heterogeneous ECUs, and the number of ECUs is expected to increase further in future automotive systems [1], [12]. Consid- The main contributions of this study are to introduce func- ering that the automotive industry is cost-sensitive, shortening tional safety verification in the early design phase for distributed the application’s development cycle to reduce development cost automotive application development and propose two heuristic is crucial. Therefore, a fast functional safety verification (FFSV) verification methods to achieve fast union verification. The de- method with low time complexity should be proposed from a tails are summarized as follows. cost control perspective. 1) The FFSV1 method solves the problem of minimizing re- sponse time under reliability requirement. The problem is divided into two subproblems, namely, satisfying reli- B. Overview of the Study ability requirement and minimizing response time. The A life cycle of industrial software development usually in- first subproblem is solved by transferring the reliability volves the analysis, design, implementation, and testing phases requirement of the application to each task. The second [8], [13]–[15]. In this work, we focus on the early design phase, subproblem is solved by assigning each task to the ECU in which verification is used to determine the feasibility of de- with the minimum earliest finish time (EFT) under sat- sign optimization in advance. The overview of this paper is isfying the reliability requirement of the application. Fi- shown in Fig. 2. The main contents are as follows. nally, the verification result can be judged by comparing 1) In the analysis phase, we first assess the reliability the obtained response time with the given response time requirement in Section III-C and the response time requirement. requirement in Section III-D. Even if the reliability re- 2) The FFSV2 method solves the problem of maximizing re- quirement and the response time requirement are satisfied liability under response time requirement. The problem is 4380 IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 65, NO. 5, MAY 2018
also divided into two subproblems, namely, satisfying re- TABLE I sponse time requirement and maximizing reliability. The NOTATIONS IN THIS STUDY first subproblem is solved by transferring the response time requirement of the application to each task. The sec- Notation Definition
ond subproblem is solved by migrating partial tasks to wi,k WCET of the task ni on the ECU uk the ECUs with the maximum reliability values under sat- ci,j WCRT between the tasks ni and nj ranku (ni ) Upward rank value of the task ni isfying the response time requirement of the application. u n pr(i) Assigned ECU of the task i Finally, the verification result can be judged by com- |X| Size of the set X paring the obtained reliability with the given reliability λk Failure rate of the ECU uk n y requirement. seq(y ) th assigned task of the application R(ni ,uk ) Reliability of the task ni on the ECU uk 3) We combine FFSV1 and FFSV2 to form UFFSV. As long R(ni ) Reliability of the task ni as either verification method returns true, the verification Rreq(ni ) Reliability requirement of the task ni R (n ) n returns true, otherwise returns false. max i Maximum reliability of the task i R(G) Reliability of the application G Rmax(G) Maximum Reliability of the application G II. RELATED WORK Rreq(G) Reliability requirement of the application G RR(G) Reliability ratio of the application G As stated in ISO 26262, random hardware failures (i.e., tran- Rrrp(G) Ratio-based reliability pre-assignment of the application G R (n ) n sient failures in most studies) occur unpredictably during the rrp i Ratio-based reliability pre-assignment of the task i LB(G) Lower bound of the application G life cycle of a hardware element, but they follow a probabil- RT(G) Response time of the application G ity distribution [2]. A widely accepted reliability model was EST(ni ,uk ) Earliest start time of the task ni on the ECU uk presented by Shatz and Wang [16], in which the transient fail- EFT(ni ,uk ) Earliest finish time of the task ni on the ECU uk AST(ni ) Actual start time of the task ni ure of each hardware follows a constant-parameter Poisson law AFT(ni ) Actual finish time of the task ni [8], [10]. RTreq(ni ) Response time requirement of the task ni (G) G Automotive industry is cost-sensitive to the mass market, as RTreq Response time requirement of the application pointed out in Section I-A. The development cost, hardware cost, and resource consumption cost design optimization for safety-critical distributed automotive applications were studied in [5]–[8]. In [5] and [6], Gan et al. presented development cost minimization to satisfy the response time requirement by pre- senting genetic algorithm-based and tabu search-based meta- heuristics, respectively. In [7], Gu et al. presented hardware cost minimization to satisfy the response time and security re- quirements. Despite the introduction of these methods, both Fig. 3. Integration architecture of automotive electronic systems [1]. reliability and response time requirements must be simultane- ously satisfied learning from automotive functional safety stan- dard. Response time minimization and reliability maximization A. System Architecture and Models (i.e., exposure minimization) are conflicting in scheduling a Each application is implemented by a single ECU in an early DAG-based application, and optimizing them is a bicriteria op- automotive architecture where ECUs use point-to-point commu- timization problem as pointed out in Section I-A [10]. In [17], nication with a low degree of coupling. The next federated archi- Zhao et al. presented a shared recovery-based frequency as- tecture enables the exchange of information between different signment technique to reduce energy consumption with a re- applications through the network. New integrated architecture liability requirement and a response time requirement for a is the mainstream architecture of today’s automobiles [18], [19]. DAG-based distributed application on a single processor. How- We consider a distributed integrated architecture where several ever, multiprocessors have been used in most embedded sys- processors are mounted on the same controller area network tems, such as automotive embedded systems. In [8], Xie et al. (CAN) bus, as shown in Fig. 3 [6]. A task executed completely solved the problem of minimizing the resource consumption in one ECU sends messages to all its successor tasks, which cost by satisfying the reliability requirement without using fault- may be located in the different ECUs. For example, task n1 is tolerance on heterogeneous embedded systems. In [11], Xie executed on ECU u1 . It then sends a message m1,2 to its suc- et al. solved the problem of minimizing the redundancy by cessor task n2 located in u6 . U = {u1 ,u2 , ..., u|U |} represents satisfying the reliability requirement using fault-tolerance on a set of heterogeneous ECUs, where |U| represents the size of heterogeneous service-oriented systems. Considering the lim- set U. Note that for any set X, this study uses |X| to denote its ited resources of embedded systems, fault-tolerance may be size. unsuitable [8]. A distributed application is represented by a DAG G =(N, W , M, C) [1], [20], where the related parameters are described III. MODELS as follows. Table I lists notations and their definitions that are used in 1) N represents a set of tasks in G, and ni ∈ N repre- this study. sents the ith task of G.pred(ni ) represents the set of the XIE et al.: FAST FUNCTIONAL SAFETY VERIfiCATION FOR DISTRIBUTED AUTOMOTIVE APPLICATIONS DURING EARLY DESIGN PHASE 4381
TABLE II WCETSOFTASKS ON DIFFERENT ECUSOFTHEMOTIVATING DISTRIBUTED APPLICATION [8], [20]
Task u1 u2 u3
n1 14 16 9 n2 13 19 18 n3 11 13 19 n4 13 8 17 n5 12 13 10 n6 13 16 9 n7 71511 n8 51114 n9 18 12 20 n10 21 7 16 Fig. 4. Motivating example of a DAG-based distributed application with ten tasks [8], [20].
weight 14 of n1 and u1 in Table II represents WCET of n1 on immediate predecessor tasks of ni , whereas succ(ni ) rep- u1 , denoted by w1,1 = 14. The same task has different WCETs n resents the set of the immediate successor tasks of i .The on different ECUs due to the heterogeneity of the ECUs. The n task with no predecessor task is denoted by entry, whereas motivating example will be used to explain the proposed ver- n the task with no successor task is denoted by exit. Con- ification methods in the paper. For simplicity, all units of all sidering an automotive application (e.g., brake-by-wire) parameters are ignored in the example. may be released by receiving collected data from multi- ple sensors and is completed by sending the performing B. Reliability Model action to multiple actuators, multiple nentry or multiple nexit tasks may exist. To adapt to the application model There are two major temporal types of failures, namely, the with only one entry task and one exit task, a dummy entry transient failure (i.e., random hardware failures) and the per- or exit task with zero-weight dependencies is added to the manent failure [8], [10]. This study only considers the transient graph in this case. failure of ECUs because the automotive functional safety stan- 2) W is a |N|×|U| matrix, where wi,k denotes the dard ISO 26262 only combines the random hardware failures worse-case execution time (WCET) of ni on the ECU and reliability together [2]. ISO 26262 specifies that random uk . Each task ni ∈ N has different WCET values on hardware failures occur unpredictably during the life cycle of a different ECUs due to the heterogeneity of ECUs [21]. hardware but follows a probability distribution [2]. In general, All the WCETs of the tasks are determined through anal- transient fault for a task in a DAG-based distributed application ysis methods performed (i.e., WCET analysis) during the follows the Poisson distribution [8], [10]. The reliability of an analysis phase [1]. event in unit time t is denoted by R (t)=e−λt , where λ is the 3) The communication between tasks mapped to different constant failure per time unit (i.e., failure rate) for an ECU [16]. ECUs is performed through message passing over the bus. We let λk represent the constant failure rate per time unit of Hence, M is a set of communication edges, and each edge the ECU uk , and the reliability of ni executed on uk and its mi,j ∈ M represents the communication message from execution time is denoted by ni to nj . Accordingly, ci,j ∈ C represents the worst-case −λk w i,k response time (WCRT) of mi,j [1]. All the WCRTs of R (ni ,uk )=e . (1) the messages are also determined through analysis meth- ods performed (i.e., WCRT analysis) during the analysis The reliability of the DAG-based distributed application is cal- phase [1]. culated by [8], [10] Scheduling in automotive systems can be either preemptive R(G)= R(n ,u ) (e.g., OSEKTime) or nonpreemptive (e.g., eCos) [1]. Consider- i pr(n i ) (2) ing that many DAG-based distributed application scheduling n i ∈ N algorithms generally use nonpreemptive scheduling [1], [8], u n where pr(n i ) represents the assigned ECU of i .AsCANbus [20], we consider nonpreemptive scheduling for ECUs in this has high fault-tolerance capacity, this study only considers ECU study. Of course, the method of this paper can also be applied failure and does not factor the communication failure into the to preemptive scheduling. problem (i.e., communication is assumed reliable). Fig. 4 shows a motivating distributed application with tasks and messages [8], [20]. The example shows ten tasks executed C. Reliability Requirement Assessment on three ECUs {u1 ,u2 ,u3 }. The weight 18 of the edge between n1 and n2 represents WCRT, denoted by c1,2 if n1 and n2 are As the WCET of each task on each ECU has been deter- not assigned to the same ECU. Table II presents the WCET mined by the WCET analysis method during the analysis phase, matrix |N|×|U| of tasks on different ECUs. For example, the the maximum reliability value of task ni can be obtained by 4382 IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 65, NO. 5, MAY 2018
TABLE III UPWARD RANK VALUES FOR TASKS OF THE MOTIVATING DISTRIBUTED APPLICATION
Task n1 n2 n3 n4 n5 n6 n7 n8 n9 n10
ranku 108 77 80 80 69 63.3 42.7 35.7 44.3 14.7 (ni )
Fig. 5. Task scheduling for lower bound calculation. traversing all the ECUs, and it is calculated by R (n )=maxR(n ,u ). max i i k (3) ni nj u (ni ) > u (nj ) u k ∈U that two tasks and satisfy rank rank ; if no precedence constraint exists between ni and nj , As the reliability of application G is the product of reliability ni does not necessarily take precedence for nj to be values of all the tasks, [see (2)], the maximum reliability value assigned. Therefore, the task assignment order in G is of application G is calculated by {n1 ,n3 ,n4 ,n2 ,n5 ,n6 ,n9 ,n7 ,n8 , and n10}. 2) Response time minimization. The attributes EST(ni ,uk ) Rmax(G)= Rmax (ni ). (4) and EFT(ni ,uk ) represent the earliest start time (EST) n i ∈N and EFT, respectively, of task ni on ECU uk .EFT(ni ,uk ) The reliability requirement Rreq(G) must be less than or is considered the task allocation criterion because it can equal to Rmax(G). Hence, Rreq(G) must have the following meet the local optimal of each task. The aforementioned constraint: attributes are calculated as follows: ⎧ ⎪ R (G) Rmax(G) (5) ⎪ req ⎪ ⎨⎪ (n ,u )=0; otherwise, the reliability requirement assessment cannot be EST entry k ⎧ ⎫ ⎪ ⎪ [k], ⎪ passed. ⎪ ⎨⎪ avail ⎬⎪ u ⎪ (n ,u )=max For example, we assume that the failure rates of ECUs 1 , ⎩⎪EST i k ⎪ ⎪ ⎩⎪ max {AFT(nh )+ch,i} ⎭⎪ u2 , and u3 are λ1 =00.0005,.0002, λ2 =00.0002,.0005, and λ3 =0.0009,re- n h ∈ pred(n i ) spectively. We can calculate that the maximum reliability value (7) of the application is Rmax(G)=0.974335. We let reliability and requirement Rreq(G)=0.96, which can pass the assessment, EFT(ni ,uk )=EST(ni ,uk )+wi,k. (8) because avail[k] is the earliest available time when ECU uk is 0.96 0.974335 ready for task execution. AFT(nh ) is the actual finish according to (5) of the reliability requirement assessment. time (AFT) of task nh . ch,i represents the WCRT between nh and ni .Ifnh and ni are allocated to the same ECU, D. Response Time Requirement Assessment then ch,i =0; otherwise, ch,i = ch,i. ni is allocated to the ECU with the minimum EFT by using the insertion- Considering the problem of mapping tasks to multiple pro- based scheduling strategy, where ni can be inserted into cessors (ECUs) is NP-hard [22], obtaining a minimum end-to- the slack with the minimum EFT. end response time of a distributed automotive application is an The response time requirement assessment processes are as NP-hard optimization problem [20]. The heterogeneous EFT follows. (HEFT) algorithm is a widely accepted heuristic list-scheduling 1) The response time obtained by the HEFT algorithm rep- algorithm for minimizing response time [20], and it is applied resents the application’s lower bound. The lower bound for the response time requirement assessment of distributed au- means the minimum response time of a distributed ap- tomotive applications [1]. The two-phase HEFT algorithm has plication by the HEFT algorithm and is calculated as the following important steps. follows: 1) Prioritizing task. The HEFT algorithm uses the upward rank value (ranku ) of a task [see (6)] as the task priority LB(G)=AFT(nexit) (9) standard. In this case, the tasks are arranged according to where nexit represents the exit task as mentioned earlier. the descending order of ranku , which is obtained by Fig. 5 shows the scheduling of lower bound calculation ranku (ni )=wi +max{ci,j + ranku (nj )} (6) of the motivating example, where LB(G)=80. Note that n ∈ (n ) j succ i the arrows in Fig. 5 represent the generated communica- where wi represents the average WCET of task ni . tions between precedence constrained tasks. Table III shows the upward rank values of all the tasks in 2) A known response time requirement RTreq(G) (i.e., dead- Fig. 4. Note that only if all the predecessors of ni have line) is then provided for the application on the basis of been assigned will ni prepare to be assigned. We assume the actual physical time requirement after hazard analysis XIE et al.: FAST FUNCTIONAL SAFETY VERIfiCATION FOR DISTRIBUTED AUTOMOTIVE APPLICATIONS DURING EARLY DESIGN PHASE 4383