Evaluation of Message Missing Failures in FlexRay-based Networks with Star Topology

Abstract

This paper evaluates the error propagation and its effects on message missing in a FlexRay-based network with star topology. The evaluation is based on 35680 bit-flip fault injections inside different parts of the FlexRay communication controller. To do this, a FlexRay communication controller is modeled in Verilog HDL at the behavioral level and is used to set up a FlexRay-based network composed of four nodes. The results of fault injection show that about 39% of faults lead to message missing failures. The clock synchronization process and the controller host interface of the FlexRay were the most sensitive to the message missing failure. The coding and decoding unit of the FlexRay was the least sensitive to this failure.

1. Introduction

Safety in distributed systems such as automotive systems and avionics is of decisive importance because system failures may threaten human life. In a distributed system, each node consists of three parts [1]: 1) I/O part, 2) host part, and 3) communication controller. Among these three parts, the communication controller has a key role in the operation of the distributed system.

In general, communication activities can be triggered either dynamically, in response to an event (event-triggered), or statically, at predetermined moments in time (time-triggered). Examples of time-triggered protocols are SAFEbus [2], SPIDER [3], and the Time-Triggered Protocol (TTP) [4]. The main drawback of the time-triggered protocols is their lack of flexibility [5]. Examples of event-triggered protocols are Byteflight [6], introduced by BMW for automotive applications, CAN [7], LonWorks [8] and PROFIBUS [9]. The main drawback of the event-triggered protocols is their lack of predictability. A large consortium of automotive manufacturers and suppliers has proposed a hybrid type of protocol, namely, the FlexRay communication protocol [10]. FlexRay allows the sharing of the bus among event-triggered and time-triggered messages, thus offering the advantages of both protocols. It is reported that FlexRay will very likely become the de-facto standard for in-vehicle communications [5] [11]. FlexRay defines a communication cycle (bus cycle) as the combination of a time-triggered (or static) window, an event-triggered (or dynamic) window, a symbol window and a network idle time (NIT) window. The time-triggered window is similar to TTP, and employs a time-division multiple-access (TDMA) mechanism. The event-triggered window of the FlexRay protocol is similar to the Byteflight protocol and uses a flexible TDMA (FTDMA) bus access method. The symbol window is a communication period in which a symbol can be transmitted on the network. The NIT window is a communication-free period that marks the end of each communication cycle.

The importance of safety in critical distributed applications calls for specific attention to the reliability of communication protocols. One way to evaluate the reliability of communication protocols is to use fault injection to assess the vulnerability of such protocols. Fault injection techniques can be classified into two main categories [12]: 1) hardware-based fault injection [13], and 2) software-based fault injection [14]. The latter can in turn be divided into software-implemented fault injection (SWIFI) and simulation-based fault injection [14]. In simulation-based fault injection, faults are injected into the simulation model of circuits written in HDL languages [14] or other languages such as C++ [15].

In [16], simulation-based fault injection has been used for the assessment of message missing in the CAN protocol. Effects of masquerade failures have been investigated using simulation-based fault injection in the CAN protocol [17]. Evaluation of the TTP/C communication controller by heavy-ion fault injection (hardware-based fault injection) has been performed in [18]. The purpose of the experiments in that paper was to validate the fail silence property of the TTP/C by injecting faults in a single node. The relationship between the number of nodes in a cluster and the slightly-off-specification (SOS) failures has been assessed using heavy-ion fault injection [19]. In [20], the TTP/C protocol with bus and star topologies has been investigated using SWIFI fault injection. There, the effects of the SOS failures in the bus and star topologies with respect to the start of frame transmission have been studied. In [21] [22], a generic tool was developed for monitoring and diagnosis of a FlexRay-based system as well as a CAN-based system. This tool has been used by the FlexRay consortium to perform extended fault injection for evaluating the FlexRay communication protocol. One important limitation of this tool is that faults cannot be injected inside different parts of the FlexRay protocol.

This paper evaluates the error propagation and message missing failures in the FlexRay protocol with star topology. It evaluates the conditions under which faults in the FlexRay protocol disturb the sending or receiving of messages at a node and cause a message not to be sent or received. In this condition a message missing failure occurs. This evaluation is done by 35680 bit-flip fault injections inside different parts of the FlexRay protocol. To do this, a FlexRay communication controller was modeled in Verilog HDL at the behavioral level. A FlexRay-based network composed of four nodes was established using this controller in a star topology. The evaluations are done in two phases. In the first phase, the percentages of faults resulting in three kinds of errors, namely, content errors, syntax errors and boundary violation errors, are characterized. The most sensitive and the least sensitive points of the FlexRay protocol to faults are identified. Then, in the second phase, by considering the error propagation results, the message missing failures are evaluated. In this phase the relationship between the error propagation and the message missing failure results is analyzed. Also, the message missing failure rates that occur in the time-triggered and event-triggered windows of the FlexRay communication cycle are assessed. The dependency of this failure on the fault location (FlexRay part) is also assessed.

This paper is organized in six sections. Section 2 introduces the FlexRay protocol, and section 3 presents the message missing failures and error models found in this protocol. The experimental organization is given in section 4, and the results are presented in section 5. The last section concludes the work.

2. FlexRay Protocol Structure

The FlexRay protocol controller consists of six parts: controller host interface (CHI), protocol operation control (POC), coding and decoding (CODEC), media access control (MAC), frame and symbol processing (FSP) and clock synchronization process (CSP).

The CHI manages data and control flow between the host processor and the FlexRay protocol engine within each node. The CHI contains two major interface blocks: the protocol data interface and the message data interface. The protocol data interface manages all data exchange relevant to the protocol operation and the message data interface manages all data exchange relevant to the exchange of messages. The protocol data interface manages the protocol configuration data, the protocol control data, and the protocol status data. The message data interface manages the message buffers, the message buffer configuration data, the message buffer control data, and the message buffer status data. In addition, the CHI provides a set of services that define self-contained functionality that is transparent to the operation of the protocol [10].

The modes of the core parts of the protocol are controlled by the POC. Proper protocol behavior can only occur if the mode changes of the core parts are properly coordinated and synchronized. The purpose of the POC is to react to host commands and protocol conditions by triggering coherent changes to core parts in a synchronous manner, and to provide the host with the appropriate status regarding these changes [10].

The CODEC contains two sections: the coding section and the decoding section. The coding section is responsible for encoding the communication elements into a bit stream and for presenting this bit stream to the bus driver of the transmitting node for communication onto the physical media. The decoding section is responsible for receiving communication elements, assembling the bit streams and checking their correctness.

The MAC controls access to the bus. In the FlexRay protocol, media access control is based on a recurring communication cycle. Within one communication cycle, FlexRay offers the choice of two media access schemes. These are a TDMA scheme and an FTDMA scheme. The communication cycle is the fundamental element of the media access scheme within FlexRay. It contains the static segment, the dynamic segment, the symbol window and the NIT [10].

The FSP is the main processing layer between the CODEC and the CHI. This part checks the correct timing of received frames and symbols with respect to the TDMA scheme, applies further syntactical tests to received frames, and checks the semantic correctness of received frames [10].

Finally, the CSP uses a distributed clock synchronization mechanism in which each node individually synchronizes itself to the cluster by observing the timing of sync frames transmitted by other nodes. The CSP is also responsible for generating Microticks, Macroticks and Cycles.

3. Error models and message missing failures

In this section the error models and the failure model are discussed.

3.1 Error models

The FlexRay protocol has different mechanisms for detecting errors in the controller. At the end of each time slot, the FSP part checks for the presence of any error in that slot and informs the host about it. The protocol defines 3 main errors that can occur in each slot: syntax error, content error and boundary violation error. The syntax error denotes the presence of a syntactic error in a time slot, the content error denotes the presence of an error in the content of a received frame, and the boundary violation error denotes whether a boundary violation occurred at the boundary of the corresponding slot.

3.2 Message missing failures

Faults can affect the correct functionality of a controller system and can result in destruction of the controlled system. Depending on when and where they occur, faults may cause different failures such as message missing, babbling failures and masquerade failures. Faults can disturb the sending or receiving of messages at a node and cause a message not to be sent or received (message missing failure). In this paper the message missing failures are considered in two aspects:

1- Because of a fault in the communication controller of the sender node, a message that has been prepared by the host of the sender node for sending is never sent.
2- Because of a fault in the communication controller of the sender node, a message that has been prepared by the host of the sender node for sending is sent incorrectly on the network and is not accepted by the receiver nodes.

In this experiment we assumed that the host is fast enough to generate the messages for sending and to receive the messages from the communication controller. Meanwhile, the host generates exactly as many messages as there are IDs allocated to its controller. That is, the number of messages generated by the host is equal to the number of IDs allocated to the controller.

4. Experimental Organization

This section discusses the basic characteristics of the experiment.

4.1 Experimental setup

In order to perform an experiment on the FlexRay controller, a network consisting of nodes that have this controller should be set up. So, a model of the FlexRay controller has been implemented at the behavioral level according to the FlexRay protocol specification [10]. This controller has been implemented in the Verilog hardware description language and simulated with the Modelsim 6.1 simulator. The controller has been tested according to the FlexRay protocol conformance test specification [23].

The implemented controller has the usual capabilities of the FlexRay protocol, such as sending and receiving static and dynamic frames and symbols. According to the specifications in [10], this controller has six parts to perform its functions: controller host interface (CHI), protocol operation control (POC), clock synchronization process (CSP), frame and symbol processing (FSP), media access control (MAC), and coding and decoding (CODEC). In addition, instead of a real application, a data generator is implemented to generate static frames with fixed length and dynamic frames with variable length at the start of the communication cycles.

After that, a cluster consisting of 4 nodes with star topology is formed. In order to form a network with star topology, a model of the FlexRay central bus guardian (CBG) is implemented at the behavioral level according to the FlexRay central bus guardian specification [24]. Then, the four nodes are connected to this CBG to form a network with star topology. Any node is allowed to send and receive frames on the communication channel. Faults are injected in node 2 and their error propagation effects are observed in node 4. After each fault injection, the results in node 4 are saved. As discussed, each node on this network consists of 3 main parts: the host that generates the frames, an interface between host and controller, and, at the lowest level, the communication controller (CC). In this experiment, faults are injected in 5 parts of the communication controller of node 2, namely CHI, POC, CSP, MAC and CODEC. The FSP part checks the correct timing of received frames with respect to the TDMA scheme, applies further syntactical tests to received frames, and checks the semantic correctness of received frames [10]. Thus, because the FSP part does not have any role in transmitting frames and propagating errors to other nodes, there is no fault injection in the FSP part. The effects of fault injection are observed in the communication controller of node 4 by its FSP part.

4.2 Fault injection tool

The SINJECT fault injection tool [17] is used for injecting faults in nodes, collecting the results, and analyzing them.

A fault injection process usually consists of three steps:

1- When the given workload is applied, the behavior of a fault-free network is computed and stored.
2- During the second step, to consider fault effects, the given workload is applied again to the network, the fault is injected, and the behavior of the network is observed.
3- During the third step of the fault injection process, the faulty network behavior is compared with the behavior of the fault-free network, which was gathered in the first step, and the fault effects are thereby identified and saved.

5. Experimental results

As discussed, a network consisting of four nodes was set up for this experiment. The slot IDs were allocated among the different nodes in such a manner that slot IDs 3 and 5 in the static window and slot IDs 7 and 9 in the dynamic window were allocated to node 2. So, this node was sending two messages in the static window and could also send two messages in the dynamic window in the event-triggered manner. Afterwards, a total of 35680 bit-flip faults were injected in five different parts of the communication controller of node 2. These five parts were: CHI, CSP, MAC, POC, and CODEC.

Each experiment lasted for 3 communication cycles; the faults were injected in cycle 1 and their effects were observed in cycles 1 through 3. In each communication cycle, 6 slot IDs in the static window and 6 slot IDs in the dynamic window were allocated to the different nodes.

In this section the results of these experiments are evaluated. The evaluations are done in two phases. In the first phase, the error propagation in the FlexRay-based network with star topology is investigated. In the second phase, the message missing failure in this network and its relation to the error propagation is assessed.

5.1 Error propagation evaluation

The FlexRay protocol defines 3 main error models: content error, syntax error and boundary violation error. In this phase, after injecting the faults inside the communication controller of node 2, the errors that occur in node 4 are observed. Table 1 shows the results of this experiment.

Table 1. Effect of fault injection in FlexRay parts

FlexRay      No. of    Syntax Errors     Content Errors    Boundary Violation Errors
Parts        Faults    #       %         #      %          #      %
CODEC        9300      626     6.73      0      0.00       0      0.00
MAC          4100      260     6.34      0      0.00       0      0.00
CSP          12480     3955    31.69     6      0.05       22     0.18
POC          2800      17      0.61      0      0.00       0      0.00
CHI          7000      2223    31.76     0      0.00       0      0.00
All Parts    35680     7081    19.85     6      0.02       22     0.06

As this table illustrates, the content errors and boundary violation errors are rarely propagated in the network because they are eliminated in the CBG. The CBG disconnects the transmitter node when it observes one of these errors. Thus, after the transmitting node is disconnected, syntax errors occur in the receiver node instead of these errors (content errors and boundary violation errors). The CSP and CHI parts cause most syntax errors.

5.2 Message missing failures evaluation

In this phase, the message missing failure is evaluated as the result of fault injection in different parts of the FlexRay communication controller. As discussed, node 2 sends two messages in the static window and randomly sends two messages in the dynamic window. In total it sends 9 messages during the 3 communication cycles of each experiment (6 messages in the static window and 3 messages in the dynamic window).

Table 2 shows the message missing in this network. In this table, the message missing failures occur in two ways: 1) the messages that are not sent by node 2 as a result of fault injection, and 2) the messages that are not received correctly in receiver nodes. As this table shows, the unsent messages account for the larger percentage of the message missing failures.

Table 2. Total message missing failures

                       No. of experiments                Total message missings
FlexRay      No. of    including failure     Total       Invalid messages    Unsent messages
Parts        Faults    #        %            messages    #        %          #        %
CODEC        9300      1068     11.48        73700       1041     1.41       1963     2.66
MAC          4100      998      24.34        36900       379      1.03       4084     11.07
CSP          12480     7395     59.25        112320      9542     8.50       27010    24.05
POC          2800      1013     36.18        25200       16       0.06       5482     21.75
CHI          7000      3691     52.73        63000       5276     8.37       10050    15.95
All Parts    35680     14165    39.70        311120      16254    5.22       48589    15.62
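
The percentage columns of Table 2 follow directly from the raw counts: the failure percentage is taken relative to the number of injected faults, and the invalid and unsent percentages relative to the total number of messages. The short Python sketch below recomputes them as a consistency check. It is illustrative only; the dictionary layout and the helper name `rates` are assumptions introduced here and are not part of the experimental tooling.

```python
# Raw counts copied from Table 2, per FlexRay part:
# (faults, experiments including failure, total messages,
#  invalid messages, unsent messages)
TABLE2 = {
    "CODEC": (9300, 1068, 73700, 1041, 1963),
    "MAC":   (4100, 998, 36900, 379, 4084),
    "CSP":   (12480, 7395, 112320, 9542, 27010),
    "POC":   (2800, 1013, 25200, 16, 5482),
    "CHI":   (7000, 3691, 63000, 5276, 10050),
}


def rates(faults, failed, messages, invalid, unsent):
    """Return the three percentage columns of Table 2 (hypothetical helper)."""
    return (
        round(100 * failed / faults, 2),     # experiments including failure
        round(100 * invalid / messages, 2),  # invalid messages
        round(100 * unsent / messages, 2),   # unsent messages
    )


# Column-wise sums reproduce the "All Parts" row.
totals = tuple(sum(column) for column in zip(*TABLE2.values()))

for part, row in TABLE2.items():
    print(part, rates(*row))
print("All Parts", totals, rates(*totals))
```

The last line reports the overall figures, including the roughly 39% of faults that lead to a message missing failure.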

The CSP and the CHI parts cause the most message missing failures. The results for the POC part are interesting: although this part generates very few errors in Table 1, fault injection in this part causes a large number of message missing failures. Most of these failures are messages that are not sent by node 2 because of the mode changes that occur in this node after fault injection in the POC part.

Figure 1 shows the message missing failures that occur in the static window. In this window, the CSP and CHI parts lead to the most message missing failures. Figure 2 shows that in the dynamic window, as in the static window, the CSP and CHI parts lead to the most message missing failures. Considering the results of these two figures, the rate of message missing failures in the static window is higher than in the dynamic window. About 22% of messages that are sent in the static window lead to message missing failures and about 18% of messages that are sent in the dynamic window lead to message missing failures.

[Bar chart: Message Missing Rate (%) per FlexRay part (CODEC, MAC, CSP, POC, CHI, All Parts); series: Invalid Messages, Unsent Messages]
Figure 1. Message missing failures that occur in static window

[Bar chart: Message Missing Rate (%) per FlexRay part (CODEC, MAC, CSP, POC, CHI, All Parts); series: Invalid Messages, Unsent Messages]
Figure 2. Message missing failures that occur in dynamic window

This vulnerability of the static messages can be analyzed from two aspects. Firstly, since the length and other parameters of the static messages are entirely fixed, any change in these parameters can lead to errors and cause the message to fail. Secondly, because of the error prevention feature of the CBG, this device disconnects the sender node after detecting content errors and boundary violation errors. Thus, a syntax error occurs in the receiver nodes and the message fails. Overall, the occurrence rate of content errors and boundary violation errors is higher in the static window than in the dynamic window, so more messages are lost in the static window.

6. Conclusions

This paper evaluated the error propagation and its effects on message missing in a FlexRay-based network with star topology. The evaluation was based on 35680 bit-flip fault injections inside different parts of the FlexRay communication controller. The evaluations were done in two phases. In the first phase, the percentages of faults resulting in three kinds of errors, namely, content errors, syntax errors and boundary violation errors, were characterized. Then, in the second phase, by considering the error propagation results, the message missing failures were evaluated. In this phase the relationship between the error propagation and the message missing failure results was analyzed. Also, the message missing failure rates that occurred in the time-triggered and event-triggered windows of the FlexRay communication cycle were assessed. About 22% of messages that were sent in the static window led to message missing failures and about 18% of messages that were sent in the dynamic window led to message missing failures.

References

[1] H. Kopetz, “A Comparison of CAN and TTP”, Vienna University of Technology, Real-Time System Group, Research Report 23/1998.
[2] K. Hoyme, and K. Driscoll, “SAFEbus”, The IEEE Aerospace and Electronic Systems Magazine, vol. 8, no. 3, pp. 34-39, 1992.
[3] P. S. Miner, “Analysis of the SPIDER Fault-Tolerance Protocols”, Proc. of the 5th NASA Langley Formal Methods Workshop, 2000.
[4] H. Kopetz, and G. Bauer, “The Time-Triggered Architecture”, Proceedings of the IEEE, vol. 91, no. 1, pp. 112-126, 2003.
[5] T. Pop, P. Pop, P. Eles, Z. Peng, and A. Andrei, “Timing Analysis of the FlexRay Communication Protocol”, Proc. of the 18th Euromicro Conference on Real-Time System, pp. 203-216, July 2006.
[6] J. Berwanger, M. Peller, and R. Griessbach, “Byteflight - A New High Performance Data Bus System for Safety-Related Applications”, BMW 2000, available in http://www.byteflight.de.
[7] R. Bosch GmbH, “CAN Specification”, v2.0, 1991.
[8] Echelon, and LonWorks, “The LonTalk Protocol Specification”, available in http://www.echelon.com.
[9] Profibus International, “PROFIBUS DP Specification”, available in http://www.profibus.com.
[10] FlexRay Consortium, “FlexRay Communications System - Protocol Specification”, v2.1 Revision A, December 2005.
[11] N. Navet, Y. Song, F. Simonot-Lion, and C. Wilwert, “Trends in Automotive Communication Systems”, Proceedings of the IEEE, vol. 93, no. 6, June 2005.
[12] P. Folkesson, S. Svensson, and J. Karlsson, “A Comparison of Simulation Based and Scan Chain Implemented Fault Injection”, Proc. of the 28th International Symposium On Fault-Tolerant Computing (FTCS 28), pp. 284-293, 1998.
[13] J. Arlat, M. Aguera, L. Amat, Y. Crouzet, J. C. Fabre, J. C. Laprie, E. Martins, and D. Powell, “Fault Injection for Dependability Validation: A Methodology and Some Applications”, IEEE Trans. on Software Engineering, February 1990.
[14] H. R. Zarandi, S. G. Miremadi, and A. Ejlali, “Dependability Analysis Using a Fault Injection Tool Based on Synthesizability of HDL Models”, Proc. of the IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, pp. 485-492, Boston, 2003.
[15] K. K. Goswami, R. Iyer, and I. Young, “DEPEND: A Simulation-Based Environment for System Level Dependability Analysis”, IEEE Trans. On Computers, vol. 46, no. 1, pp. 60-74, January 1997.
[16] H. Salmani, and S. G. Miremadi, “Assessment of Message Missing Failures in CAN-based Systems”, Proc. of the Parallel and Distributed Computing and Networks, pp. 387-392, 2005.
[17] H. Salmani, and S. G. Miremadi, “Contribution of Controller Area Networks Controllers to Masquerade Failures”, Proc. of the 11th Pacific Rim International Symposium on Dependable Computing, pp. 310-316, 2005.
[18] H. Sivencrona, P. Johannessen, M. Persson, and J. Torin, “Heavy-ion Fault Injections in the Time-triggered Communication Protocol”, Proc. of the Latin American Symposium on Dependable Computing, pp. 69-80, 2003.
[19] H. Sivencrona, M. Persson, and J. Torin, “Using Heavy-Ion Fault Injection to Evaluate Fault Tolerance with Respect to Cluster Size in a Time-Triggered Communication Systems”, Proc. of the IEEE International Workshop on Design and Diagnostics of Electronic Circuits and Systems (DDECS-06), pp. 171-176, April 2003.
[20] A. Ademaj, H. Sivencrona, G. Bauer, and J. Torin, “Evaluation of Fault Handling of the Time-Triggered Architecture with Bus and Star Topology”, Proc. of the International Conference on Dependable Systems and Networks, pp. 123-133, June 2003.
[21] R. Pallierer, M. Horauer, M. Zauner, A. Steininger, E. Armengaud, and F. Rothensteiner, “A Generic Tool for Systematic Tests in Embedded Automotive Communication Systems”, Proc. of the Embedded World Conference, 2005.
[22] E. Armengaud, F. Rothensteiner, A. Steininger, and M. Horauer, “A Method for Bit Level Test and Diagnosis of Communication Services”, Proc. of the IEEE Workshop on Design & Diagnostics of Electronic Circuits & Systems, 2005.
[23] FlexRay Consortium, “FlexRay Communications System - Protocol Conformance Test Specification”, v2.1, December 2005.
[24] FlexRay Consortium, “FlexRay Communications System - Preliminary Central Bus Guardian Specification”, v2.0.9, December 2005.
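
The three-step fault injection process of Section 4.2 (fault-free reference run, faulty run, comparison) and the two kinds of message missing of Section 3.2 (unsent and invalid messages) can be illustrated with a small Python sketch. This is a toy example under stated assumptions, not the SINJECT tool and not the Verilog controller model used in the experiments: the 9-bit register, its "transmit enable" bit, and the function names `run` and `classify` are hypothetical and exist only to make the procedure concrete.

```python
from collections import Counter

SLOTS = (3, 5)   # static slot IDs allocated to the faulty node in the paper
CYCLES = 3       # each experiment lasts three communication cycles
WIDTH = 9        # hypothetical register width of the toy controller


def run(register, flip_bit=None):
    """Toy stand-in for one simulation run (not the FlexRay model).

    Bit 0 is a hypothetical 'transmit enable' flag, bits 1..7 hold the
    payload, and bit 8 is unused. If flip_bit is given, that bit is
    inverted before the run, which models a single bit-flip fault.
    Returns {(cycle, slot): payload, or None if nothing was sent}.
    """
    if flip_bit is not None:
        register ^= 1 << flip_bit
    enabled = register & 1
    payload = (register >> 1) & 0x7F
    return {(c, s): (payload if enabled else None)
            for c in range(CYCLES) for s in SLOTS}


def classify(golden, faulty):
    """Step 3: compare the faulty trace against the fault-free trace."""
    outcome = Counter()
    for key, expected in golden.items():
        seen = faulty[key]
        if seen is None:
            outcome["unsent"] += 1     # message never sent
        elif seen != expected:
            outcome["invalid"] += 1    # message sent but incorrect
        else:
            outcome["ok"] += 1         # fault had no visible effect
    return outcome


INITIAL = 0b0_1011010_1                # arbitrary fault-free register value
golden = run(INITIAL)                  # step 1: fault-free reference run
campaign = {bit: classify(golden, run(INITIAL, bit))   # steps 2 and 3
            for bit in range(WIDTH)}

for bit, outcome in campaign.items():
    print(bit, dict(outcome))
```

In this toy model a flip of the enable bit makes all six static messages unsent, a flip of a payload bit makes them invalid, and a flip of the unused bit is masked. Counting such outcomes over all injected faults yields tables of the same shape as Table 2.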