Signature-based detection of behavioural deviations in flight simulators - Experiments on FlightGear and JSBSim

Vincent Boisselle 1, Giuseppe Destefanis 2 (corresponding author), Agostino De Marco 3, Bram Adams 1

1 Computer Science, Polytechnique Montréal, Montreal, Quebec, Canada
2 Computer Science, CRIM Montreal / Brunel University London, London, United Kingdom
3 Dipartimento di Ingegneria Industriale, University of Naples Federico II, Napoli, Italy

Corresponding Author: Giuseppe Destefanis Email address: [email protected]

Flight simulators are systems composed of numerous off-the-shelf components that allow pilots and maintenance crew to prepare for common and emergency flight procedures for a given aircraft model. A simulator must follow severe safety specifications to guarantee correct behaviour, and requires an extensive series of prolonged manual tests to identify bugs or safety issues. In order to reduce the time required to test a new simulator version, this paper presents rule-based models able to automatically identify unexpected behaviour (deviations). The models represent signature trends in the behaviour of a successful simulator version, which are compared to the behaviour of a new simulator version. Empirical analysis on nine types of injected faults in the popular FlightGear and JSBSim open source simulators shows that our approach does not miss any deviating behaviour for faults that change the flight environment, and that we are able to find all the injected deviations in 4 out of 7 functional faults and 75% of the deviations in 2 other faults.


Keywords: behavioural deviation, performance regression testing, flight simulator

1. Introduction

Aircraft simulators [3] reproduce common scenarios such as "take-off", "automatic cruising" and "emergency descent" in a virtual environment [44]. The trainees then need to act according to the appropriate flight procedures and react in a timely manner to any event generated by the simulator [25]. Since a simulator is the most realistic ground-based training experience that these trainees can obtain, aircraft companies all invest significant amounts of money into simulator training hours.

To ensure that a simulator is realistic, it must undergo onerous qualification procedures before being deployed. These qualification procedures [1, 24, 39] need to demonstrate that the simulator behaviour represents an aircraft with high fidelity, including an accurate representation of performance and sound. There are four levels of training device qualification, from A to D, as described by the Federal Aviation Administration (FAA), with one hour of flight training in a level D training device being recognized as one hour of flight training in a real aircraft. Since re-qualification is mostly manual (requiring actual pilots to be booked), the total process can last from a week to several months.

The high cost of training device qualification derives partly from the fact that a simulator training device is composed of sophisticated off-the-shelf components from third-party providers (for instance, aircraft manufacturers) that need to be treated as black boxes during the integration process. Each COTS provider can use its own format to describe the interface of its component, yet the producer of the simulator needs to integrate all components in a coherent way without access to the components' internals [35, 53]. Upgrades of a single component (engine, hydraulic component, electrical system) could lead to a complete flight simulator re-qualification process.

Re-qualification is an important problem to tackle, since the process is very expensive and is one of the main sources of expenses for flight simulator companies. Upgrades, new versions, and new functionality are continuously delivered by COTS providers to flight simulator companies. The changes these components introduce can vary from different output values (different thresholds, different units of measure) to completely new interfaces. The main challenge is to avoid a complete re-qualification of the whole flight simulator by focusing the attention only on a single component or a specific set of components. Since the avionics domain lacks robust tool support for automating tasks such as analysis of COTS interactions, change propagation [27], test selection and test prioritization, even a small modification to a COTS component still requires (manual) re-qualification of the whole system [18, 31, 34, 50].

What further hampers the automation of these tests is the large amount of data gathered during the qualification tests. Despite the safety-critical nature of these tests, the test data needs to be analyzed manually, which is not only tedious but also risky, since any missed anomaly can potentially be life-threatening: incorrect behaviour of the simulator could incorrectly train pilots and crew members and cause tragic accidents [46]. Since the analysis of large amounts of data to identify trends that deviate from previous versions of a software system is common to other domains as well, such as the analysis of software performance test results, this paper adapts a state-of-the-art technique for software performance analysis [22] to automatically detect whether a new simulator version exhibits behaviour that is out of the ordinary. Instead of building a signature of normal system performance based on performance-related metrics and then comparing the signature to the metrics of a new test run to find performance deviations, our signatures are rules that describe the normal functional behaviour of a simulated aircraft observed in successful versions of a flight simulator, with violations of these rules indicating deviating behaviour of the aircraft. Instead of collecting software performance metrics such as CPU utilization, we need to collect specific flight metrics that are related to the aircraft's functional behaviour.

For example, if the correct behaviour of a particular flight scenario on a successful version of a simulator exhibits high speed when the engines run at maximal capacity, there would be a signature rule "max. engine capacity => high speed". A deviation would then be a case where, for a new version of the simulator, the aircraft suddenly has a low speed while the engines still run at maximal capacity. Of course, our approach needs to be robust against noise, where a rule is violated only during a short period.

After calibration of the approach, we perform an empirical study on the FlightGear and JSBSim [6] open source simulators at different granularity levels, in which we introduce 2 faults related to the flight environment and 7 faults related to flight behaviour, to address the following research questions:

RQ1: What is the performance of the approach for faults that change the flight environment? It is possible to have no false alarms for one environmental fault, whereas flagging all deviations for a second environmental fault is not possible without having 50% false alarms.

RQ2: What is the performance of the approach for functional faults? We are able to find all injected deviations in 4 out of 7 faults, and 75% of the deviations in 2 other faults, yet especially for JSBSim the precision using the initial thresholds is low (i.e., less than 44%).

2. Background and Related Work

Flight simulators are tested like real aircraft. The pilot performing those simulator tests takes the pilot seat and goes through a list of procedures, specified in the Acceptance Test Manual, by performing different actions with the cockpit control inputs. After each procedure, the tester manually checks the cockpit indicators and verifies whether their values correspond to the expected outcomes. If further investigations are needed, testers are able to monitor metrics within the flight simulator to collect information not available from the cockpit.

Several studies analyzed the possibility of automating tests for flight simulators. Braun et al. [9] reported on an automated fidelity test system that compares flight test results with the manual execution of flight tests in simulators. Durak et al. [19] presented a metamodel for objective flight simulator evaluation using a model-based testing approach. The authors developed an Experimental Frame Ontology (EFO), adopting experimental frames from the Discrete Event System Specification (DEVS), as an upper ontology to specify a formal structure for simulation tests. Domain-specific meta-test definitions were captured in an Objective Fidelity Evaluation Ontology. In another paper, Durak et al. [20] used a model-based testing approach for the objective fidelity evaluation of engineering and research flight simulators. Similarly, Wang et al. [48, 49] implemented an automated test system for evaluating flight simulator fidelity (determined by direct comparison to flight tests). The system was designed to automatically test the performance of flight simulators.

Although our paper does not aim to replace such time-consuming manual tests, we aim to complement them with automatic detection of behavioural deviations in between real pilot sessions. As such, for each component change the simulator can be steered automatically (instead of manually) through the required flight use cases, followed by automatic identification of deviations. Use cases with detected deviations would then require more thorough manual analysis by a human pilot, while use cases without deviations might need less thorough human intervention. Hence, we do not aim to replace manual testing, but rather to improve the prioritization of test procedures.

The detection of anomalies and behavioural deviations are problems that have been studied within different areas and domains. Hodge et al. [29] surveyed techniques for outlier detection developed in statistics and machine learning, comparing their advantages and disadvantages. The authors considered five categories, namely outlier detection, novelty detection, anomaly detection, noise detection, and deviation detection. Their study is based on the definition of an outlier by Barnett and Lewis as "an observation that appears to be inconsistent with the remainder of that set of data" [5]. Our work is closest to outlier detection, which Hodge et al. [29] define as "detection of abnormal running conditions".

Similarly, Chandola et al. [11] published a survey of research on anomaly detection. The authors grouped the reviewed techniques into 6 categories based on the approach adopted (i.e., Classification Based, Clustering Based, Nearest Neighbor Based, Statistical, Information Theoretic, Spectral). The authors also provide a discussion of the computational complexity of the selected techniques.

Aggarwal [2] presented an algorithm for supervised abnormality detection from multi-dimensional data streams. The algorithm performs statistical analysis on multiple dimensions of the data stream and is able to detect abnormalities with any amount of historical data. However, the accuracy improves with the progression of the stream (i.e., as more data is received). Our work focuses on offline analysis; however, future work could consider stream algorithms.

The problem of deviation detection in flight simulators is similar to that of regression detection on performance test results [12, 22, 32, 33, 36, 37, 52]. Performance regression detection typically requires a system to run for a long time, after which recorded logs or metrics need to be analyzed to find deviations from the recorded logs and metrics of a previous release (baseline). Such deviations correspond to performance regressions in a system. Whereas such analysis for a long time had to be done manually by human experts, several studies have used data mining approaches on the recorded data to identify important deviations and rank them according to importance, after which the human expert could then look at the most important deviations.

In particular, Foo et al. [22] used association rules to automatically detect potential regressions in a performance regression test. Their approach compares the results of a new test run to a set of rules built on successful test runs (extracted from a performance regression testing repository). If the new test run violates too many rules, it is tagged as a performance regression. The result of the approach is a regression report ranking potential problematic metrics that violate the extracted performance signatures. In other work [23], the authors extended the previous approach to take into account tests performed in a heterogeneous environment, i.e., tests executed with different hardware and/or different software.

Nguyen et al. [37] proposed to match the behaviour of previous tests, where a regression has occurred, to the behaviour of a new version. After data filtering, regression detection and regression cause identification are performed using machine learning techniques. The authors evaluated the approach on a commercial system and on an open-source system. The results showed that the machine learning approach can identify the cause of a performance regression with up to 80% accuracy using a small number of historical test runs.

Cohen et al. applied probabilistic models [13] to identify metrics and threshold values that correlate with high-level performance states. The approach automatically analyzes data using Tree-Augmented Bayesian Networks to forecast failure conditions. In other work [14], the authors presented a method based on supervised machine learning techniques for extracting signatures able to identify when a system state is similar to a previously observed state. The technique allows operators to quantify the frequency of recurrent issues.

Several works focused on mining execution logs (instead of metrics) to flag anomalies from normal behaviour. Jiang et al. [32] presented a statistical approach that analyzes the execution logs of a load test to detect performance problems. Syer et al. [47] proposed an automated statistical approach combining execution logs and performance counters to diagnose memory issues in load tests.

Beschastnikh et al. [8] developed a tool called CSight to mine logs from a system's execution in order to infer a behavioural model in the form of communicating finite state machines, from which engineers can then detect anomalies and better understand their implementation. This work is a further evolution of Daikon [21], a tool to detect relevant program invariants from system traces. Program invariants are behavioural characteristics of a system that have to persist across versions of a system, and they can be used in testing to check whether a system shows a regression. Our work differs from both lines of work in terms of the input data: instead of detailed system traces, we work on metrics that are sampled from observable component inputs and outputs. Furthermore, we build decision tree models instead of state machines.

Other studies have used data mining approaches for issue prediction in flight simulators. Hoseini et al. [30] proposed FELODE (Feature Location for Debugging) to detect configuration errors, using a semi-automated approach that combines a single execution trace and user feedback to locate the connections among modules that are the most relevant to an observed failure. The authors applied the approach to the CAE flight simulator system (CAE is one of the largest commercial flight simulator producers), achieving on average a precision of 50% and a recall of up to 100%. This paper is the closest to our work; however, we (1) focus on behavioural deviations instead of configuration errors, (2) do not consider user feedback, but aim for a fully automated approach, and (3) use specialized metrics picked using domain knowledge to infer the models.

3. Approach

This section presents our approach to automatically detect behavioural deviations in a new version of a flight simulator. As mentioned earlier, the approach is adapted from state-of-the-art machine learning approaches for detecting regressions in performance tests [22]. Fig. 1 shows an overview of our approach. The rest of this section explains the different steps of our approach, using a running example comprising a baseline (correct) version of a simulator and a deviating version.

3.1. Simulator Versions and Use Case

A baseline version of a simulator is used as a reference against which the behaviour of other versions of the system is tested. Usually, the latter versions are relatively similar to the baseline, sharing a large proportion of the source code and configuration files. For example, a baseline version could be the previous release, a release with a specific bug, or a release before a large restructuring. In our running example, the baseline version is a correctly functioning version, while the deviating version corresponds to the baseline version changed in a specific way.

Figure 1: Overview of our approach.

Having chosen two versions, we also need to identify representative use cases (i.e., execution scenarios) based on which we will compare the behaviour of both simulator versions. These use cases can correspond to the typical use cases of a simulator, like take-off, cruising and landing, or to more specific use cases prescribed by an Acceptance Testing Manual. The running example in this section uses a take-off use case.

3.2. Flight Data

During their operation, flight simulators generate vast amounts of metric and log data at a configurable frequency, containing a mixture of numeric, discrete, and boolean data. This data logs a pilot's actions as well as the simulator's reactions, and is used for certification and for legal purposes. To access this data, one typically can tap into the system's public interface using an external logger, or use the system's built-in tools if available. Metric data (which we focus on) is typically stored in CSV files with one column per metric and one row per snapshot of metric values, i.e., the logger makes a snapshot of all metric values at a given frequency. Studying textual log data (instead of metric data) is outside the scope of this paper.

The metrics recorded during execution of a flight simulator can be divided into two groups: input metrics measure characteristics of the cockpit instruments and controllers used by the pilot, while output metrics observe the resulting behaviour of the simulator. Classifying metrics as input or output cannot be done automatically, because it requires basic knowledge about the system behaviour. Internal metrics, i.e., metrics that are neither input nor output, are ignored in our study, since they are not observable externally.

In our running example (Table 1), we know that Throttle is an input metric because Throttle drives the engine power, while the Airspeed metric is an output that assesses the behaviour of the simulator. Note that we record the same set of input metrics and output metrics for both analyzed versions of the system.

Table 1: Running example's flight data for the baseline and deviating versions, with data violating the signatures highlighted in blue and red, respectively.

Throttle (input) | Discr. Throttle | Airspeed (baseline) | Discr. Airspeed (baseline) | Airspeed (deviating) | Discr. Airspeed (deviating)
0.0 | low | 0 | low | 0 | low
0.1 | low | 18 | low | 18 | low
0.2 | low | 36 | low | 36 | low
0.3 | low | 54 | low | 54 | low
0.4 | low | 72 | low | 72 | low
0.5 | medium | 90 | medium | 76 | low
0.6 | medium | 108 | medium | 80 | low
0.7 | medium | 126 | medium | 90 | medium
1.0 | high | 144 | medium | 108 | medium
1.0 | high | 162 | medium | 126 | medium
1.0 | high | 180 | high | 144 | medium
1.0 | high | 180 | high | 162 | medium
1.0 | high | 180 | high | 180 | high
1.0 | high | 180 | high | 180 | high
1.0 | high | 180 | high | 180 | high
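To make the data handling concrete, the following Python sketch loads such a CSV log and separates input from output metrics. This is an illustrative sketch rather than the tooling used in the paper; the file name and the metric lists are hypothetical placeholders that would have to be replaced by the metrics of the simulator under study.

```python
import csv

# Hypothetical metric lists: in practice, domain experts classify each
# logged metric as a cockpit input or an observable output (Section 3.2).
INPUT_METRICS = ["throttle"]
OUTPUT_METRICS = ["airspeed"]

def load_flight_data(csv_path):
    """Read a flight log (one column per metric, one row per snapshot)
    and return parallel lists of input and output snapshots."""
    inputs, outputs = [], []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            inputs.append({m: float(row[m]) for m in INPUT_METRICS})
            outputs.append({m: float(row[m]) for m in OUTPUT_METRICS})
    return inputs, outputs

# Example (file name is a placeholder):
# baseline_in, baseline_out = load_flight_data("baseline_landing.csv")
```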

3.3. Discretization

Once we have flight data collected from possibly multiple use case runs of both versions of the system, we need to prepare it for the machine learning techniques used in later steps. In particular, we discretize the continuous metric data (floating point numbers) into a limited number of discrete values such as "low", "medium", and "high". The idea is that a speed of 50 miles per hour is not that different from 51 miles per hour, while it is substantially faster than 25 miles per hour and blazingly faster than 3 miles per hour. For this reason, assuming we wish to discretize into three categories, the discretization step could convert the first two values to the same discrete value ("high"), while the third value could be mapped to "medium" and the fourth one to "low".

To discretize numeric data, we used the popular equal frequency bin discretization algorithm [51], which clusters the data into K bins, each of which ideally contains 1/K of the data. Basically, this algorithm first sorts the data from small to large values, then splits it as evenly as possible into K equally sized bins. In our running example, the airspeed output of the baseline version takes on the values [0, 18, 36, 54, 72, 90, 108, 126, 144, 162, 180, 180, 180, 180, 180], which the equal frequency algorithm set with K=3 bins would split into the bins ]-inf, 81], ]81, 171] and ]171, inf[, which can be labeled as "low", "medium" and "high". Table 1 shows the resulting discretized values. Note that the algorithm puts all instances with the same value into the same discretized bin, even if that results in a bin having more than 1/K of the data, which is what happens for the Throttle value "1.0" in Table 1. Popular values of K are K=3 (low/medium/high) and K=5 (very low/low/medium/high/very high), since those are easy to interpret by humans [51].
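The sketch below illustrates this step with a minimal equal-frequency binning for K=3 in Python. It is an illustration, not the Weka discretizer actually used in our experiments: its cut points are taken at data values rather than bin midpoints, so they may differ slightly from the bin boundaries quoted above, although the resulting labels for the baseline values of the running example coincide.

```python
def equal_frequency_discretizer(values, k=3, labels=("low", "medium", "high")):
    """Derive K bins so that each ideally holds 1/K of the sorted data and
    return a function mapping a raw value to its bin label. Assignment is
    value-based, so identical values always share a bin."""
    assert k == len(labels)
    ordered = sorted(values)
    n = len(ordered)
    # Upper edge of bin i is the value closing the i-th chunk of ~n/k items.
    edges = [ordered[min(n - 1, (i + 1) * n // k - 1)] for i in range(k - 1)]

    def discretize(value):
        for edge, label in zip(edges, labels):
            if value <= edge:
                return label
        return labels[-1]

    return discretize

# Baseline airspeed values of the running example:
speeds = [0, 18, 36, 54, 72, 90, 108, 126, 144, 162, 180, 180, 180, 180, 180]
to_label = equal_frequency_discretizer(speeds)
print([to_label(v) for v in speeds])
# five 'low', five 'medium', five 'high' labels, as in Table 1
```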

3.4. Building Signatures

The core of our approach is the learning of models ("signatures") that represent the essence of the behaviour of a baseline simulator version for a particular use case and output metric. In principle, we can use any kind of machine learning technique for this, by using the input metrics as independent variables and one of the output metrics as dependent variable. We have a preference for decision tree models because they provide a graphical abstraction that is easy to interpret, even for people without a machine learning background [51]. In addition, decision trees can easily be transformed into a "rule" form that is straightforward to check automatically on new data.

Fig. 2 illustrates this on a decision tree signature for our running example, with throttle as input and airspeed as output. Each non-leaf node (ellipse) of the tree corresponds to a test in terms of one of the input metrics (here there is only one input metric), while the leaf nodes (rectangles) correspond to the value of the output metric suggested by the model based on the tests. Here, the model states that Low Throttle corresponds to Low Airspeed, Medium Throttle corresponds to Medium Airspeed, and High Throttle corresponds to High Airspeed.

Figure 2: Signature decision tree for our running example.

We can summarize this tree into the three rules "Low Throttle ⇒ Low Airspeed", "Medium Throttle ⇒ Medium Airspeed", and "High Throttle ⇒ High Airspeed". Trees can have an arbitrary number of test nodes, in which case the rules become more complex, for example "High A and Medium B and High C ⇒ High D".

Standard algorithms for building decision trees are C4.5 and C5.0. They typically try to reduce the depth of a tree, since more complex models tend to overfit a given data set and hence would not generalize to other cases (simulator versions), rendering them ineffective [51]. This means that in practice the produced models for a given data set will never be 100% accurate. Indeed, if the algorithm sees that 95% of the time x=4 yields y=5, it would probably generate the rule "x=4 ⇒ y=5", even though 5% of the data would be classified incorrectly. Such rule violations can also occur when too few or too many bins are used to discretize the data, accidentally putting data points into the wrong bin. Table 1 shows that while transitioning from 0.0 Throttle to 1.0 (blue cells), the airspeed is still considered as medium in the baseline version (despite high throttle), due to the limitations of the discretization used.
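The signatures in this paper are built with C4.5/C5.0-style trees in Weka; purely as an illustration of the same idea, the sketch below trains a small decision tree on the discretized running example using scikit-learn (an assumed substitute library, not the authors' toolchain) and prints it in rule-like form.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Discretized running example (Table 1, baseline version): one input
# metric (Throttle) and one output metric (Airspeed).
LEVELS = {"low": 0, "medium": 1, "high": 2}
throttle = ["low"] * 5 + ["medium"] * 3 + ["high"] * 7
airspeed = ["low"] * 5 + ["medium"] * 5 + ["high"] * 5

X = [[LEVELS[t]] for t in throttle]   # independent variables (inputs)
y = [LEVELS[a] for a in airspeed]     # dependent variable (one output)

# A shallow tree keeps the signature simple and limits overfitting, in the
# spirit of C4.5/C5.0 pruning (this is not the Weka implementation).
signature = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(signature, feature_names=["Throttle"]))
```

Each root-to-leaf path of the printed tree corresponds to one signature rule, such as "Throttle = high ⇒ Airspeed = high".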

3.5. Evaluation of Signature Deviations

In order to determine whether a new simulator version contains behavioural deviations for a given use case, we need to measure the degree to which the new version's metrics deviate from the baseline version's signatures (for each output metric). For this, we calculate for each signature the proportion of incorrect classifications, yielding the Version Signature Deviation (VSD) metric in Formula 1:

VSD = 1 − (# correctly classified snapshots / # classified snapshots)    (1)

In this formula, a "snapshot" corresponds to the metric values recorded at a specific timestamp (i.e., one row in Table 1). The more snapshots are classified correctly, the lower the VSD value. However, we cannot directly use the VSD to determine behavioural deviations, since, as explained in the previous section, signatures are never 100% accurate (to avoid overfitting). This means that even the baseline version of the simulator has a VSD higher than zero. For this reason, our approach uses the following Relative Signature Deviation (RSD) metric:

RSD = VSD_{New Version} − VSD_{Baseline Version}    (2)

For our running example, Table 1 highlights in blue or red the airspeed snapshots that do not match the signature in the baseline and deviating versions of the simulator. In both versions, the coloured snapshots follow the rule "throttle=high ⇒ airspeed=medium" rather than "throttle=high ⇒ airspeed=high", and the deviating version's snapshots also follow the rule "throttle=medium ⇒ airspeed=low" rather than "throttle=medium ⇒ airspeed=medium". For the baseline version, 13 rows matched the signature and 2 rows did not, hence we obtain a VSD of 1 − 13/15 = 13%, which corresponds to VSD_{Baseline Version} in Formula 2. Applying the same technique to the deviating version, which has 9 rows that match the signature out of a total of 15, yields a VSD of 1 − 9/15 = 40%. Using Formula 2, we obtain an RSD of 40% − 13% = 27% for the airspeed output metric.
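A direct transcription of Formulas 1 and 2 might look as follows (a sketch; `signature` stands for any callable that predicts the discretized output label from a discretized input snapshot, such as the rule set or decision tree above).

```python
def vsd(signature, snapshots):
    """Version Signature Deviation (Formula 1): fraction of snapshots whose
    observed output label differs from the label predicted by the signature.
    Each snapshot is a pair (input_labels, observed_output_label)."""
    correct = sum(1 for inputs, observed in snapshots
                  if signature(inputs) == observed)
    return 1.0 - correct / len(snapshots)

def rsd(signature, new_snapshots, baseline_snapshots):
    """Relative Signature Deviation (Formula 2)."""
    return vsd(signature, new_snapshots) - vsd(signature, baseline_snapshots)

# Running example: 13 of 15 baseline snapshots and 9 of 15 deviating
# snapshots match the signature, giving VSDs of 13% and 40% and an RSD of 27%.
```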

3.6. Detecting Behavioural Deviations

Finally, all that remains is to decide which RSD values are high enough to indicate a behavioural deviation instead of just measurement noise. Hence, our approach uses thresholds to flag behavioural deviations. If the RSD for a given output metric is larger than the threshold, we consider the deviation too large and hence indicative of a behavioural deviation for that metric. Since each output metric has its own signature, we need to compute and compare each output metric's RSD value to its corresponding threshold. If the RSD of an output metric exceeds its threshold, this means that the simulator version shows behavioural deviations for at least that output metric.

In our running example, if we use a deviation detection threshold of 10%, the airspeed output in the deviating version would be flagged as a deviation because its RSD of 27% is above that threshold.
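Putting the pieces together, the flagging step reduces to a simple comparison per output metric; in the sketch below, `thresholds` maps each output metric to its calibrated RSD threshold (the 10% value is only the running example's threshold, not a recommended default).

```python
def flag_deviations(rsd_per_metric, thresholds):
    """Return the output metrics whose RSD exceeds their threshold, i.e. the
    metrics for which the new version shows a behavioural deviation."""
    return [metric for metric, value in rsd_per_metric.items()
            if value > thresholds[metric]]

# Running example: an RSD of 27% for airspeed against a 10% threshold.
print(flag_deviations({"airspeed": 0.27}, {"airspeed": 0.10}))  # ['airspeed']
```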


4. Experimental Setup

This section explains the organization of the experiments that we performed to evaluate how well the approach presented in the previous section is able to identify behavioural deviations in a flight simulator. We provide all relevant information as required by the guidelines of Runeson et al. [43].

4.1. Objective

The objective of our experiments is to evaluate the degree to which our approach is able to identify behavioural deviations in real flight simulators. Furthermore, we aim to perform this evaluation at two different levels: system-level and component-level. System-level evaluation considers the behaviour of the complete simulator, observing the behaviour (i.e., metric data) of all its components, from cockpit, over simulation engine, to specific subsystems like the hydraulic components. Component-level evaluation focuses on one specific component, considering only execution data from that component. Through this system- and component-level evaluation, we aim to address the following research questions:

• RQ1: What is the performance of the approach for faults that change the flight environment?

• RQ2: What is the performance of the approach for functional faults?

4.2. Subject Systems and Use Case Studied

Since we did not have access to commercial flight simulator data, we searched for realistic open source flight simulators. One of the most popular ones is FlightGear, which we studied at the system level as well as at the component level (focusing on one of its internal simulation engines, i.e., JSBSim). Here, we discuss both FlightGear (system-level) and JSBSim (component-level).

4.2.1. FlightGear

The FlightGear flight simulator [38] is a sophisticated open source framework for flight simulation. The project was started in April 1996 and its first multi-platform release followed in July 1997. Since then, the project has expanded, and by now it offers a broad range of configurable features to recreate flight environments similar to reality. It is possible to take off from several airports located all around the world, change the weather conditions by generating storms, and much more. This allows us to use realistic flight scenarios.

SimGear is FlightGear's default simulation engine, and it is used both for research purposes and as a stand-alone flight simulator [45, 41, 40]. FlightGear exposes the internal state of the SimGear simulation through a property database that dynamically maps a value (such as speed or latitude) onto an object with getter and setter methods. In our study, we used FlightGear version 3.2 on a recent computer with high performance video cards (CPU: Intel i5-2500K, RAM: 8GB, OS: Windows 7) to perform smooth virtual flights at the maximum frame rate.

For our evaluation, we focused on one use case in which we manually piloted and recorded flight data with version 3.2 of FlightGear. This use case consists of a normal landing maneuver using the default FlightGear configuration, which uses a Boeing 700-300 aircraft and no extra environmental conditions. We opted for this use case as landing (just like take-off) is one of the most critical moments of a flight, while the Boeing 700-300 is one of the most popular flight models (and is also the default flight model used in FlightGear). To pilot the aircraft, we used a FlightGear tutorial¹ and we also received advice from experts of a major flight simulator company.

¹ http://www.flightgear.org/Docs/getstart/getstartch6.html
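As an aside, this property database can also be sampled at run time by an external logger. The sketch below polls one property over FlightGear's telnet/props interface; this is an assumption about one possible setup (the simulator would need to be started with a property server, e.g. --telnet=5401, and the property path and reply format should be checked against the FlightGear version in use), whereas the experiments in this paper rely on the built-in data collection described in Section 4.4.

```python
import socket
import time

def poll_property(path, host="localhost", port=5401, samples=5, period=1.0):
    """Send 'get <path>' requests to a FlightGear property server and
    return the raw reply lines (the reply format varies across versions)."""
    replies = []
    with socket.create_connection((host, port), timeout=5) as sock:
        stream = sock.makefile("rw", encoding="ascii", newline="")
        for _ in range(samples):
            stream.write("get %s\r\n" % path)
            stream.flush()
            replies.append(stream.readline().strip())
            time.sleep(period)
    return replies

# Example usage (property path shown for illustration only):
# print(poll_property("/velocities/airspeed-kt"))
```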

4.2.2. JSBSim

JSBSim is an open source flight simulator engine conceived in 1996 as a batch application that aims at modeling flight dynamics and control for aircraft. It has already been used as a simulation tool by major industry organizations (e.g., NASA) and academia [7, 15, 16, 17] (e.g., the University of Naples Federico II, Italy, and RWTH Aachen University, Germany). An end user can build her own simulation components without touching the binary code (from a black box perspective) by editing XML files that describe aircraft, engines, flight scripts, etc. It is possible to store the performance and dynamics of a flight in data files (usually in CSV format) that can be used for further analysis and studies. JSBSim is also integrated into other flight simulators (e.g., FlightGear, Outerra², OpenEaagles³), which add features such as full 3D rendering of a flight and various sound effects.

² http://www.outerra.com
³ http://www.openeaagles.org

JSBSim can be scripted to automatically execute flight runs using an XML file that specifies navigation events, where each event can be triggered on a certain condition (e.g., triggering an event if the altitude reaches 10,000 ft). When triggered, an event executes a series of defined actions that set properties of the simulator using a lookup table, a static value or a function. It also prints defined notifications giving feedback to the end user about the performed actions. Properties set by an event can be any independent simulation property, such as environmental properties (e.g., wind), flight dynamic coefficients, control interfaces (e.g., a switch), and so on. JSBSim properties can be set/read through a single interface via the Property Manager system, allowing interaction with the system using various communication protocols (e.g., UDP), script files, specification files, and the command line. Properties are categorized using a tree-like structure similar to the Unix file system, composed of a root node, sub-nodes (like directories) and end nodes (properties). Properties can be static, or a function applying algebra on other static properties to generate an output value. We used JSBSim 0.1.2 in our experiments.

4.3. Experimental Method: Injecting Faults

Similar to existing deviation detection approaches, an organization wanting to adopt our approach needs to have a set of simulator versions that are known to be "correct" (i.e., that passed qualification testing) to calibrate and build models on. For example, Herzig et al. [28] used a historical set of recorded test executions, mapped to different versions of the analyzed software, to collect the data needed to simulate the behaviour of their test selection strategy. Since no databases exist with execution data of the two subject systems, and hence no oracles, we instead inject faults into the systems for which we know which metrics should show a behavioural deviation [4]. We then use our approach to compare the version with the fault (deviating version) to the version without the fault (baseline version). Given that FlightGear is a full flight simulator system, while JSBSim is only a simulation engine (e.g., it has no graphical user interface), we injected different faults in each. This section explains the injected faults.

481 4.3.1. FlightGear 482 To determine which types of faults to inject, it is important to understand 483 how FlightGear works. At run-time, FlightGear loads an aircraft model of 484 the flight behaviour, a configuration for the flight environment (e.g., wind 485 turbulence), and an input file containing the pilot’s cockpit operations in

15

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.2670v1 | CC BY 4.0 Open Access | rec: 23 Dec 2016, publ: 23 Dec 2016 Table 2: Injected Faults for FlightGear.

Fault Name Type Description Behaviour Weak Wind Weather Wind at Normal 5 knots 25 deg. Strong Wind Weather Wind at Deviations 25 knots 25 deg. Flaps Fault Functional Flaps Deviations disabled Auto Brake Functional Auto brake Deviations Fault disabled Right Engine Functional Failure of Deviations Fault right engine

486 case of an automatic flight. Hence, we decided to play with these elements 487 to generate new simulator versions with known faults. 488 Since we are performing the same use case on all versions of the sim- 489 ulator, the input file needs to be identical for all versions, modulo random 490 noise to simulate realistic, but harmless timing differences between the pilot’s 491 operations. These differences make the flight data more realistic, as human 492 test runs for the same qualification test will never be identical. As such, 493 for our experiments, we can make changes either to the flight environment 494 (e.g., different default settings for wind) or to the aircraft model flight be- 495 haviour (e.g., a coding error in the flight model), obtaining a version of the 496 simulator for which we expect clearly different behaviour than the baseline 497 version. This more liberal definition of “fault” allows us to experiment not 498 only with incorrect behaviour, but with different behaviour in general. The 499 resulting faults in our experiments were obtained through discussions with 500 experts from a commercial flight simulator company, with which the authors 501 are collaborating in the context of the FRQNT ACACIA project. 502 Table 2 summarizes the five faults that we injected into FlightGear. 503 Note that although some of these faults seem easy to spot manually on 504 small data sets if one knows what behavioural deviations to look for, the 505 challenge addressed by our approach is to spot those deviations on large 506 data sets, without prior knowledge about what deviations are present. We 507 now discuss these five faults.


Weak/Strong Wind. Weather conditions, such as a strong wind, add stress to the aircraft during the flight and can sometimes cause accidents in critical flight phases like the landing phase. Therefore, we injected two wind faults with different magnitudes, one introducing a weak tail wind of 5 knots (10 km/h) at 25 degrees clockwise, and one introducing a strong tail wind of 25 knots (50 km/h) at the same angle. A strong tailwind is well recognized to be dangerous while landing or taking off, since it reduces the lift and increases speed⁴. On the other hand, the weak tail wind should be considered as normal, because it does not cause major deviations. Hence, the weak wind fault is a non-deviating case that we added to our set of faults to evaluate the false alarms generated by our approach.

⁴ https://www.tc.gc.ca/eng/civilaviation/publications/tp185-1-05-614-2901.htm

Flaps Fault. Flaps are an essential aircraft wing component that, when extended, increase the camber of the wing and raise the maximum lift coefficient. This coefficient has to be increased to perform a take-off operation, and decreased for landing. Many accidents are caused by flaps not being deployed during the take-off or landing phases [46]. The Flaps Fault has all flaps disabled, i.e., when a pilot activates the flaps, nothing happens.

Auto Brake Fault. Auto brakes are automatic wheel-based hydraulic brake systems used during the take-off and landing phases. There are known cases where these brakes have caused accidents (especially during the landing phase)⁵. We injected an Auto Brake Fault in which the auto brake does not work during the landing phase.

⁵ http://www.fss.aero/accident-reports/look.php?reports_key=488

Right Engine Fault. Engines are the propulsion systems of the aircraft, and there are multiple known cases where a problem with the engines caused accidents [10]. We injected a fault where the right engine does not work during the landing phase.

4.3.2. JSBSim

Given that JSBSim is a flight simulator component instead of a full simulator system, we could not inject environment-related faults (RQ1). Instead, we only injected faults in the aircraft model, which models the aircraft's behaviour. Hence, we developed different mutated versions of the c310.xml aircraft model (/aircraft/c310/c310.xml), on which we performed multiple executions of the flight scenario encoded in c3104.xml (/script/c3104.xml) to generate flight data. The c310.xml aircraft model represents the Cessna C310, which is a small recreative/military aircraft developed by Cessna in the 1960's (see the specifications in Table 3).

Table 3: C310 specifications (short form)

Specification | Value
No. Cylinders | 6
HP | 260
Fuel Capacity | 102 gallons
Max Takeoff Weight | 4,830 lbs.
Max Landing Weight | 4,600 lbs.
Standard Empty Weight | 3,125 lbs.
Max. Useful Load | 1,750 lbs.

The c3104.xml flight script specifies a small round-trip scenario where the C310 plane takes off from Ellington airport in Texas, cruises around until reaching a given waypoint, then lands again at Ellington airport.

To inject the faults in the Cessna aircraft model, we used the help of the third author, who is one of the original developers of JSBSim [15]. This resulted in three different faults, which we applied individually and all together, yielding four different fault configurations. We discuss the three individual faults below.

Inertia Fault. The moment of inertia measures the inertia of a rigid body when its rotation speed changes, and determines the torque needed for a desired angular acceleration about a certain rotational axis. It depends on the body's mass distribution and the axis chosen, with larger moments requiring more torque to change the body's rotation rate⁶. For an aircraft, considering the pitch axis for instance, the larger the moment of inertia with respect to the pitching axis y, usually named Iyy, the stronger the inertial couple counteracting pitch accelerations q̇. We injected a typical uncertainty of the aircraft moments of inertia, e.g., Iyy, and also added a nonzero Ixz (product of inertia). The latter is often set to zero when the mass layout is not accurately known, assuming that the aircraft reference axes are central axes of inertia.

⁶ https://en.wikipedia.org/wiki/Moment_of_inertia

Lift Fault. The lift coefficient CL is a dimensionless coefficient that relates the lift generated by a lifting body to the fluid density around the body, the fluid velocity and an associated reference area⁷. We modified the lift coefficient's dependency on the angle of attack (AoA), adding an offset to the function "aero/coefficient/CLo". This can be seen as an uncertainty on the aerodynamic lift coefficient at zero AoA.

⁷ https://en.wikipedia.org/wiki/Lift_coefficient

Pitch Fault. An aircraft develops an aerodynamic pitching moment⁸ with respect to the center of gravity. The dependency of the aerodynamic pitching moment on the AoA, and consequently on the lift coefficient, is of great importance for the vehicle's rotational equilibrium (ability to be "trimmed" in pitch) and stability (ability to return to equilibrium flight conditions after external perturbations). We have modified the pitching moment coefficient curve versus AoA. The original dependence is linear in AoA, given by the sum:

Cm = Cm0 + Cmα α

This is a typical linear expression, where α is the AoA and Cmα is called the "longitudinal static stability". The above function is acceptable when

CL ≈ CL0 + CLα α

where CLα is known as the "lift curve slope". The linear dependency of CL on α is accurate for angles of attack smaller than the so-called "alpha-star", i.e., a certain value α⋆ for a given aircraft. For α > α⋆ the CL varies non-linearly and reaches a maximum at the so-called "α of stall" (α_stall > α⋆). This behaviour is fairly well modelled in JSBSim in terms of lift, but not in terms of pitching moment. We have introduced a nonlinear variation of Cm that typically occurs for α > α⋆ and matches the nonlinear behaviour of CL.

⁸ https://en.wikipedia.org/wiki/Pitching_moment
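To make the shape of this modification concrete, the sketch below gives one possible piecewise model of Cm(α): linear up to α⋆, with an added nonlinear (here quadratic) contribution beyond it. The coefficient values and the quadratic form are illustrative assumptions only, not the actual values edited in the c310 model.

```python
def pitching_moment_coefficient(alpha_deg, cm0=0.05, cm_alpha=-0.02,
                                alpha_star_deg=12.0, k_nl=-0.004):
    """Illustrative Cm(alpha): linear below alpha*, with an extra quadratic
    contribution beyond alpha* (all coefficient values are made up)."""
    cm = cm0 + cm_alpha * alpha_deg
    if alpha_deg > alpha_star_deg:
        cm += k_nl * (alpha_deg - alpha_star_deg) ** 2
    return cm

# The injected fault replaces a purely linear Cm(alpha) curve by a curve of
# this kind, so that Cm departs from linearity for alpha > alpha*.
for alpha in (0, 8, 12, 16, 20):
    print(alpha, round(pitching_moment_coefficient(alpha), 4))
```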

4.4. Experimental Method

This section explains how we implemented each step of our approach for our experiments. The next section then explains how we determined the RSD threshold used to decide whether or not a given signature deviation is normal.

Out of all metrics available in the simulator, we retained all metrics linked to manipulations of cockpit controllers as input metrics, while metrics linked to cockpit-observable data are used as output metrics. We considered 25 input metrics and 26 output metrics for FlightGear (a summary is presented in Table 4).

Table 4: FlightGear Output metrics summary.

# Output metrics | Category
3 | Position
4 | Orientation
8 | Velocity
6 | Acceleration
5 | Surface Position

We selected these metrics based on the importance that they have from a pilot's perspective for assessing the behaviour of the aircraft and controlling it. Input metrics cover the data of all controls available from the cockpit, such as throttle, elevators, ailerons, landing gears, auto brake, flaps, autopilot controls and aircraft control switches. Output metrics cover the data produced by cockpit instruments like the airspeed indicator, attitude indicator, altimeter, turn coordinator, directional gyro, and vertical speed indicator. We first selected output metrics that are displayed by these instruments, such as Airspeed, Roll, Pitch, Altitude, Heading, and Vertical Speed, then selected additional output metrics, not necessarily visible from the cockpit, that are often used by analysts to interpret a flight's behaviour.

We generated the metric data for each faulty version by running the version according to the use case at hand, using FlightGear's or JSBSim's built-in data collection tool configured with a sampling rate of one sample per second, which gives enough data without taking up too much disk space. We discretized the data into a maximal number of K=3 bins (labeled "low", "medium", and "high"), which is a number of bins that can be easily interpreted by a human. We used the Weka data mining tool [26] to generate our signatures and R scripts [42] for the analysis of deviations.

To evaluate how well the approach works for the two research questions, we need to know each fault's oracle, i.e., the metrics that really are deviating in those scenarios and hence should be flagged by our approach. Since we injected the faults ourselves, with help from experts of a commercial flight simulator company as well as from the third author, we could obtain the FlightGear and JSBSim oracles of Table 5 and Table 6, respectively.

Table 5: FlightGear oracles for each fault.

Fault | Deviating Flight Metric Categories
Weak Wind | Nothing
Strong Wind | Velocity and Position
Flaps Fault | Velocity and Orientation
Auto Brake Fault | Acceleration, Position, Orientation and Velocity
Right Engine Fault | Acceleration, Velocity and Orientation

Table 6: JSBSim oracle for metric deviations in the output metrics for each injected fault, where “X” means that there is a deviation. The category to which each output metric belongs is shown in the Category column, where “P” means Position, “O” Orientation, “V” Velocity, and “A” Acceleration.

Output Metric | Total Inertia Lift Pitch | Category
fdm//position/h-sl-ft | X X X | P
fdm/jsbsim/position/lat-geod-deg | X | P
fdm/jsbsim/altitude/pitch-rad | X | O
fdm/jsbsim/altitude/roll-rad | X | O
fdm/jsbsim/attitude/theta-rad | X | O
fdm/jsbsim/velocities/u-fps | X X X X | V
fdm/jsbsim/velocities/v-fps | X X X X | V
fdm/jsbsim/velocities/w-fps | X X | V
fdm/jsbsim/velocities/p-rad-sec | X X X | V
fdm/jsbsim/velocities/q-rad-sec | X X | V
fdm/jsbsim/velocities/r-rad-sec | X | V
fdm/jsbsim/accelerations/pdot-rad-sec2 | X X X | A
fdm/jsbsim/accelerations/qdot-rad-sec2 | X X | A
fdm/jsbsim/accelerations/rdot-rad-sec2 | X | A
fdm/jsbsim/accelerations/vdot-ft-sec2 | X X X X | A


For each output metric, we also mention the metric's general category, since many metrics measure related concepts and experts tended to think in terms of metric category deviations rather than individual output metric deviations. Based on these oracles, we then calculate the precision, recall and false alarm rate of our approach in terms of metric categories (not individual metrics). Precision corresponds to the percentage of correctly flagged deviations (i.e., absence of false alarms), as shown in Formula 3. Recall is the percentage of all deviations in the oracle that the approach is able to identify, cf. Formula 4. Finally, for faulty versions without deviations (like the Weak Wind fault), we use the False Alarm Rate in Formula 5 to measure how well the approach avoids false alarms.

Precision = # correctly identified deviation categories / # identified deviation categories    (3)

Recall = # correctly identified deviation categories / # deviation categories    (4)

False Alarm Rate = # incorrectly identified deviation categories / # non-deviation categories    (5)

In the running example, the Airspeed metric belongs to the Speed category. If this metric's category were part of the oracle for the running example, our approach would have 100% recall and precision, because it flags this output metric category.
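For completeness, Formulas 3-5 translate directly into code at the level of metric categories (a sketch; the category values used in the example are placeholders):

```python
def evaluate(flagged, oracle, all_categories):
    """Precision, recall and false alarm rate over metric categories
    (Formulas 3, 4 and 5); flagged and oracle are sets of category names."""
    true_pos = flagged & oracle
    false_pos = flagged - oracle
    non_deviating = all_categories - oracle
    precision = len(true_pos) / len(flagged) if flagged else 1.0
    recall = len(true_pos) / len(oracle) if oracle else 1.0
    false_alarm = len(false_pos) / len(non_deviating) if non_deviating else 0.0
    return precision, recall, false_alarm

# Placeholder example for one faulty version:
categories = {"Position", "Orientation", "Velocity", "Acceleration",
              "Surface Position"}
print(evaluate(flagged={"Velocity", "Position"},
               oracle={"Velocity", "Position"},
               all_categories=categories))  # (1.0, 1.0, 0.0)
```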

4.5. Calibration

Ideally, we aim to have both high precision and high recall, while for the false alarm rate lower values are better. In practice, each performance metric goes at the expense of the others, hence a trade-off is necessary. Given the safety-critical nature of flight simulators, we favour recall over precision and false alarm rate. The most direct way to control this trade-off is via the RSD threshold. The lower this threshold, the more output metrics will be flagged as deviating, which will increase recall and likely reduce precision.

To determine the threshold value to use, we use two approaches, resulting in an "initial threshold" and an "optimal threshold". The initial threshold is suitable when one starts to adopt the proposed approach (since all one needs are one or more correct simulator versions), or when the new simulator version is too different from the previous one. The optimal threshold, on the other hand, exploits the results of previous runs of the approach to find the best possible threshold based on historical results. Section 6 illustrates in more detail how one can derive the optimal threshold for FlightGear, while this subsection explains how we determine the initial threshold for FlightGear and JSBSim.

The initial threshold should be robust enough to tolerate slight variations in the output metrics of the baseline version of the simulator (i.e., the version with known correct behaviour) without generating any false alarm. Such variations could be due to random noise caused by timing differences between the operations of a pilot, or by non-deterministic scheduling of the simulator process by the operating system.

To perform the calibration of the initial threshold, we executed different baseline runs of FlightGear and JSBSim in which we gradually added extra weight or light amounts of wind, both of which stay within the aircraft's tolerances. We then determine a threshold for the RSD (Formula 2) that is able to tolerate these non-harmful differences by not flagging false deviations. We played with two ideas to determine the threshold:

1. using the lowest threshold that avoids flagging false deviations (via the false alarm rate of Formula 5);
2. using the median RSD across the baseline runs.

We found that for FlightGear the first approach worked well (in terms of precision/recall for RQ1/RQ2), while for JSBSim the second (simpler) approach worked well (RQ2).

For FlightGear, Fig. 3 (top section) shows, for each output metric category, a boxplot of the RSD values across the four baseline runs. Such a boxplot shows the minimum (lowest whisker), 25th percentile, median (line inside the box), 75th percentile and maximum (highest whisker). To find the lowest threshold that avoids flagging false deviations, we use the false alarm rate of Formula 5, whose results are shown in Table 7.

There are 2 distinct clusters of flight metrics, which means that some output metrics contain much more noise than others. The first cluster (Cluster 1) contains the majority of flight metrics, except for "left-aileron-pos-norm" and "right-aileron-pos-norm" (i.e., the two blue boxplots), both of which compose the second cluster (Cluster 2).


Table 7: False alarm rates for threshold calibration and RQ1 (FlightGear).

Fault             Initial False Alarm Rate   Optimal False Alarm Rate
Baseline Flight   0%                         0%
Weak Wind         60%                        0%

If we pick initial thresholds of 15% and 40% respectively for Clusters 1 and 2 of FlightGear, we obtain a false alarm rate of 0%. Assuming that the only data available for determining RSD thresholds is the data obtained by running the baseline version of FlightGear (which is the case for any organization adopting the approach for the first time), we can pick an initial threshold of 15% (the red line in Fig. 3) for Cluster 1 and a threshold of 40% (the blue line) for Cluster 2 to obtain a false alarm rate of 0%. These thresholds correspond to the clusters' highest median value plus a safe margin, in order not to flag deviations incorrectly.

As mentioned, for our research questions, we will also use an optimal threshold, which can only be determined once more flight data is available, but obtains the best possible performance, which on the calibration data is identical to that of the initial thresholds (i.e., a 0% false alarm rate).

For JSBSim, we use the median value of each output metric in the baseline runs as threshold. For space purposes, we did not include the baseline runs in Fig. 4. Further research is necessary to determine why the median approach worked better as initial threshold for JSBSim, compared to FlightGear.
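To make the use of these initial thresholds concrete, the following sketch flags the output metrics of a new simulator version whose RSD exceeds the threshold of their cluster. The cluster assignment, function name and example values are hypothetical and only illustrate the flagging step, not our actual implementation.

    # Hypothetical initial thresholds (in %) for the two FlightGear clusters.
    CLUSTER_THRESHOLDS = {1: 15.0, 2: 40.0}

    # Hypothetical assignment of output metrics to clusters.
    METRIC_CLUSTER = {
        "airspeed-kt": 1,
        "heading-deg": 1,
        "vertical-speed-fps": 1,
        "left-aileron-pos-norm": 2,
        "right-aileron-pos-norm": 2,
    }

    def flag_deviations(new_version_rsd):
        """Return the metrics whose RSD in the new version exceeds the
        threshold of the cluster they belong to."""
        return [metric for metric, rsd in new_version_rsd.items()
                if rsd > CLUSTER_THRESHOLDS[METRIC_CLUSTER[metric]]]

    # Example RSD values (in %) observed for a new version under test.
    print(flag_deviations({"airspeed-kt": 22.0,
                           "heading-deg": 9.5,
                           "left-aileron-pos-norm": 51.0}))
    # -> ['airspeed-kt', 'left-aileron-pos-norm']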

5. Results

This section presents the answers to the research questions based on the results of our experiments with FlightGear (RQ1/RQ2) and JSBSim (RQ2).

RQ1: What is the performance of the approach for faults that change the flight environment?

Motivation

Different weather circumstances can render a flight very different from the original flight. Although technically not a fault, the behaviour in the case of strong wind is expected to be substantially different from a baseline run, hence in the context of our experiments we consider strong wind to be a fault.


[Figure 3 shows boxplots of flight metric RSDs (x-axis from 0 to 0.8), grouped per output metric category (position, orientation, velocity, acceleration, surface_position), for the Baseline Flight, Weak Wind, Strong Wind, Flaps Failure, Auto Brake Failure and Right Engine Failure scenarios.]

Figure 3: Flight metric RSDs for FlightGear by categories, for all scenarios. Red categories contain at least one flagged metric represented by a red (Cluster 1) or a blue boxplot (Cluster 2).


[Figure 4 shows boxplots of flight metric RSDs, grouped per output metric category, for the Total Fault, Inertia Fault, Lift Fault and Pitch Fault scenarios.]

Figure 4: Flight metric RSDs for JSBSim by categories, for all scenarios. Red categories contain at least one flagged metric represented by a red (Cluster 1) or a blue boxplot (Cluster 2).


On the other hand, the introduction of a weak wind only slightly modifies the baseline flight, and should not be identified as different from the baseline behaviour. In other words, the weak wind case is a sanity check for our approach, to check for false positives.

Approach

We collected data for 3 runs of the Weak Wind FlightGear version and 3 runs of the Strong Wind FlightGear version (see the corresponding RSD values in Fig. 3). Using the initial and optimal thresholds, we obtain the precision, recall, and false alarm rate values of Table 7 (last row) and Table 8 (first row). Note that we only use FlightGear in this RQ, since it represents a full flight simulator system, whereas JSBSim is only one component and hence environment faults do not apply.

Table 8: Precision and recall for RQ1 and RQ2 (FlightGear and JSBSim).

Fault                Initial Recall   Initial Precision   Optimal Recall   Optimal Precision
Strong Wind          100%             50%                 100%             50%
Flaps Fault          100%             40%                 100%             67%
Auto Brake Fault     50%              100%                100%             100%
Right Engine Fault   100%             75%                 100%             100%
Inertia Fault        100%             44%                 N/A              N/A
Lift Fault           75%              25%                 N/A              N/A
Pitch Fault          24%              13%                 N/A              N/A
Total Fault          75%              25%                 N/A              N/A
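Since our oracle works at the level of metric categories rather than individual metrics, the evaluation maps each flagged metric to its category and compares the set of flagged categories against the oracle. The sketch below shows one way this comparison could be computed; the mappings, example values and exact formulas are simplified stand-ins for the precision, recall and false alarm rate measures used in the paper, not a verbatim reimplementation.

    def flagged_categories(flagged_metrics, metric_category):
        """Map flagged metrics to the set of categories they belong to."""
        return {metric_category[m] for m in flagged_metrics}

    def precision_recall(flagged, oracle):
        """Category-level precision and recall against the oracle."""
        true_positives = flagged & oracle
        precision = len(true_positives) / len(flagged) if flagged else 1.0
        recall = len(true_positives) / len(oracle) if oracle else 1.0
        return precision, recall

    def false_alarm_rate(flagged, all_categories, oracle):
        """Fraction of non-deviating categories that are wrongly flagged."""
        negatives = all_categories - oracle
        false_positives = flagged - oracle
        return len(false_positives) / len(negatives) if negatives else 0.0

    # Hypothetical example for one FlightGear run.
    metric_category = {"heading-deg": "orientation",
                       "vertical-speed-fps": "velocity",
                       "left-aileron-pos-norm": "surface_position"}
    all_cats = {"position", "orientation", "velocity",
                "acceleration", "surface_position"}
    flagged = flagged_categories(["heading-deg", "vertical-speed-fps"],
                                 metric_category)
    oracle = {"orientation", "velocity", "acceleration"}
    print(precision_recall(flagged, oracle))            # (1.0, 0.666...)
    print(false_alarm_rate(flagged, all_cats, oracle))  # 0.0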

Results

Some FlightGear output metrics have RSDs that are much larger than for other metrics. Especially for Strong Wind we can see that the median RSD values have doubled or even tripled compared to the calibration runs of subsection 4.5. Furthermore, across all flight metrics and both flight environment faults, there is a range of 51% between the lowest RSD median and the highest one. There are 2 particularly large deviations: the metrics "heading-deg" and "vertical-speed-fps". These are common to both faults, hence these metrics seem to have a strong relation to wind. This makes sense, since the wind for both faults increases the speed of the aircraft and the wind angle changes the aircraft's direction.

We achieve a perfect recall for the Strong Wind configuration, but only 50% precision, while the false alarm rate for the Weak Wind is 60%. For the Weak Wind fault, we wrongly flag output metrics related to the orientation, velocity, and acceleration categories because our Cluster 1 calibration threshold is too low. We obtain better results with the higher optimal threshold (false alarm rate of 0%), whereas it is not possible to get better results for the Strong Wind fault. This is because whenever we increase the Cluster 1 threshold to narrow our selection of metric categories, we first lose correctly flagged metrics predicted by our oracle before incorrectly flagged metrics start to disappear (i.e., their RSD falls below the increasing threshold). These results show that, while we do not miss any deviating behaviour (perfect recall), human analysts would still need to consider 50% false alarms (low precision).

It is possible to have no false alarms for the Weak Wind version, whereas flagging all deviations for the Strong Wind version is not possible without incurring 50% false alarms.

RQ2: What is the performance of the approach for functional faults?

Motivation

System faults can render a new version's behaviour very different from the original version and are the cause of many accidents, often impacting people's lives. For this reason, these injected faults are the most important ones on which we evaluate our approach.

Approach

We collected data for 3 runs of each of the 3 faulty FlightGear versions (see the related RSD values in Fig. 3), as well as 5 runs of each of the 4 faulty JSBSim versions (Fig. 4). We evaluated our approach using the same methodology as RQ1. Table 8 shows the performance results for each of the configurations, for both thresholds of FlightGear and the initial threshold of JSBSim.

Results

In both flight simulators, most of the faults have deviations across all metric categories.


For FlightGear, both the Flaps and Right Engine injected faults have large deviations (i.e., high RSD values), flagging almost all metric categories, except for the Position category for the Right Engine Fault. The Auto Brake Fault shows the least deviation compared to the baseline run, with RSD values no higher than 25%. For JSBSim, the Total and Lift faults have the largest RSD values, spread across all metric categories, although the deviations for different metrics vary widely. Some metrics have a very low RSD value that is still flagged as a deviation, while others have higher RSD values that are not flagged. This is due to our choice of the median of each baseline metric as the threshold for deviations.

Except for the Pitch fault, we obtain recall values of at least 75%, with four faults not missing any deviation (100% recall). For FlightGear, the Auto Brake Fault has a recall lower than 100% using the initial thresholds. Fig. 3 shows that this is likely due to accidental variation of the metrics related to the velocity and position categories. As the reported metrics for this injected fault are close to the thresholds, the deviations were narrowly missed.

For JSBSim, the Pitch fault has a recall of only 24%. This is due to the low RSDs obtained for this fault, i.e., the proportion of deviating snapshots is too similar to that of the baseline runs, causing the RSD to fall below the threshold. The better the JSBSim thresholds are calibrated, the less sensitive the approach is to fluctuations in the baseline runs and hence the higher the corresponding recall for the Pitch fault will be. In other words, given its sensitivity, we could use the Pitch fault data to better calibrate our thresholds.

Only three faults obtain a precision of at least 67%, while the remaining faults obtain a precision between 13% and 50%. In the case of FlightGear, the Flaps fault has a low precision of 40% when using our initial threshold; however, it successfully warned testers about the system's misbehaviour by covering at least all predicted metric categories (100% recall). Increasing the threshold by 40% provides a better precision of 67%. For the Right Engine Failure, our approach with the initial threshold was close to a perfect precision (75%) with a perfect recall (100%), as it only triggered a false alarm on the Surface Position category. Increasing the threshold by a few percent (+5%) achieves a perfect detection.

In JSBSim, on the other hand, the best precision is 44% for the Inertia Fault, while the Pitch fault obtains an abysmal precision of only 13%. The same explanation holds as for the low recall value of the Pitch fault, i.e., better calibration could improve this sensitive fault's results. Although random guessing would obtain a better precision (50%), the high recall values for all but the Pitch fault make our approach still worth the additional work of checking false positives.

The optimal threshold performs markedly better in terms of precision and/or recall than the initial threshold for the FlightGear faults. This suggests that similar improvements could resolve the low precision values for JSBSim. Future research should analyze this in detail.

We are able to find all injected deviations in 4 out of 7 faults, and 75% of the deviations in 2 other faults, yet especially for JSBSim the precision using the initial thresholds is low (i.e., less than 44%).

6. Discussion

6.1. Optimal thresholds

In our experiments, we have guided our initial threshold choice by the characteristics of different runs of the baseline flights, performed with different weights or light wind within the aircraft's tolerances. We either tried to minimize the false alarm rate (FlightGear) or used the median RSD values of the baseline metrics (JSBSim). However, after running our approach on multiple versions, possibly including behavioural deviations, more data becomes available that can be used to tune the thresholds, in particular to try to optimize the recall for finding deviations. This section analyzes how sensitive our approach is to the choice of thresholds, and how much we give up on precision when optimizing for recall (and vice versa). We performed the analysis for FlightGear, but a similar approach can be followed for JSBSim.

To do the analysis, we evaluated our approach's recall/precision for different thresholds in a range from 0 to 80% (since in practice higher values would lead to no deviation being found), and then plotted the recall and precision for each threshold in order to find a sweet spot where both values are high (Fig. 6). For the versions without deviations (Baseline Flight and Weak Wind), we plot the false alarm rate instead, which we aim to keep low (Fig. 5).

For instance, for the Strong Wind version (Fig. 6), the optimal threshold is 20%, since we have the highest recall of 100% and the best precision of 50%. One could also pick the 15% threshold (which offers the same recall/precision), but a higher threshold might be more conservative, which is better from the false positive point of view. The black lines in Fig. 5 and Fig. 6 are the optimal thresholds used in Section 5.
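The threshold sweep itself is straightforward to script: re-evaluate the category-level precision and recall for every candidate threshold and plot the resulting curves (as in Fig. 5 and Fig. 6). The sketch below uses the same hypothetical data layout as the earlier examples and is only meant to illustrate the sweep, not our actual tooling.

    def sweep_thresholds(rsd_per_metric, metric_category, oracle,
                         thresholds=range(0, 81, 5)):
        """Category-level precision/recall for each RSD threshold (in %)."""
        curve = []
        for t in thresholds:
            flagged = {metric_category[m]
                       for m, rsd in rsd_per_metric.items() if rsd > t}
            tp = flagged & oracle
            precision = len(tp) / len(flagged) if flagged else 1.0
            recall = len(tp) / len(oracle) if oracle else 1.0
            curve.append((t, precision, recall))
        return curve

    # Hypothetical RSD values (in %) of a faulty version, and its oracle.
    rsd = {"heading-deg": 55.0, "vertical-speed-fps": 48.0,
           "left-aileron-pos-norm": 30.0}
    cats = {"heading-deg": "orientation", "vertical-speed-fps": "velocity",
            "left-aileron-pos-norm": "surface_position"}
    oracle = {"orientation", "velocity"}
    for t, precision, recall in sweep_thresholds(rsd, cats, oracle):
        print(f"{t:3d}%  precision={precision:.2f}  recall={recall:.2f}")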


Compared to the initial thresholds, we see that a lower threshold would have been preferable to obtain perfect precision/recall for the Auto Brake Fault version (Fig. 6). For the Strong Wind version, we see that a higher precision of 100% is possible, albeit with a lower recall of 50%. The Weak Wind and Flaps Fault versions (Fig. 5 and Fig. 6) need high thresholds to reach a lower false alarm rate and a higher precision, respectively; however, for the Flaps Fault we cannot get a perfect precision (maximum of 67%). The Right Engine Fault version (Fig. 6) requires a slightly higher threshold (5%) than the initial one to obtain perfect precision/recall.

Figure 5: False alarm rate across multiple thresholds, for (a) Baseline Flight and (b) Weak Wind. The black lines represent the optimal thresholds.

As is clear from these observations, if one wants to obtain an optimal precision (with perfect recall), one would need to pick a different threshold for each new simulator version. Since in practice one does not know whether a new version contains a deviation (and if so, which one), as this is the whole point of our approach, using version-specific thresholds is impractical. Hence, in practice, the high precision values of RQ1 and RQ2 for the optimal thresholds cannot all be achieved at the same time. Yet, considering that current practices require 100% human intervention, this is still an important improvement.


Figure 6: Recall and precision values across multiple thresholds for injected faults: (a) Strong Wind, (b) Flaps Fault, (c) Auto Brake Fault, (d) Right Engine Fault. The black lines represent the optimal thresholds of a version with an injected fault. The plot for Auto Brake Fault does not show its complete precision curve, since from a given threshold on, no true or false deviations are found.

However, if one were to pick the optimal (recall) threshold of the Auto Brake Fault version for all faults, one would still achieve a precision between 40% and 80%. Although human testers might need to spend some effort analyzing falsely flagged metrics, at least this is less than having to analyze all data manually (as is currently the case).


Although the optimal thresholds are better than the initial thresholds, the initial thresholds derived from a baseline version still obtain a relatively good overall performance, except for precision in the case of JSBSim. Experiments with more metrics and versions are needed to further refine our findings.

6.2. Applicability of our approach

At the basis of our approach for detecting behaviour deviations, we need to collect data from multiple system runs of a correct version. This suffices to build a model with a decent (initial) RSD threshold. The model becomes better if one also has a set of incorrect versions, i.e., versions that did not pass qualification tests, as this allows a more accurate model and threshold to be built.

Any flight simulator vendor or user should be able to collect at least a set of correct simulator versions, either consisting of older official releases or (even better) via the simulator's version control system, which contains each developer version. As long as these versions allow flight metric data to be collected, our approach can be applied. Once data has been collected from multiple runs of a system under test to calibrate our approach, only a few minutes are needed to analyze this data. Depending on the system, collecting data is the most time-consuming task in the test analysis process.

6.3. Type of Deviations

Our approach naively detects deviations by using the proportion of samples in a given period that violate baseline rules. Since the approach essentially ignores the chronological nature of snapshots (instead of a sequence of snapshots, we use a bag of snapshots), it does not cover all possible behavioural deviations. In time series theory, there are two major types of behavioural deviations, i.e., those that impact the amplitude of a signal (i.e., the values of metrics across time) and those that impact the transients. The latter correspond to the periods in which a system transitions from one value of a metric to another. For example, if a plane is cruising at a speed of 200 knots, and the pilot directs the plane to accelerate to 260 knots, the plane will not immediately have a speed of 260 knots, but will transition according to some trend (e.g., linearly) from the initial speed to the desired one.

Our current machine learning approach only detects deviations in amplitude, whereas transient deviations are missed, because transients typically are too short and hence are not picked up by the decision tree algorithms.


For example, if we analyze the altitude of an aircraft during a full flight, the take-off and landing phases would be insignificant compared to the rest, because the aircraft is cruising most of the time (i.e., the altitude would look constant). However, a deviation during these small transition phases between ground and cruise altitude can be dangerous, e.g., an aircraft landing in a few seconds rather than a few minutes represents a danger for passengers. Hence, in future work we will adapt our approach to also support transients, which requires substantial changes.
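To illustrate what a transient-aware check could look like (this is not part of our current approach), the sketch below compares the duration of the transition between two steady metric values, here a descent from cruise altitude to the ground, in a baseline run and in a new run, and reports a deviation if the new transition is much shorter. The traces, tolerance and helper name are all hypothetical.

    def transition_duration(samples, start_value, end_value, tolerance=0.05):
        """Duration (in samples) of the transition from roughly start_value
        to roughly end_value in a chronologically ordered series."""
        start_idx = end_idx = None
        for i, value in enumerate(samples):
            if start_idx is None and abs(value - start_value) > tolerance * abs(start_value):
                start_idx = i   # left the initial steady state
            elif start_idx is not None and abs(value - end_value) <= tolerance * max(abs(end_value), 1.0):
                end_idx = i     # reached the final steady state
                break
        if start_idx is None or end_idx is None:
            return None
        return end_idx - start_idx

    # Hypothetical altitude traces (ft), sampled at 1 Hz: cruise, descent, ground.
    baseline = [10000] * 5 + list(range(10000, -1, -500)) + [0] * 5   # gradual descent
    new_run  = [10000] * 5 + list(range(10000, -1, -2500)) + [0] * 5  # abrupt descent

    base_dur = transition_duration(baseline, 10000, 0)
    new_dur = transition_duration(new_run, 10000, 0)
    if base_dur is not None and new_dur is not None and new_dur < 0.5 * base_dur:
        print(f"transient deviation: descent took {new_dur}s instead of ~{base_dur}s")

A production version would have to detect such transitions automatically and compare them per metric, which is why we consider transient support a substantial extension of the approach.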

6.4. Fault Injection vs Bugs

Because we did not have access to versions of FlightGear and JSBSim with known behavioural deviations, we artificially injected faults by changing the flight environment or flight behaviour of an aircraft model for a simulator version. In reality, our approach would work for any case where a change (e.g., a bug or new feature) introduced in the source code by a developer could impact the behaviour of the system, as long as the same use case can be run on both versions and the same flight metrics can be extracted. We expect the most typical scenarios to be between a fixed version and the buggy version, but also between an initial version of a simulator and a version with an updated COTS component, or between an initial version and a heavily refactored one.

7. Threats to Validity

Construct validity threats concern the relation between theory and observation. To evaluate the precision and recall of our approach, our oracle did not point out individual metrics, but rather categories of flight metrics. The reason for this is that the number of metrics was much larger than the number of available experts, and that the experts contacted for our oracle are domain experts who are sufficiently familiar with categories of flight metrics (Speed, Acceleration, Velocity, and Position), but not with the flight simulator used and its specific flight metrics. For example, we consider the "airspeed-kt" and the "vertical-speed-fps" metrics to be in the same Speed category. Evaluation of the approach at the individual flight-metric level, as well as subsequent root cause analysis to pinpoint the component responsible for a behavioural deviation, is left for future work.

Furthermore, some parameters of our evaluation, such as the value of K=3 for discretization and the number of runs considered, could impact our findings.


Since the generation of data for multiple runs of a use case is time-consuming, we limited our study to 20 runs. We plan to automatically generate runs using an automated pilot system for FlightGear and JSBSim.

Threats to external validity are related to the selection of one specific version of the FlightGear and JSBSim open source flight simulators for our study. Although both simulators are well-known and widely used for research purposes, they may not be representative of the whole flight simulator domain. Further studies on commercial flight simulators are needed to test the generality of our approach.

8. Conclusion

In this paper, we explored a signature-based data mining approach to flag behavioural deviations at two different levels, system and component, of safety-critical systems such as a flight simulator. We evaluated the approach on the FlightGear and JSBSim open source flight simulators, introducing faults related to both the flight environment and the flight behaviour. The results show that while we did not miss any deviating behaviour (maximum recall) related to the flight environment, human analysts would still have to consider 50% false alarms (low precision). Considering functional deviating behaviour, we were able to find all injected deviations in 4 out of 7 faults and 75% of the deviations in 2 other faults. This is promising, since it enables organizations that want to adopt this approach to obtain a decent performance out-of-the-box.
