Testing for Trustworthiness in Scientific Software
Daniel Hook, Queen's University, Kingston, Ontario
Diane Kelly, Royal Military College of Canada, Kingston, Ontario

Abstract

Two factors contribute to the difficulty of testing scientific software. One is the lack of testing oracles – a means of comparing software output to expected and correct results. The second is the large number of tests required when following any standard testing technique described in the software engineering literature. Due to the lack of oracles, scientists use judgment based on experience to assess trustworthiness, rather than correctness, of their software. This approach is well established for assessing scientific models. However, the problem of assessing software is more complex, exacerbated by the problem of code faults. This highlights the need for effective and efficient testing for code faults in scientific software. Our current research suggests that a small number of well-chosen tests may reveal a high percentage of code faults in scientific software and allow scientists to increase their trust.

1. Introduction

In a 1982 paper, Elaine Weyuker [14] pointed out that we routinely assume a software tester can determine the correctness of the output of a test. The basis of this assumption is that the tester has an oracle: a means of comparing the software's output to expected – and correct – results. Weyuker notes that in many cases an oracle is pragmatically unattainable. With scientific software, this is almost always the case. Weyuker further comments that if the oracle is unattainable, then "from the view of correctness testing ... there is nothing to be gained by performing the test." For scientific software, this is a depressing and disturbing conclusion. However, there is instead a goal of trustworthiness, rather than correctness, that can be addressed by testing.

In this paper, we first outline a typical development environment for scientific software, then briefly discuss testing. This leads to a discussion of correctness and trustworthiness from the view of the scientist who develops and uses scientific software. We then introduce new results from our analysis of samples of scientific software using mutation testing. We conclude with future directions for research.

2. An Environment for Scientific Software

Scientific software resides in a rich environment that has several layers of complexity that affect how to successfully test the software.

The computer language representation, or software, is the culmination of a series of model refinements, each of which adds its own errors and/or approximations, as shown in Figure 1.

Figure 1: Contributors to Computer Output Error for Scientific Software

The complexity of the refinements is compounded by transitions from one knowledge domain to another. For example, if we were coding an application for aircraft design, we could be using theory from physics, solution techniques from computational fluid dynamics, and algorithms and data structures from computer science.

Each of these knowledge domains contributes errors and approximations to the models embedded in the computer code. The final computer output is an accumulation of all such errors and approximations. Assessing the correctness of the computer output therefore becomes as complex as analyzing the entire environment shown in Figure 1.
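To make this blending of error sources concrete, the short sketch below is our own illustration, not code from the paper: the routine names, the seeded fault, and the choice of integrand are all assumptions for the example. It compares a correct composite trapezoid rule against a copy containing a single off-by-one fault; legitimate discretization error and the fault's contribution arrive mixed together in the one number the program prints, which is why Figure 1 treats code faults as just another contributor to computer output error.

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoid rule on n subintervals (correct version)."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):              # interior points 1 .. n-1
        total += f(a + i * h)
    return h * total

def trapezoid_faulty(f, a, b, n):
    """Same routine with a seeded off-by-one fault: the loop stops
    one interior point early, silently dropping the term f(b - h)."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n - 1):          # fault: should be range(1, n)
        total += f(a + i * h)
    return h * total

exact = 1.0 - math.cos(1.0)            # analytic value of the integral of sin(x) on [0, 1]
for n in (10, 100):
    good = trapezoid(math.sin, 0.0, 1.0, n)
    bad = trapezoid_faulty(math.sin, 0.0, 1.0, n)
    print(f"n={n}: discretization error={abs(good - exact):.2e}, "
          f"error with seeded fault={abs(bad - exact):.2e}")
```

Against a loose real-world benchmark, the faulty routine can still produce a plausible-looking value whose degraded accuracy is easily misattributed to modelling or discretization error; against the known analytic value, even the coarse-grid test exposes the fault immediately.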
3. Testing Scientific Software

Testing is usually couched in terms of verification and validation. The software engineering definitions of verification and validation, based on process [e.g., 6], provide no insight into suitable testing activities for scientific software. To confound things further, verification and validation are not consistently defined across the computational science and engineering communities [e.g., 3, 7, 9, 10, 11, 13]. This lack of consistency, plus the complexity of the scientific software environment, contributes to the omission of a major goal in testing, a goal that should be addressed by what we call code scrutinization. We suggest that for scientific software there are three separate testing goals that should be addressed, not two.

Validation for scientists primarily means checking the computer output against a reliable source: a benchmark that represents something in the real world. In the literature, validation is described by scientists as the comparison of computer output against various targets such as measurements (of either real-world or bench-test events), analytical solutions of mathematical models, simplified calculations using the computational models, or output from other computer software. Whether that target is another computer program, measurements taken in the field, or human knowledge, the goal of validation is the same: is the computer output a reasonable proximity to the real world?

Verification is also described as a comparison of the computer output to the output of other computer software or to selected solutions of the computational model. Roache succinctly calls verification "solving the equations right" [11]. This includes checking that expected values are returned and that convergence happens within reasonable times. The goal of verification is the assessment of the suitability of the algorithms and the integrity of the implementation of the mathematics.

A third goal of testing, that of searching specifically for code faults, is almost universally missing from the work practices of scientists [e.g., 12]. Yet Hatton and Roberts [5] carried out a detailed study that demonstrated that accuracy degradation due to unnoticed code faults is a severe problem in scientific software. Hatton reiterated this observation in 2007 [4], commenting that problems with code faults in scientific software have not gone away.

We have come across only one technique, developed by Roache and colleagues [7, 10, 11], that specifically addresses faults in scientific code. This technique was developed for software that solves partial differential equations (PDEs). Called the Method of Manufactured Solutions, the technique involves manufacturing an exact analytical solution for the computational model. The computer output can then be compared to the manufactured solution for accuracy and convergence characteristics. The intent is that any code faults affecting either of these will be detected. The technique is used in niches like computational fluid dynamics, but it is limited in its applicability. Its limitations are due to the difference in breadth between analytical solutions and full computational solutions, the need in some cases to alter the code to use the technique, and the fact that PDE solvers make up only a small fraction of the lines of code in a body of computational software.
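As an illustration of the Method of Manufactured Solutions, the sketch below is our own minimal construction, not code from Roache and colleagues: the one-dimensional Poisson problem, the dense solver, the grid sizes, and all names are assumptions chosen for brevity. The solution u is picked first, the forcing term f is derived from it analytically, and the numerical solver is then required to reproduce u at the convergence rate the scheme promises.

```python
import numpy as np

def solve_poisson(f, n):
    """Solve -u'' = f on (0, 1) with u(0) = u(1) = 0 using
    second-order central differences on n interior points."""
    h = 1.0 / (n + 1)
    x = np.linspace(h, 1.0 - h, n)           # interior grid points
    # Tridiagonal finite-difference matrix for -u'', scaled by 1/h^2
    A = (np.diag(2.0 * np.ones(n)) +
         np.diag(-np.ones(n - 1), 1) +
         np.diag(-np.ones(n - 1), -1)) / h**2
    u = np.linalg.solve(A, f(x))
    return x, u

# Manufacture the solution: choose u, then derive f = -u'' analytically.
def u_exact(x):
    return np.sin(np.pi * x)                 # the chosen (manufactured) solution

def f_manufactured(x):
    return np.pi**2 * np.sin(np.pi * x)      # exact forcing term for that u

errors = {}
for n in (20, 40, 80):
    x, u = solve_poisson(f_manufactured, n)
    errors[n] = np.max(np.abs(u - u_exact(x)))

# A second-order scheme should cut the error by roughly 4x per grid
# doubling; a code fault in the solver typically destroys this rate.
for n in (40, 80):
    print(n, errors[n // 2] / errors[n])     # should print values near 4
```

Because the manufactured solution supplies an exact target, a convergence ratio that falls below the expected factor of roughly four per grid doubling signals a code fault even when each individual output still looks reasonable.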
The goal of verification is the represent testing activities whose goals are within the assessment of the suitability of the algorithms and the realm of three different specialists. integrity of the implementation of the mathematics. 60 The outer cycle of testing addresses the need to If we consider the three cycles of testing in our assess the software against goals in the scientific model, scientific validation testing is about trust: is the domain. The ultimate goal of the scientist is to use the software giving output that we believe? To gain trust, software to provide data and insight [13] for problems we exercise the software in different ways and in his/her domain. This testing addresses the capability compare the output to different benchmarks in the of the software as a tool for the scientist. We call this scientific domain. As the output conforms to our testing activity scientific validation. expectations, our trust increases. The next cycle of testing addresses the integrity and Similarly for algorithm verification, all we can do is suitability of the algorithms and other solution exercise the implementation of our algorithms until our techniques used to provide the scientific solution. This trust is sufficiently high. It is well known that we is the domain of the numerical analyst. We introduce cannot do exhaustive testing. Scientists have developed the term algorithm verification and refine a definition a number of approaches to judge trustworthiness of the from Knupp and Salari [7]: "Algorithm verification is implementation of their mathematics. For example, the process by which one assesses the code checking that quantities subject to the conservation implementation of the mathematics underlying the laws are in fact conserved or that the matrix solver science". returns a message about ill-conditioning when The inner-most cycle of testing addresses code expected. faults that arise in the realization of models using a Only in the testing cycle of code scrutinization can computer language. We call this step in the scientist’s we possibly tackle the Boolean true/false of testing activity, code scrutinization. This step correctness. For a specific code segment, we may have specifically looks for faults such as one-off errors, an oracle that would allow us to determine correctness. incorrect array indices, and the like. The goal is to The impossibility of exhaustive testing still lingers ensure the integrity of the software code. This step is however. Practically, this means we make judicious not concerned with evaluating the choice of scientific choices for our testing, and we fall back on the or mathematical models. sufficiency of trust. Ideally, the order that these testing activities are