Challenges in teaching about software testing that is NOT for the purpose of finding defects.

W. Morven Gentleman, Dalhousie University, Halifax, NS, Canada, [email protected]

Introduction

Nearly all books and courses on software testing approach the topic from the perspective of a software development organization that wants to improve the quality of its product. Moreover, they treat the topic as synonymous with finding and fixing faults. Typical examples are [1,2,3,4,5,6,7,8,9,10]. This is indeed an important perspective for industry, and one that many students will experience in practice. However, it is not the only perspective under which industry tests software, and in trying to prepare students to test software in some of these other situations, we have become aware of challenges to overcome that arise from the biases and shortcomings in the backgrounds of the students.

A very few books take a broader view. Books on usability testing [11,12] focus on the interaction between the human operator and the software, where demands on the operator’s cognitive abilities, manual dexterity and avoidance of habitual blunders are being examined, not coding errors. Although I have not examined the books themselves, it appears from the titles and tables of contents of Perry’s books [13,14] that he addresses problems possibly more of interest to customers than developers, such as testing off-the-shelf software or evaluating alternative products, as well as addressing problems beyond normal operation, such as the adequacy of installation instructions and system documentation.

Our immediate concerns arise from a new course on performance assessment being introduced at Dalhousie for 4th year undergraduates. Although this course, described in Appendix A, will be offered to undergrads for the first time next year, we have had experience with the challenges students face when taking such a course, because for some time graduate students have had to study the material. The course is not dissimilar to courses on performance measurement offered in many other schools. We believe that the issues go far beyond this course, and affect the ability of students to learn about testing in other situations where the purpose of testing the software is other than finding and fixing faults.

Software testing situations where defect detection is not the primary objective

Before commenting on pedagogical issues, we first need to establish that teaching about this kind of situation is warranted. How common are these situations, how important are they, how varied are they, and are they novel or well understood?

Testing a software artifact to measure quantitative attributes of the software forms one class of situations where detecting defects is not the primary objective. In many cases the quantitative attribute may depend on one or more parameters, and empirically discovering an approximate model of this dependency is often an essential result of such testing. Customers often conduct such testing, not just vendors.

1) Testing to meet regulatory or contractual requirements: the simplest quantitative measurement of a software system is often to demonstrate that there exist circumstances under which the software can comply with some quantitative requirement. Such experiments too often constitute part of a vendor-defined acceptance test.
2) Benchmarking (a minimal measurement-and-fitting sketch follows this list)
   a) measuring throughput, response time and other performance aspects, as well as resource consumption, for a specific hardware, software and networking configuration presented with a specified load
   b) comparing two or more hardware, software and networking configurations presented with the same load
3) Sizing and capacity planning
   a) deriving an approximate model that can predict a hardware, software and networking configuration adequate to meet a specified load
   b) balancing sizing predictions corresponding to different load characterizations to recommend an adequate, affordable hardware, software and networking configuration
4) Tuning performance to observed load
   a) where there are one or more continuous parameters, determining optimal values by smoothing, surface fitting and gradient following
   b) where parameters take discrete values, as for numbers of replications in a configuration, determining optimal values by surface fitting and integer programming
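To make the benchmarking and sizing activities in items 2) and 3) concrete, here is a minimal sketch in Python. The work() function, the payload sizes and the straight-line model are illustrative assumptions standing in for a real system under test and a real load parameter; they are not a prescription.

```python
import time
import statistics

def work(payload_size):
    # Hypothetical stand-in for a request to the system under test;
    # it just builds and sorts a list so the script is self-contained.
    return sorted(range(payload_size, 0, -1))

def measure(payload_size, repetitions=20):
    """Return the mean wall-clock time for one call at this payload size."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        work(payload_size)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)

# Measure response time at several load levels (an illustrative parameter sweep).
sizes = [1_000, 2_000, 4_000, 8_000, 16_000]
times = [measure(n) for n in sizes]

# Fit a simple empirical model t ~ a + b*size by ordinary least squares.
n_bar = statistics.mean(sizes)
t_bar = statistics.mean(times)
num = sum((n - n_bar) * (t - t_bar) for n, t in zip(sizes, times))
den = sum((n - n_bar) ** 2 for n in sizes)
b = num / den
a = t_bar - b * n_bar

for n, t in zip(sizes, times):
    print(f"size={n:6d}  mean time={t * 1e6:8.1f} us")
print(f"fitted model: t ~ {a:.2e} + {b:.2e} * size")

# The fitted model can then be used for sizing: predict the time at a load
# that was not measured directly, and check the prediction experimentally.
print(f"predicted time at size=32000: {(a + b * 32_000) * 1e6:.1f} us")
```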

Situations where what is to be measured is some aspect of human behaviour when interacting with the software under test, rather than an attribute of the software itself, form another class for which defect detection is not primary. This testing is often performed by user organizations, not just developers.

5) Usability testing (a minimal logging-and-summary sketch follows this list)
   a) measuring time taken to learn to use an interface
   b) measuring improvement in skill with practice
   c) measuring time taken to perform specific tasks
   d) measuring memory load imposed on the user
   e) measuring accuracy of user data entry
   f) measuring complexity of recovery procedures, including undo
   g) measuring propensity for blunders
   h) evaluating subjective impressions
6) Assessing training requirements for operators, installers, etc.
7) Assessing change management activities (including changes to business processes) required before general rollout
8) Tool adoption: studying how to make effective use of a tool in your environment
   a) measuring achieved productivity improvements, e.g. by throughput
   b) measuring quality improvement by the human/computer system
   c) measuring user acceptance, i.e. willingness to use in practice
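Several of the usability measurements under item 5 (time taken to perform specific tasks, accuracy of data entry) reduce to collecting timestamped observations per participant and summarizing them. The sketch below is a hedged illustration over made-up records; the participant identifiers, task names and numbers are hypothetical, and in a real study they would come from instrumented software or an observer's log.

```python
import statistics

# Hypothetical session records: (participant, task, seconds taken, errors made).
observations = [
    ("P1", "create_invoice", 148.2, 1),
    ("P2", "create_invoice", 212.5, 3),
    ("P3", "create_invoice", 131.0, 0),
    ("P1", "undo_last_entry", 34.9, 0),
    ("P2", "undo_last_entry", 61.3, 2),
    ("P3", "undo_last_entry", 45.0, 1),
]

def summarize(task):
    """Summarize completion time and data-entry accuracy for one task."""
    times = [t for _, name, t, _ in observations if name == task]
    errors = [e for _, name, _, e in observations if name == task]
    return {
        "participants": len(times),
        "median_seconds": statistics.median(times),
        "max_seconds": max(times),
        "errors_per_attempt": sum(errors) / len(errors),
    }

for task in sorted({name for _, name, _, _ in observations}):
    print(task, summarize(task))
```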

Situations where the investigation is more open-ended, where an essential part of the experiment is finding the questions that need to be asked, form yet another class.

9) Validating a system model or specification, i.e. confirming that nothing has been overlooked, and that there are no exceptional circumstances which the model or specification does not treat correctly
10) Competitive intelligence (studying competitors’ products)
11) Interoperability testing (looking for incompatibilities, not testing glue code)
12) Identifying a way to successfully make use of a product

This list is incomplete; nevertheless it is sufficient to illustrate that there are many familiar situations where software is tested for purposes other than defect detection, and where the fundamental skills testers need concern what should be measured, what experiments should be performed to make the measurements, how the experimental measurements should be analyzed, and how the results of the analysis should be interpreted.

Observations on teaching experience

Testing software, whether to detect defects or not, is all about experiments. I observe that a large proportion of the students have great difficulty assimilating experimental techniques and experimental evidence. They cannot conceive of interesting questions that could be investigated experimentally. They cannot propose experiments to resolve such questions. They cannot suggest approaches to analyzing experimental data. They cannot interpret experimental results. They cannot recognize when empirical evidence contradicts their assumptions. They do not notice anomalies that might indicate unsuspected mechanisms at work. They cannot manage an investment of a fixed level of experimental effort.

Perhaps surprisingly, this observation applies to technically trained students. It is especially true of computer science students, but may also apply to mathematics students, theoretical physics students and even some engineering students. My diagnosis is that many of these students reach this point in their studies with no prior exposure to the culture of experimental science. They are not imbued with the fundamental precept that all asserted truth must ultimately be founded on experimental observation, and that theory should be seen as a summary of experimental evidence. Whereas students from more empirically based sciences regard theory as an insightful approximation, these students simply accept it as fact. Given a mathematical model of a problem, they are unable to distinguish that model from physical reality, and they do not question it. The biggest pedagogical challenge is thus overcoming this mindset. The real remedy probably necessitates changes earlier in the curriculum, but given the presence of such students, I resort to trying to develop their sophistication through lab experience with convincing examples.

My other observation about such students being introduced to software testing that is not primarily focused on defect detection is that prior study of software testing for defect detection does not necessarily help them. I believe this is in part because the course instructor or the book they learned from has encouraged them to view all software testing as searching for defects. The students then attempt to apply strategies, such as boundary value testing, that they learned for detecting defects (where those strategies are appropriate) to other forms of software testing where they are not appropriate.

There is, of course, some rationale for this point of view. A typical definition of a defect is “nonconformance to requirements or functional / program specification”. However, searching the web, we can find broader definitions. For instance, “a defect is any flaw in the specification, design, or implementation of a product.” More insightfully: “Operationally, it is useful to work with two definitions of a defect: (1) From the producer’s viewpoint: a product requirement that has not been met, or an attribute of a product that is not in the statement of requirements that define the product. (2) From the customer’s viewpoint: anything that causes customer dissatisfaction, whether in the statement of requirements or not.” In the world of commercial software products (those sold in essentially identical form to many diverse customers), any given customer is unlikely to have had direct input to the requirements and specification to which the product is built, and indeed is likely to be faced with a choice between products that not only are built to different specifications, but for which the detailed specifications may not be available. In such a case the customer is unlikely to care about defects defined by conformance to much of the specification, but will certainly regard defects as serious when they lead to his dissatisfaction.

Thus my quarrel with the view that all software testing is about defect detection is not epistemological, but practical: viewing all empirical investigations as searching for defects does not help plan or evaluate the investigation. IBM might well choose to regard it as a defect that in some circumstances Oracle will outperform DB2, but that does not help identify what those circumstances are. Designers of different products make different tradeoffs when choosing data structures and algorithms, and it is exceedingly unlikely that one choice will be uniformly superior to all others, so circumstances that favour any given product over the others can almost always be found. Again, lab experience with convincing examples seems the best hope for improving the sophistication with which students evaluate product comparisons.
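A small experiment of this kind is easy to set up. The sketch below is a hedged illustration rather than a product comparison: it times two reasonable implementation choices for membership testing across a range of collection sizes. Which one wins depends on the circumstances (here, collection size), which is exactly the point.

```python
import timeit
from bisect import bisect_left

def linear_member(seq, x):
    # Simple scan: no preprocessing, cheap for small collections.
    return x in seq

def binary_member(sorted_seq, x):
    # Binary search: pays off once the collection is large enough.
    i = bisect_left(sorted_seq, x)
    return i < len(sorted_seq) and sorted_seq[i] == x

for n in (4, 64, 1024, 16384):
    data = list(range(n))
    probe = n - 1  # worst case for the linear scan
    t_linear = timeit.timeit(lambda: linear_member(data, probe), number=5000)
    t_binary = timeit.timeit(lambda: binary_member(data, probe), number=5000)
    print(f"n={n:6d}  linear={t_linear:.4f}s  binary={t_binary:.4f}s")

# Neither choice is uniformly superior: the crossover point depends on the
# data, the probe distribution and the runtime, so it has to be measured.
```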

References

1. A Practitioner's Guide to Software Test Design -- by Lee Copeland, Artech House Publishers, 2004.
2. Systematic Software Testing (Artech House Computer Library) -- by Rick D. Craig and Stefan P. Jaskiel, Artech House Publishers, 2002.
3. Testing Computer Software, 2nd Edition -- by Cem Kaner, Jack Falk and Hung Q. Nguyen, Wiley, 1999.
4. Software Test Automation (ACM Press) -- by Mark Fewster and Dorothy Graham, Addison-Wesley Professional, 1999.
5. Just Enough Software Test Automation -- by Daniel J. Mosley and Bruce A. Posey, Prentice Hall, 2002.
6. Software Testing Fundamentals: Methods and Metrics -- by Marnie L. Hutcheson, Wiley, 2003.
7. Software Testing: A Craftsman's Approach, Second Edition -- by Paul C. Jorgensen, CRC, 2002.
8. The Art of Software Testing, Second Edition -- by Glenford J. Myers, Corey Sandler, Tom Badgett and Todd M. Thomas, Wiley, 2004.
9. Black-Box Testing: Techniques for Functional Testing of Software and Systems -- by Boris Beizer, Wiley, 1995.
10. Testing and Quality Assurance for Component-Based Software (Artech House Computer Library) -- by Jerry Zeyu Gao, H.-S. Jacob Tsao and Ye Wu, Artech House Publishers, 2003.
11. Handbook of Usability Testing: How to Plan, Design, and Conduct Effective Tests -- by Jeffrey Rubin, Wiley, 1994.
12. Understanding Interfaces: A Handbook of Human-Computer Dialog -- by Mark W. Lansdale and Thomas C. Ormerod, Academic Press, 1994.
13. How to Test Software Packages: A Step-by-Step Guide to Assuring They Do What You Want -- by William E. Perry, John Wiley & Sons, 1986.
14. Effective Methods for Software Testing, 2nd Edition -- by William E. Perry, Wiley, 2000.

Appendix A

CS 4138 Empirical Performance Modeling

Rationale
Quantitative empirical investigation is underemphasized in computer science culture: experimental science is not normally in the curriculum, algorithm and queuing network analyses are based only on theoretical (typically asymptotic) models, testing for defects assumes that the only outcomes are success or failure, and so on. Nevertheless there are many situations, especially for design choices and tuning, where quantitative prediction of behaviour is important, and where practical approximations to functional dependencies are needed. This course is designed to fill that gap.

Objectives
The objective of this course is to prepare students to use empirical performance models. This includes choosing instrumentation, designing experiments, analyzing experimental observations, and modeling the data.

Target Audience
Students who may go on to have operational responsibility for software systems need to understand how to tune them to achieve performance under the presented load. Students who go on to have design responsibility and must choose between alternative implementations based on measured performance need to understand how to make and analyze such measurements. Introductory statistics provides the perspective to think about experimental data, and analysis of algorithms provides the perspective to think about measures of performance and choices of algorithms and data structures, but it is unfortunate that no course provides motivation by developing experience using large software systems. This course is required for Bachelor of Software Engineering students.

Calendar Description

This course addresses testing of actual or simulated systems for quantitative measurement and prediction from empirical models. Topics include: Motivations for quantitative assessment. Measures of load and performance. Instrumentation and challenges in measuring attributes of software artifacts. Design of experiments for efficiently measuring software. Methods for analysis of observed data and interpretation of results.

Recommended prerequisites: ENGM 2032 or MATH 2060, CSCI 3110

Detailed course structure (topics that may be covered)

Motivation
  Performance estimation (absolute or as a function of parameters): throughput, response time, user load, capacity, propagation delay, …
  Resource consumption and sizing (absolute or as a function of parameters): memory consumption, backing store, bus bandwidth, buffer usage
  Characterization of load: gathering of statistical data regarding usage, deployment, traffic, etc. For instance: number of servers, number of clients, maximum number of concurrent users, message size and throughput per client
  Stress testing: discontinuities, failure modes, …
  Performance tuning: optimal parameter values and sensitivity; finding minima, graceful degradation, …
  Reliability estimation: defect rates, time to fatal error, missed event rate, help desk load, …
  Hypothesis testing: redesign improves performance, multiprocessor speedup is linear, …
  Statistical significance vs. practical significance (see the sketch after this list)
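The hypothesis-testing and significance topics above lend themselves to a small worked example. The sketch below uses made-up response-time samples for two hypothetical configurations, computes Welch's two-sample t statistic by hand, and contrasts statistical significance with practical significance; the 10 ms practical threshold is an assumption chosen only for illustration.

```python
import math
import statistics

# Hypothetical response-time samples (milliseconds) for two configurations.
config_a = [102.1, 99.8, 101.5, 100.9, 103.2, 100.4, 101.8, 102.6]
config_b = [100.3, 99.1, 100.8, 99.7, 101.0, 99.9, 100.5, 100.2]

mean_a, mean_b = statistics.mean(config_a), statistics.mean(config_b)
var_a, var_b = statistics.variance(config_a), statistics.variance(config_b)
n_a, n_b = len(config_a), len(config_b)

# Welch's two-sample t statistic (no equal-variance assumption).
t = (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)

print(f"mean A = {mean_a:.2f} ms, mean B = {mean_b:.2f} ms")
print(f"difference = {mean_a - mean_b:.2f} ms, Welch t = {t:.2f}")

# Statistical vs. practical significance: even if |t| is large enough to
# reject "no difference", a ~1 ms improvement may be irrelevant when the
# practically meaningful threshold is, say, 10 ms.
PRACTICAL_THRESHOLD_MS = 10.0
print("practically significant?", abs(mean_a - mean_b) >= PRACTICAL_THRESHOLD_MS)
```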

Measurements
  Probes and other instrumentation
  Black box vs. internal observation
  Challenges with human-in-the-loop
  Simultaneity and other challenges in distributed systems
  Counts
  Contingency tables
  Queue lengths
  Measurement scales (transformation of variables), e.g. time for an operation, operations per unit time, log time, …
  Sampling
  Replication & reproducibility (including fractional replication)
  Paired comparison, cohorts
  Time series
  Event logs (including time of event)
  Heisenberg effect
  Measurement artifact
  Synchronous confounding
  Missing data, trimmed & censored data
  Quantiles and order statistics (see the sketch after this list)
  Recapture of tagged samples
  Surrogate measures
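A few of these measurement topics (black-box timing, replication, quantiles and order statistics) can be illustrated in a handful of lines. The sketch below is generic: operation_under_test is a hypothetical stand-in for a call into whatever system is actually being measured.

```python
import time
import statistics

def operation_under_test():
    # Hypothetical black-box operation; replace with a call into the real system.
    sum(i * i for i in range(5_000))

def replicate(op, repetitions=200):
    """Time an operation many times and keep every sample, not just the mean."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        op()
        samples.append(time.perf_counter() - start)
    return samples

samples = replicate(operation_under_test)

# Order statistics are often more informative than the mean alone: the median
# resists outliers, and the tail quantiles expose variability, including
# measurement artifact such as interference from other processes.
percentiles = statistics.quantiles(samples, n=100)  # cut points at 1%..99%
print(f"mean   = {statistics.mean(samples) * 1e3:.3f} ms")
print(f"median = {statistics.median(samples) * 1e3:.3f} ms")
print(f"p95    = {percentiles[94] * 1e3:.3f} ms")
print(f"p99    = {percentiles[98] * 1e3:.3f} ms")
```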

Design of experiments
  Sequential sampling
  Survey sampling (e.g. beta testing)
  Factors and covariates
  Controlled variables, i.e. factors
  Enumerated factors vs. continuous factors
  Balanced factors
  Randomized factors (a balanced, randomized design is sketched after this list)
  Importance sampling
  Weighting observations
  Sampled factors (Type II ANOVA)
  Regressed-out factors (residual analysis)
  Test data selection
  Exploring the parameter space
  Stimulation
  Simulated load vs. real load
  Background load
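As an illustration of enumerated factors, a balanced design and a randomized run order, the sketch below generates a replicated full-factorial plan. The factor names and levels are placeholders rather than properties of any particular system, and measure_system is a hypothetical hook.

```python
import itertools
import random

# Hypothetical controlled factors and their levels (placeholders).
factors = {
    "cache_size_mb": [64, 256, 1024],
    "worker_threads": [1, 4, 16],
    "compression": ["on", "off"],
}

# Full-factorial design: every combination of levels appears equally often,
# so the design is balanced with respect to each factor.
names = list(factors)
design = [dict(zip(names, levels))
          for levels in itertools.product(*factors.values())]

# Replicate each treatment, then randomize the run order so that drift in the
# environment (a nuisance variable) is not confounded with any factor.
REPLICATES = 2
runs = design * REPLICATES
random.seed(1)          # fixed seed only so the printed plan is reproducible
random.shuffle(runs)

for i, treatment in enumerate(runs, start=1):
    print(f"run {i:2d}: {treatment}")
    # measure_system(treatment)  # hypothetical measurement hook
```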

Analysis
  Distributional assumptions and robustness
  Empirical data analysis
  ANOVA
  Regression, modeling (curve fitting)
  Linear models, including polynomial regression
  Stepwise regression
  Functional forms
  Nonlinear models
  Bootstrapping, jackknifing and other resampling plans (see the sketch after this list)
  Data mining
  Outliers
  Composing fitted functions
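Two of these analysis topics (fitting a linear model, bootstrap resampling to judge the stability of the fit) are sketched below over synthetic data. numpy is assumed to be available, and the coefficients are illustrative rather than drawn from any real measurement.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic observations: response time grows roughly linearly with load,
# plus noise. In practice x and y would come from designed experiments.
x = np.repeat(np.array([10, 20, 40, 80, 160]), 6).astype(float)
y = 0.5 + 0.03 * x + rng.normal(scale=0.4, size=x.size)

# Ordinary least-squares fit of a first-degree polynomial (a linear model).
slope, intercept = np.polyfit(x, y, deg=1)
print(f"fit: time ~ {intercept:.3f} + {slope:.4f} * load")

# Bootstrap: refit on resampled data to see how stable the slope estimate is.
boot_slopes = []
for _ in range(2000):
    idx = rng.integers(0, x.size, size=x.size)   # resample with replacement
    s, _ = np.polyfit(x[idx], y[idx], deg=1)
    boot_slopes.append(s)
low, high = np.percentile(boot_slopes, [2.5, 97.5])
print(f"bootstrap 95% interval for the slope: [{low:.4f}, {high:.4f}]")
```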

Suggested text: Raj Jain, The Art of Computer Systems Performance Analysis, Wiley, 1991.

Possible instructors

Schedule: 3 hours lecture, 3 hours lab

Lab or tutorial material
The purpose of the lab is to give students exposure to and practical experience with standard challenges and the tools to address them. Lab materials therefore include not only the tools themselves, together with their documentation, but also sample software systems that need to be modeled.
