The Oracle Problem in Software Testing: A Survey
Earl T. Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz and Shin Yoo

Abstract—Testing involves examining the behaviour of a system in order to discover potential faults. Given an input for a system, the challenge of distinguishing the corresponding desired, correct behaviour from potentially incorrect behaviour is called the “test oracle problem”. Test oracle automation is important to remove a current bottleneck that inhibits greater overall test automation. Without test oracle automation, the human has to determine whether observed behaviour is correct. The literature on test oracles has introduced techniques for oracle automation, including modelling, specifications, contract-driven development and metamorphic testing. When none of these is completely adequate, the final source of test oracle information remains the human, who may be aware of informal specifications, expectations, norms and domain specific information that provide informal oracle guidance. All forms of test oracles, even the humble human, involve challenges of reducing cost and increasing benefit. This paper provides a comprehensive survey of current approaches to the test oracle problem and an analysis of trends in this important area of software testing research and practice.

Index Terms—Test oracle; Automatic testing; Testing formalism.

1 INTRODUCTION

Much work on software testing seeks to automate as much of the test process as practical and desirable, to make testing faster, cheaper, and more reliable. To this end, we need a test oracle, a procedure that distinguishes between the correct and incorrect behaviors of the System Under Test (SUT).

However, compared to many aspects of test automation, the problem of automating the test oracle has received significantly less attention, and remains comparatively less well-solved. This current open problem represents a significant bottleneck that inhibits greater test automation and uptake of automated testing methods and tools more widely. For instance, the problem of automatically generating test inputs has been the subject of research interest for nearly four decades [46], [108]. It involves finding inputs that cause execution to reveal faults, if they are present, and to give confidence in their absence, if none are found. Automated test input generation has been the subject of many significant advances in both Search-Based Testing [3], [5], [83], [127], [129] and Dynamic Symbolic Execution [75], [109], [162]; yet none of these advances address the issue of checking generated inputs with respect to expected behaviours—that is, providing an automated solution to the test oracle problem.

Of course, one might hope that the SUT has been developed under excellent design-for-test principles, so that there might be a detailed, and possibly formal, specification of intended behaviour. One might also hope that the code itself contains pre- and post-conditions that implement well-understood contract-driven development approaches [136]. In these situations, the test oracle cost problem is ameliorated by the presence of an automatable test oracle to which a testing tool can refer to check outputs, free from the need for costly human intervention.

Where no full specification of the properties of the SUT exists, one may hope to construct a partial test oracle that can answer questions for some inputs. Such partial test oracles can be constructed using metamorphic testing (built from known relationships between desired behaviours) or by deriving oracular information from execution or documentation.

For many systems and most testing as currently practiced in industry, however, the tester does not have the luxury of formal specifications or assertions, or automated partial test oracles [91], [92]. The tester therefore faces the daunting task of manually checking the system’s behaviour for all test cases. In such cases, automated software testing approaches must address the human oracle cost problem [1], [82], [131].

To achieve greater test automation and wider uptake of automated testing, we therefore need a concerted effort to find ways to address the test oracle problem and to integrate automated and partially automated test oracle solutions into testing techniques. This paper seeks to help address this challenge by providing a comprehensive review and analysis of the existing literature of the test oracle problem.

Four partial surveys of topics relating to test oracles precede this one. However, none has provided a comprehensive survey of trends and results. In 2001, Baresi and Young [17] presented a partial survey that covered four topics prevalent at the time the paper was published: assertions, specifications, state-based conformance testing, and log file analysis. While these topics remain important, they capture only a part of the overall landscape of research in test oracles, which the present paper covers. Another early work was the initial motivation for considering the test oracle problem contained in Binder’s textbook on software testing [23], published in 2000. More recently, in 2009, Shahamiri et al. [165] compared six techniques from the specific category of derived test oracles. In 2011, Staats et al. [174] proposed a theoretical analysis that included test oracles in a revisitation of the fundamentals of testing. Most recently, in 2014, Pezzè et al. focus on automated test oracles for functional properties [151].

Despite this work, research into the test oracle problem remains an activity undertaken in a fragmented community of researchers and practitioners. The role of the present paper is to overcome this fragmentation in this important area of software testing by providing the first comprehensive analysis and review of work on the test oracle problem.

The rest of the paper is organised as follows: Section 2 sets out the definitions relating to test oracles that we use to compare and contrast the techniques in the literature. Section 3 relates a historical analysis of developments in the area. Here we identify key milestones and track the volume of past publications. Based on this data, we plot growth trends for four broad categories of solution to the test oracle problem, which we survey in Sections 4–7. These four categories comprise approaches to the oracle problem where:
• test oracles can be specified (Section 4);
• test oracles can be derived (Section 5);
• test oracles can be built from implicit information (Section 6); and
• no automatable oracle is available, yet it is still possible to reduce human effort (Section 7).
Finally, Section 8 concludes with closing remarks.

2 DEFINITIONS

This section presents definitions to establish a lingua franca in which to examine the literature on oracles. These definitions are formalised to avoid ambiguity, but the reader should find that it is also possible to read the paper using only the informal descriptions that accompany these formal definitions. We use the theory to clarify the relationship between algebraic specification, pseudo oracles, and metamorphic relations in Section 5.

To begin, we define a test activity as a stimulus or response, then test activity sequences that incorporate constraints over stimuli and responses. Test oracles accept or reject test activity sequences, first deterministically, then probabilistically. We then define notions of soundness and completeness of test oracles.

2.1 Test Activities

To test is to stimulate a system and observe its response. A stimulus and a response both have values, which may coincide, as when the stimulus value and the response are both reals. A system has a set of components C. A stimulus and its response target a subset of components. For instance, a common pattern for constructing test oracles is to compare the output of distinct components on the same stimulus value. Thus, stimuli and responses are values that target components. Collectively, stimuli and responses are test activities:

Definition 2.1 (Test Activities). For the SUT p, S is the set of stimuli that trigger or constrain p’s computation and R is the set of observable responses to a stimulus of p. S and R are disjoint. Test activities form the set A = S ⊎ R.

The use of disjoint union implicitly labels the elements of A, which we can flatten to the tuple L × C × V, where L = {stimulus, response} is the set of activity labels, C is the set of components, and V is an arbitrary set of values. To model those aspects of the world that are independent of any component, like a clock, we set an activity’s target to the empty set.

We use the terms “stimulus” and “observation” in the broadest sense possible to cater to various testing scenarios, functional and nonfunctional. As shown in Figure 1, a stimulus can be either an explicit test input from the tester, I ⊂ S, or an environmental factor that can affect the testing, S \ I. Similarly, an observation ranges from an output of the SUT, O ⊂ R, to a nonfunctional execution profile, like execution time in R \ O.

Fig. 1. Stimulus and observations: S is anything that can change the observable behavior of the SUT f; R is anything that can be observed about the system’s behavior; I includes f’s explicit inputs; O is its explicit outputs; everything not in S ∪ R neither affects nor is affected by f.

For example, stimuli include the configuration and platform settings, database table contents, device states, resource constraints, preconditions, typed values at an input device, inputs on a channel from another system, sensor inputs and so on. Notably, resetting a SUT to an initial state is a stimulus and stimulating the SUT with an input runs it. Observations include anything that can be discerned and ascribed a meaning significant to the purpose of testing — including values that appear on an output device, database state, temporal properties of the execution, heat dissipated during execution, power consumed, or any other measurable attributes of its execution. Stimuli and observations are members of different sets of test activities, but we combine them into test activities.

2.2 Test Activity Sequence

Testing is a sequence of stimuli and response observations. The relationship between stimuli and responses can often be captured formally; consider a simple SUT that squares its input. To compactly represent infinite relations between stimulus and response values such as (i, o = i²), we introduce a compact notation for set comprehensions:

x:[φ] = {x | φ},

where x is a dummy variable over an arbitrary set.

Definition 2.2 (Test Activity Sequence). A test activity sequence is an element of TA = {w | T →* w} over the grammar

T ::= A ‘:[’ φ ‘]’ T | A T | ε

where A is the test activity alphabet.

Under Definition 2.2, the testing activity sequence io:[o = i²] denotes the stimulus of invoking f on i, then observing the response output. It further specifies valid responses obeying o = i². Thus, it compactly represents the infinite set of test activity sequences i₁o₁, i₂o₂, ···, where oₖ = iₖ².

For practical purposes, a test activity sequence will almost always have to satisfy constraints in order to be useful. Under our formalism, these constraints differentiate the approaches to test oracles we survey. As an initial illustration, we constrain a test activity sequence to obtain a practical test sequence:

Definition 2.3 (Practical Test Sequence). A practical test sequence is any test activity sequence w that satisfies

w = T s T r T, for s ∈ S, r ∈ R.

Thus, the test activity sequence, w, is practical iff w contains at least one stimulus followed by at least one observation.

This notion of a test sequence is nothing more than a very general notion of what it means to test; we must do something to the system (the stimulus) and subsequently observe some behaviour of the system (the observation) so that we have something to check (the observation) and something upon which this observed behaviour depends (the stimulus).

A reliable reset (p, r) ∈ S is a special stimulus that returns the SUT’s component p to its start state. The test activity sequence (stimulus, p, r)(stimulus, p, i) is therefore equivalent to the conventional application notation p(i). To extract the value of an activity, we write v(a); to extract its target component, we write c(a). To specify two invocations of a single component on different values, we must write r₁i₁r₂i₂:[r₁, i₁, r₂, i₂ ∈ S, c(r₁) = c(i₁) = c(r₂) = c(i₂) ∧ v(i₁) ≠ v(i₂)]. In the sequel, we often compare different executions of a single SUT or compare the output of independently implemented components of the SUT on the same input value. For clarity, we introduce syntactic sugar to express constraints on stimulus values and components. We let f(x) denote ri:[c(i) = f ∧ v(i) = x], for f ∈ C.

A test oracle is a predicate that determines whether a given test activity sequence is an acceptable behaviour of the SUT or not. We first define a “test oracle”, and then relax this definition to “probabilistic test oracle”.

Definition 2.4 (Test Oracle). A test oracle D : TA ↦ B is a partial function from a test activity sequence to true or false.

(Footnote 1: Recall that a function is implicitly total: it maps every element of its domain to a single element of its range. The partial function f : X ↦ Y is the total function f′ : X′ → Y, where X′ ⊆ X.)

When a test oracle is defined for a test activity, it either accepts the test activity or not. Concatenation in a test activity sequence denotes sequential activities; the test oracle D permits parallel activities when it accepts different permutations of the same stimuli and response observations. We use D to distinguish a deterministic test oracle from probabilistic ones. Test oracles are typically computationally expensive, so probabilistic approaches to the provision of oracle information may be desirable even where a deterministic test oracle is possible [125].

Definition 2.5 (Probabilistic Test Oracle). A probabilistic test oracle D̃ : TA ↦ [0, 1] maps a test activity sequence into the interval [0, 1] ⊂ R.

A probabilistic test oracle returns a real number in the closed interval [0, 1]. As with test oracles, we do not require a probabilistic test oracle to be a total function. A probabilistic test oracle can model the case where the test oracle is only able to efficiently offer a probability that the test case is acceptable, or other situations where some degree of imprecision can be tolerated in the test oracle’s response.

Our formalism combines a language-theoretic view of stimulus and response activities with constraints over those activities; these constraints explicitly capture specifications. The high-level language view imposes a temporal order on the activities. Thus, our formalism is inherently temporal. The formalism of Staats et al. captures any temporal exercising of the SUT’s behavior in tests, which are atomic black boxes for them [174]. Indeed, practitioners write test plans and activities; they do not often write specifications at all, let alone formal ones. This fact and the expressivity of our formalism, as evident in our capture of existing test oracle approaches, is evidence that our formalism is a good fit with practice.

2.3 Soundness and Completeness

We conclude this section by defining soundness and completeness of test oracles. In order to define soundness and completeness of a test oracle, we need to define a concept of the “ground truth”, G. The ground truth is another form of oracle, a conceptual oracle, that always gives the “right answer”. Of course, it cannot be known in all but the most trivial cases, but it is a useful definition that bounds test oracle behaviour.

Definition 2.6 (Ground Truth). The ground truth oracle, G, is a total test oracle that always gives the “right answer”.

We can now define soundness and completeness of a test oracle with respect to G.

Definition 2.7 (Soundness). The test oracle D is sound iff

D(a) ⇒ G(a)

Definition 2.8 (Completeness). The test oracle D is complete iff

G(a) ⇒ D(a)

While test oracles cannot, in general, be both sound and complete, we can, nevertheless, define and use partially correct test oracles. Further, one could argue, from a purely philosophical point of view, that human oracles can be sound and complete, or correct. In this view, correctness becomes a subjective human assessment. The foregoing definitions allow for this case.

We relax our definition of soundness to cater for probabilistic test oracles:

Definition 2.9 (Probabilistic Soundness and Completeness). A probabilistic test oracle D̃ is probabilistically sound iff

P(D̃(w) = 1) > 1/2 + ε ⇒ G(w)

and D̃ is probabilistically complete iff

G(w) ⇒ P(D̃(w) = 1) > 1/2 + ε

where ε is non-negligible.

The non-negligible advantage ε requires D̃ to do sufficiently better than flipping a fair coin (which, for a binary classifier, maximizes entropy) that we can achieve arbitrary confidence in whether the test sequence w is valid by repeatedly sampling D̃ on w.

3 TEST ORACLE RESEARCH TRENDS

The term “test oracle” first appeared in William Howden’s seminal work in 1978 [99]. In this section, we analyze the research on test oracles, and its related areas, conducted since 1978. We begin with a synopsis of the volume of publications, classified into specified, derived, implicit, and lack of automated test oracles. We then discuss when key concepts in test oracles were first introduced.
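To make the definitions of Section 2 concrete, the following sketch (ours, not from the survey; all names are illustrative) models a deterministic test oracle for the squaring SUT of Section 2.2 as a predicate over (stimulus, response) pairs, and a probabilistic test oracle whose non-negligible advantage is amplified by repeated sampling and majority vote, in the spirit of Definition 2.9.

```python
import random

# A test activity sequence for a squaring SUT, here simplified to a list of
# (stimulus, response) pairs. The deterministic oracle D accepts iff every
# response obeys o = i^2, i.e. the constraint io:[o = i^2] from Section 2.2.
def deterministic_oracle(activity_sequence):
    return all(o == i * i for i, o in activity_sequence)

# A probabilistic oracle D~ returns a value in [0, 1]; this one is a noisy
# version of D that errs with probability 0.3 (advantage eps = 0.2).
def probabilistic_oracle(activity_sequence, error_rate=0.3):
    verdict = deterministic_oracle(activity_sequence)
    return float(verdict) if random.random() > error_rate else float(not verdict)

# Repeatedly sampling D~ and taking a majority vote drives the confidence
# in the verdict arbitrarily high, as the non-negligible advantage permits.
def amplified_oracle(activity_sequence, samples=101):
    votes = sum(probabilistic_oracle(activity_sequence) for _ in range(samples))
    return votes > samples / 2

correct_run = [(2, 4), (3, 9), (-5, 25)]   # responses satisfy o = i^2
faulty_run = [(2, 4), (3, 10)]             # SUT returned 10 for input 3

assert deterministic_oracle(correct_run)
assert not deterministic_oracle(faulty_run)
print(amplified_oracle(correct_run), amplified_oracle(faulty_run))
```

The amplified verdict is random, but with 101 samples and a 0.2 advantage it agrees with the ground truth with overwhelming probability.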

[Figure 2: four panels — specified, derived, and implicit test oracles, and handling the lack of oracles — each plotting the cumulative number of publications per year from 1978 to 2012 with a fitted power-regression trend line (e.g., y = 0.2997x^1.4727 with R² = 0.8871 for the “handling the lack of oracles” panel).]

Fig. 2. Cumulative number of publications from 1978 to 2012 and research trend analysis for each type of test oracle. The x-axis represents years and the y-axis the cumulative number of publications. We use a power regression model to perform the trend analysis. The regression equation and the coefficient of determination (R²) indicate an upward future trend, a sign of a healthy research community.
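A power-model trend fit of the kind reported in Figure 2 (y = a·x^b) can be reproduced with ordinary least squares on log-transformed data. The sketch below is ours, with made-up cumulative counts purely for illustration; it is not the paper’s dataset.

```python
import math

# Hypothetical cumulative publication counts for years 1..10 (illustrative only).
years = list(range(1, 11))
counts = [1, 3, 5, 9, 14, 19, 26, 33, 41, 50]

# Fit y = a * x^b by least squares on log y = log a + b log x.
n = len(years)
lx = [math.log(x) for x in years]
ly = [math.log(y) for y in counts]
mx, my = sum(lx) / n, sum(ly) / n
b = sum((x - mx) * (y - my) for x, y in zip(lx, ly)) / sum((x - mx) ** 2 for x in lx)
log_a = my - b * mx
a = math.exp(log_a)

# Coefficient of determination R^2, computed in log space where the fit is linear.
ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(lx, ly))
ss_tot = sum((y - my) ** 2 for y in ly)
r2 = 1 - ss_res / ss_tot

print(f"y = {a:.4f} * x^{b:.4f}, R^2 = {r2:.4f}")
```

A high R² on such a fit is what the figure uses as evidence of a sustained growth trend.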

3.1 Volume of Publications

We constructed a repository of 694 publications on test oracles and its related areas from 1978 to 2012 by conducting web searches for research articles on Google Scholar and Microsoft Academic Search using the queries “software + test + oracle” and “software + test oracle”, for each year. (Footnote 2: We use + to separate the keywords in a query; a phrase, not internally separated by +, like “test oracle”, is a compound keyword, quoted when given to the search engine.) Although some of the queries generated in this fashion may be similar, different responses are obtained, with particular differences around more lowly-ranked results.

We classify work on test oracles into four categories: specified test oracles (317), derived test oracles (245), implicit test oracles (76), and no test oracle (56), which handles the lack of a test oracle.

Specified test oracles, discussed in detail in Section 4, judge all behavioural aspects of a system with respect to a given formal specification. For specified test oracles we searched for related articles using the queries “formal + specification”, “state-based specification”, “model-based languages”, “transition-based languages”, “assertion-based languages”, “algebraic specification” and “formal + conformance testing”. For all queries, we appended the keywords with “test oracle” to filter the results for test oracles.

Derived test oracles (see Section 5) involve artefacts from which a test oracle may be derived — for instance, a previous version of the system. For derived test oracles, we searched for additional articles using the queries “specification inference”, “specification mining”, “API mining”, “metamorphic testing”, “regression testing” and “program documentation”.

An implicit oracle (see Section 6) refers to the detection of “obvious” faults such as a program crash. For implicit test oracles we applied the queries “implicit oracle”, “null pointer + detection”, “null reference + detection”, “deadlock + livelock + race + detection”, “memory leaks + detection”, “crash + detection”, “performance + load testing”, “non-functional + error detection”, “fuzzing + test oracle” and “anomaly detection”.

There have also been papers researching strategies for handling the lack of an automated test oracle (see Section 7). Here, we applied the queries “human oracle”, “test minimization”, “test suite reduction” and “test data + generation + realistic + valid”.

Each of the above queries was appended with the keywords “software testing”. The results were filtered, removing articles that were found to have no relation to software testing and test oracles. Figure 2 shows the cumulative number of publications on each type of test oracle from 1978 onwards.

We analyzed the research trend on this data by applying different regression models. The trend line, shown in Figure 2, is fitted using a power model. The high values for the four coefficients of determination (R²), one for each of the four types of test oracle, confirm that our models are good fits to the trend data. The trends observed suggest a healthy growth in research volumes in these topics related to the test oracle problem in the future.

3.2 The Advent of Test Oracle Techniques

We classified the collected publications by the techniques or concepts they proposed to (partially) solve a test oracle problem; for example, Model Checking [35] and Metamorphic Testing [36] fall into the derived test oracle category, and DAISTS [69] is an algebraic specification system that addresses the specified test oracle problem. For each type of test oracle and the advent of a technique or a concept, we plotted a timeline in chronological order of publications to study research trends. Figure 3 shows the timeline starting from 1978, when the term “test oracle” was first coined. Each vertical bar presents the technique or concept used to solve the problem, labeled with the year of its first publication.

The timeline shows only the work that is explicit on the issue of test oracles. For example, the work on test generation using finite state machines (FSMs) can be traced back to as early as the 1950s. But the explicit use of finite state machines to generate test oracles can be traced back to Jard and Bochmann [103] and Howden in 1986 [98]. We record, in the timeline, the earliest available publication for a given technique or concept. We consider only published work in journals, the proceedings of conferences and workshops, or magazines. We excluded all other types of documentation, such as technical reports and manuals.

Figure 3 shows a few techniques and concepts that predate 1978. Although not explicitly on test oracles, they identify and address issues for which test oracles were later developed. For example, work on detecting concurrency issues (deadlock, livelock, and races) can be traced back to the 1960s. Since these issues require no specification, implicit test oracles can be, and have been, built that detect them on arbitrary systems. Similarly, Regression Testing detects problems in the functionality a new version of a system shares with its predecessors and is a precursor of derived test oracles.

The trend analysis suggests that proposals for new techniques and concepts for the formal specification of test oracles peaked in the 1990s, and have gradually diminished in the last decade. However, it remains an area of much research activity, as can be judged from the number of publications for each year in Figure 2. For derived test oracles, many solutions have been proposed throughout this period. Initially, these solutions were primarily theoretical, such as Partial/Pseudo-Oracles [196] and Specification Inference [194]; empirical studies, however, followed in the late 1990s.

[Figure 3: a timeline from pre-1978 to 2012 with four tracks — handling the lack of test oracles (e.g., Partition Testing, Delta Debugging, Usage Mining, Input Classification, Test Size Reduction), implicit test oracles (e.g., Fuzzing, Load Testing, Anomaly Detection, Memory Leak and Concurrency Issue Detection), derived test oracles (e.g., Pseudo/Partial Oracles, N-Versions, Regression Testing, Log File Analysis, Metamorphic Testing, Invariant Detection, Specification Mining), and specified test oracles (e.g., DAISTS, FSM, VDM, OCL, UML, Z, JML, TTCN-3, Design-by-Contract, Temporal Logic and Model Checking) — with each technique marked at the year of its first publication.]

Fig. 3. Chronological introduction of test oracle techniques and concepts.

Specification Inference [194]; empirical studies 4 SPECIFIED TEST ORACLES however followed in late 1990s. Specification is fundamental to computer sci- ence, so it is not surprising that a vast body of research has explored its use as a source of test For implicit test oracles, research into the so- oracle information. This topic could merit an lutions established before 1978 has continued, entire survey on its own right. In this section, but at a slower pace than the other types of test we provide an overview of this work. We also oracles. For handling the lack of an automated include here partial specifications of system test oracle, Partition Testing is a well-known behaviour such as assertions and models. technique that helps a human test oracle select A specification defines, if possible using tests. The trend line suggests that only recently mathematical logic, the test oracle for partic- have new techniques and concepts for tackling ular domain. Thus, a specification language this problem started to emerge, with an explicit is a notation for defining a specified test oracle focus on the human oracle cost problem. D, which judges whether the behaviour of a 9 system conforms to a formal specification. Our (usually strongest) effect the operation has on formalism, defined in Section 2, is, itself, a program state [110]. specification language for specifying test ora- A variety of model-based specification lan- cles. guages exist, including Z [172], B [111], Over the last 30 years, many methods and UML/OCL [31], VDM/VDM-SL [62], Al- formalisms for testing based on formal spec- loy [102], and the LARCH family [71], ification have been developed. They fall into which includes an algebraic specification four broad categories: model-based specifica- sub-language. 
Broadly, these languages have tion languages, state transition systems, asser- evolved toward being more concrete, closer tions and contracts, and algebraic specifica- to the implementation languages programmers tions. Model-based languages define models use to solve problems. Two reasons explain and a syntax that defines desired behavior in this phenomenon: the first is the effort to in- terms of its effect on the model. State transition crease their adoption in industry by making systems focus on modeling the reaction of a them more familiar to practitioners and the system to stimuli, referred to as “transitions” second is to establish synergies between spec- in this particular formalism. Assertions and ification and implementation that facilitate de- contracts are fragments of a specification lan- velopment as iterative refinement. For instance, guage that are interleaved with statements of Z models disparate entities, like predicates, the implementation language and checked at sets, state properties, and operations, through runtime. Algebraic specifications define equa- a single structuring mechanism, its schema tions over a program’s operations that hold construct; the B method, Z’s successor, pro- when the program is correct. vides a richer array of less abstract language constructs. 4.1 Specification Languages Borger¨ discusses how to use the abstract state machine formalism, a very general set- Specification languages define a mathemati- theoretic specification language geared toward cal model of a system’s behaviour, and are the definition of functions, to define high level equipped with a formal semantics that defines test oracles [29]. The models underlying speci- the meaning of each language construct in fication languages can be very abstract, quite terms of the model. When used for testing, far from concrete execution output. 
models do not usually fully specify the system, but seek to capture salient properties of a system so that test cases can be generated from, or checked against, them.

4.1.1 Model-Based Specification Languages
Model-based specification languages model a system as a collection of states and operations to alter these states, and are therefore also referred to as "state-based specifications" in the literature [101], [110], [182], [183]. Preconditions and postconditions constrain the system's operations. An operation's precondition imposes a necessary condition over the input states that must hold in a correct application of the operation; a postcondition defines the effect that a correct application of the operation has on those states.

For instance, it may be difficult to compute whether a model's postcondition for a function permits an observed concrete output. If this impedance mismatch can be overcome, by abstracting a system's concrete output or by concretizing a specification model's output, and if a specification's postconditions can be evaluated in finite time, they can serve as a test oracle [4]. Model-based specification languages, such as VDM, Z, and B, can express invariants, which can drive testing. Any test case that causes a program to violate an invariant has discovered an incorrect behavior; therefore, these invariants are partial test oracles.

In search of a model-based specification language accessible to domain experts, Parnas et al. proposed TOG (Test Oracles Generator), which derives oracles from program documentation [143], [146], [149]. In their method, the documentation is written in fully formal tabular expressions, in which the method signature, the external variables, and the relation between a program's start and end states are specified [105]. Thus, test oracles can be automatically generated to check the outputs against the specified states of a program. The work by Parnas et al. has been developed over a considerable period of more than two decades [48], [59], [60], [145], [150], [190], [191].

4.1.2 State Transition Systems
State transition systems often present a graphical syntax and focus on transitions between different states of the system. Here, states typically abstract sets of concrete states of the modeled system. State transition systems have been referred to as visual languages in the literature [197]. A wide variety of state transition systems exist, including Finite State Machines [112], Mealy/Moore machines [112], I/O Automata [118], Labeled Transition Systems [180], SDL [54], Harel Statecharts [81], UML state machines [28], X-Machines [95], [96], Simulink/Stateflow [179] and PROMELA [97]. Mouchawrab et al. conducted a rigorous empirical evaluation of test oracle construction techniques using state transition systems [70], [138].

An important class of state transition systems have a finite set of states, and are therefore particularly well-suited for automated reasoning about systems whose behaviour can be abstracted into states defined by a finite set of values [93]. State transition systems capture the behavior of a system under test as a set of states (3), with transitions representing stimuli that cause the system to change state. State transition systems model the output of a system they abstract either as a property of the states (the final state, in the case of Moore machines) or of the transitions traversed (as with Mealy machines).

3. Unfortunately, the term "state" has different interpretations in the context of test oracles. Often, it refers to a "snapshot" of the configuration of a system at some point during its execution; in the context of state transition systems, however, "state" typically refers to an abstraction of a set of configurations, as noted above.

Models approximate a SUT, so behavioral differences between the two are inevitable. Some divergences, however, are spurious, and falsely report testing failure. State-transition models are especially susceptible to this problem when modeling embedded systems, for which time of occurrence is critical. Recent work tolerates spurious differences in time by "steering" the model's evaluation: when the SUT and its model differ, the model is backtracked, and a steering action, like modifying a timer value or changing inputs, is applied to reduce the distance under a similarity measure [74].

Protocol conformance testing [72] and, later, model-based testing [183] motivated much of the work applying state transition systems to testing. Given a specification F expressed as a state transition system, e.g. a finite state machine, a test case can be extracted from sequences of transitions in F. The transition labels of such a sequence define an input. A test oracle can then be constructed from F as follows: if F accepts the sequence and outputs some value, then so should the system under test; if F does not accept the input, then neither should the system under test.

Challenges remain, however, as the definition of conformity comes in different flavours, depending on whether the model is deterministic or non-deterministic, and on whether the behaviour of the system under test on a given test case is observable and can be interpreted at the same level of abstraction as the model's. The resulting flavours of conformity have been captured in alternative notions, in terms of whether the system under test is isomorphic to, equivalent to, or quasi-equivalent to F. These notions of conformity were defined in the mid-1990s in the famous survey paper by Lee and Yannakakis [112], among other notable papers, including those by Bochmann et al. [26] and Tretmans [180].
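The oracle construction just described, in which a specification F both defines the accepted input sequences and predicts the outputs, can be sketched as follows. This is an illustrative Python fragment only: the turnstile machine, all names, and the stand-in SUT are hypothetical, not drawn from any of the works surveyed.

```python
# Specification F as a deterministic Mealy machine:
# (state, input) -> (next state, expected output).
SPEC = {
    ("locked", "coin"): ("unlocked", "unlock"),
    ("unlocked", "coin"): ("unlocked", "thanks"),
    ("unlocked", "push"): ("locked", "lock"),
}

def spec_run(inputs, start="locked"):
    """Outputs F produces on `inputs`, or None if F rejects the sequence."""
    state, outputs = start, []
    for symbol in inputs:
        if (state, symbol) not in SPEC:
            return None          # F does not accept this input sequence
        state, out = SPEC[(state, symbol)]
        outputs.append(out)
    return outputs

def fsm_oracle(sut_run, inputs):
    """Pass iff the SUT accepts/rejects and outputs exactly as F does."""
    return sut_run(inputs) == spec_run(inputs)

def turnstile(inputs):
    """A toy stand-in SUT that happens to conform to F."""
    return spec_run(inputs)

assert fsm_oracle(turnstile, ["coin", "push", "coin"])
```

A SUT whose accept/reject behaviour or outputs diverge from the specification on some transition sequence fails the oracle, mirroring the informal construction above.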

4.2 Assertions and Contracts
An assertion is a boolean expression that is placed at a certain point in a program to check its behaviour at runtime. When an assertion evaluates to true, the program's behaviour is regarded "as intended" at the point of the assertion, for that particular execution; when an assertion evaluates to false, an error has been found in the program for that particular execution. It is easy to see how assertions can be used as a test oracle.

The fact that assertions are embedded in an implementation language has two implications that differentiate them from specification languages. First, assertions can directly reference and define relations over program variables, reducing the impedance mismatch between specification and implementation for the properties an assertion can express and check. In this sense, assertions are a natural consequence of the evolution of specification languages toward supporting development through iterative refinement. Second, they are typically written along with the code whose runtime behavior they check, as opposed to preceding the implementation, as specification languages tend to do.

Assertions have a long pedigree dating back to Turing [181], who first identified the need to separate the tester from the developer, and suggested that they should communicate by means of assertions: the developer writing them and the tester checking them. Assertions gained significant attention as a means of capturing language semantics in the seminal work of Floyd [64] and Hoare [94], and were subsequently championed as a means of increasing code quality in the development of the contract-based programming approach, notably in the language Eiffel [136].

Widely used programming languages now routinely provide assertion constructs; for instance, C, C++, and Java provide a construct called assert, and C# provides a Debug.Assert method. Moreover, a variety of systems have been independently developed for embedding assertions into a host programming language, such as Anna [117] for Ada, and APP [156] and Nana [120] for C.

In practice, assertion approaches can check only a limited set of properties at a certain point in a program [49]. Languages based on design by contract principles extend the expressivity of assertions by providing means to check contracts between client and supplier objects, in the form of method pre- and postconditions and class invariants. Eiffel was the first language to offer design by contract [136], a language feature that has since found its way into other languages, such as Java, in the form of the Java Modeling Language (JML) [140]. Cheon and Leavens showed how to construct an assertion-based test oracle on top of JML [45]. For more on assertion-based test oracles, see Coppit and Haddox-Schatz's evaluation [49] and, later, a method proposed by Cheon [44].
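Pre- and postconditions of this kind act as a partial test oracle at runtime; a minimal illustrative sketch in Python follows. The contract decorator and my_max are hypothetical toys, standing in for real contract systems such as Eiffel or JML.

```python
def contract(pre, post):
    """Wrap a function so that violated pre-/postconditions fail fast."""
    def wrap(f):
        def checked(*args):
            assert pre(*args), "precondition violated"
            result = f(*args)
            assert post(result, *args), "postcondition violated"
            return result
        return checked
    return wrap

@contract(pre=lambda xs: len(xs) > 0,
          post=lambda r, xs: r in xs and all(r >= x for x in xs))
def my_max(xs):
    return max(xs)

assert my_max([3, 1, 2]) == 3
```

Any test execution that trips a precondition blames the caller; a tripped postcondition reveals a fault in the implementation, exactly the oracle role described above.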
Both assertions and contracts are enforced observation activities embedded in the code. Araujo et al. provide a systematic evaluation of design by contract on a large industrial system [9], and using JML in particular [8]; Briand et al. showed how to support testing by instrumenting contracts [33].

4.3 Algebraic Specification Languages
Algebraic specification languages define a software module in terms of its interface: a signature consisting of sorts and operation symbols. Equational axioms specify the required properties of the operations; their equivalence is often computed using term rewriting [15]. Structuring facilities, which group sorts and operations, allow the composition of interfaces. Typically, these languages employ first-order logic to prove properties of the specification, like the correctness of refinements. Abstract data types (ADTs), which combine data and operations over that data, are well-suited to algebraic specification.
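To make this concrete, an equational axiom over an ADT can itself be executed as a partial test oracle, by running both sides of the equation and comparing the results. A minimal Python sketch follows; the Stack class is a hypothetical implementation under test, and the axiom Pop(Push(S, I)) = S assumes an unbounded stack, so no capacity bound applies.

```python
class Stack:
    """Hypothetical (unbounded) stack implementation under test."""
    def __init__(self, items=()):
        self.items = list(items)
    def push(self, x):
        return Stack(self.items + [x])
    def pop(self):
        return Stack(self.items[:-1])
    def __eq__(self, other):
        return self.items == other.items

def axiom_holds(s, i):
    """Partial oracle: both sides of Pop(Push(S, I)) = S must agree."""
    return s.push(i).pop() == s

assert all(axiom_holds(Stack(range(n)), 99) for n in range(5))
```

Disagreement on any test input indicates a fault in the implementation or in the axiom; agreement lends some assurance of correctness.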

One of the earliest algebraic specification systems for implementing, specifying, and testing ADTs is DAISTS [69]. In this system, equational axioms generally equate a term-rewriting expression in a restricted dialect of ALGOL 60 against a function composition in the implementation language. For example, consider this axiom used in DAISTS:

    Pop2(Stack S, EltType I):
        Pop(Push(S, I)) = if Depth(S) = Limit
                          then Pop(S)
                          else S;

This axiom is taken from a specification that differentiates the accessor Top, which returns the top element of a stack without modifying the stack, from the mutator Pop, which returns a new stack lacking the previous top element. A test oracle simply executes both this axiom and its corresponding composition of implemented functions against a test suite: if they disagree, a failure has been found in the implementation or in the axiom; if they agree, we gain some assurance of their correctness.

Gaudel and her colleagues [19], [20], [72], [73] were the first to provide a general testing theory founded on algebraic specification. Their idea is that an exhaustive test suite composed only of ground terms, i.e., terms with no free variables, would be sufficient to judge program correctness. This approach faces an immediate problem: the domain of each variable in a ground term might be infinite, generating an infinite number of test cases. Test suites, however, must be finite, a practical limitation to which all forms of testing are subject. The workaround is, of course, to abandon exhaustive coverage of all bindings of values to ground terms and select a finite subset of test cases [20].

Gaudel's theory focuses on observational equivalence. Observational inequivalence is, however, equally important [210]. For this reason, Frankl and Doong extended Gaudel's theory to express inequality as well as equality [52]. They proposed a notation suitable for object-oriented programs, and developed an algebraic specification language called LOBAS and a test harness called ASTOOT. In addition to handling object-orientation, Frankl and Doong require classes to implement the testing method EQN, which ASTOOT uses to check the equivalence or inequivalence of two instances of a given class. From the vantage point of an observer, an object has observable and unobservable, or hidden, state. Typically, the observable state of an object is its public fields and method return values. EQN enhances the testability of code and enables ASTOOT to approximate the observational equivalence of two objects on a sequence of messages, or method calls. When ASTOOT checks the equivalence of an object and a specification in LOBAS, it realizes a specified test oracle.

Expanding upon ASTOOT, Chen et al. [40], [41] built TACCLE, a tool that employs a white-box heuristic to generate a relevant, finite number of test cases. Their heuristic builds a data relevance graph that connects two fields of a class if one affects the other. They use this graph to consider only the fields that can affect an observable attribute of a class when considering the (in)equivalence of two instances. Algebraic specification has been a fruitful line of research; many algebraic specification languages and tools exist, including Daistish [100], LOFT [123], CASL [11], and CASCAT [205]. These projects have been evolving toward testing a wider array of entities, from ADTs, to classes, and, most recently, components; they also differ in their degree of automation of test case generation and test harness creation. Bochmann et al. used LOTOS to realise test oracle functions from algebraic specifications [184]; most recently, Zhu also considered the use of algebraic specifications as test oracles [210].

4.3.1 Specified Test Oracle Challenges
Three challenges must be overcome to build specified test oracles.
The first is the lack of a formal specification. Indeed, the other classes of test oracles discussed in this survey all address the problem of test oracle construction in the absence of a formal specification. Formal specification models necessarily rely on abstraction, which can lead to the second problem: imprecision, models that include infeasible behavior or that do not capture all the behavior relevant to checking a specification [68]. Finally, one must contend with the problem of interpreting model output and equating it to concrete program output. Specified results are usually quite abstract, and the concrete test results of a program's executions may not be represented in a form that makes checking their equivalence to the specified result straightforward. Moreover, specified results can be partially represented or over-simplified. This is why Gaudel remarked that the existence of a formal specification does not guarantee the existence of a successful test driver [72]. Formulating concrete equivalence functions may be necessary to correctly interpret results [119]. In short, solutions to this problem of equivalence across abstraction levels depend largely on the degree of abstraction and, to a lesser extent, on the implementation of the system under test.

5 DERIVED TEST ORACLES
A derived test oracle distinguishes a system's correct from incorrect behavior based on information derived from various artefacts (e.g. documentation, system executions), from properties of the system under test, or from other versions of it. Testers resort to derived test oracles when specified test oracles are unavailable, which is often the case, since specifications rapidly fall out of date, when they exist at all. Of course, a derived test oracle might become a partial "specified test oracle", so that test oracles derived by the methods discussed in this section could migrate, over time, to become those considered the "specified test oracles" of the previous section. For example, JWalk incrementally learns algebraic properties of the class under test [170]. It allows interactive confirmation from the tester, ensuring that the human is in the "learning loop".

The following sections discuss research on deriving test oracles from development artefacts, beginning in Section 5.1 with pseudo-oracles and N-version programming, which focus on agreement among independent implementations. Section 5.2 then introduces metamorphic relations, which focus on relations that must hold among distinct executions of a single implementation. Regression testing, Section 5.3, focuses on relations that should hold across different versions of the SUT. Approaches for inferring models from system executions, including invariant inference and specification mining, are described in Section 5.4. Section 5.5 closes with a discussion of research into extracting test oracle information from textual documentation, like comments, specifications, and requirements.

5.1 Pseudo-Oracles
One of the earliest versions of a derived test oracle is the concept of a pseudo-oracle, introduced by Davis and Weyuker [50], as a means of addressing so-called non-testable programs:

    "Programs which were written in order to determine the answer in the
    first place. There would be no need to write such programs, if the
    correct answer were known." [196]

A pseudo-oracle is an alternative version of the program produced independently, e.g. by a different programming team or written in an entirely different programming language. In our formalism (Section 2), a pseudo-oracle is a test oracle D that accepts test activity sequences of the form

    f1(x) o1 f2(x) o2 : [f1 ≠ f2 ∧ o1 = o2],    (1)

where f1, f2 ∈ C, the components of the SUT (Section 2), are alternative, independently produced versions of the SUT, stimulated on the same value. We draw the reader's attention to the similarity between pseudo-oracles and algebraic specification systems (Section 4.3), like DAISTS, where the function composition expression in the implementation language and the term-rewriting expression are distinct implementations whose outputs must agree, and so form a pseudo-oracle.

A similar idea exists in fault-tolerant computing, referred to as multi- or N-version programming [13], [14], where the software is implemented in multiple ways and executed in parallel. Where results differ at run-time, a "voting" mechanism decides which output to use. In our formalism, an N-version test oracle accepts test activities of the following form:

    f1(x) o1 f2(x) o2 ··· fk(x) ok : [∀i, j ∈ [1..k], i ≠ j ⇒ fi ≠ fj ∧ arg max_{oi} m(oi) ≥ t]    (2)

In Equation 2, the outputs form a multiset, and m is the multiplicity, or number of repetitions, of an element in the multiset. The arg max operator finds the argument that maximizes a function's output, here an output with greatest multiplicity. Finally, the maximum multiplicity is compared against the threshold t. We can now define an N-version test oracle as Dnv(w, x), where w obeys Equation 2 with t bound to x. Then Dmaj(w) = Dnv(w, ⌈k/2⌉) is an N-version oracle that requires a majority of the outputs to agree, and Dpso(w) = Dnv(w, k) generalizes pseudo-oracles to agreement across k implementations. More recently, Feldt [58] investigated the possibility of automatically producing different versions using genetic programming, and McMinn [128] explored the idea of producing different software versions for testing through program transformation and the swapping of software elements with those of a similar specification.

5.2 Metamorphic Relations
For the SUT p that implements the function f, a metamorphic relation is a relation over applications of f that we expect to hold across multiple executions of p. Suppose f(x) = e^x; then e^a · e^(−a) = 1 is a metamorphic relation. Under this metamorphic relation, p(0.3) * p(-0.3) = 1 will hold if p is correct [43]. The key idea is that reasoning about the properties of f will lead us to relations that its implementation p must obey. Metamorphic testing is the process of exploiting metamorphic relations to generate partial test oracles for follow-up test cases: it checks important properties of the SUT after certain test cases are executed [36]. Although metamorphic relations are properties of the ground truth, that is, of the correct phenomenon (f in the example above) that a SUT seeks to implement, and could therefore be considered a mechanism for creating specified test oracles, we have placed them with derived test oracles because, in practice, metamorphic relations are usually manually inferred from a white-box inspection of a SUT.

Metamorphic relations differ from algebraic specifications in that a metamorphic relation relates different executions, not necessarily on the same input, of the same implementation relative to its specification, while an algebraic specification equates two distinct implementations of the specification, one written in an implementation language and the other written in a formalism free of implementation details, usually term rewriting [15]. Under the formalism of Section 2, a metamorphic relation is

    f(x1) o1 f(x2) o2 ··· f(xk) ok : [expr ∧ k ≥ 2],

where expr is a constraint, usually arithmetic, over the inputs xi and outputs oi. This definition makes clear that a metamorphic relation is a constraint on the values obtained by stimulating the single SUT f at least twice, observing the responses, and imposing a constraint on how they interrelate. In contrast, an algebraic specification is a type of pseudo-oracle, as specified in Equation 1, which stimulates two distinct implementations on the same value, requiring their outputs to be equivalent.
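The exponential example can be turned into an executable partial oracle; a minimal Python sketch follows. It is illustrative only: the tolerance is an assumption forced by floating-point arithmetic, since the relation holds exactly only for the mathematical function f.

```python
import math

def p(x):
    """The implementation under test for f(x) = e^x."""
    return math.exp(x)

def metamorphic_oracle(a, tol=1e-9):
    """Pass iff p(a) * p(-a) is approximately 1, since f(a)*f(-a) = 1."""
    return abs(p(a) * p(-a) - 1.0) <= tol

assert all(metamorphic_oracle(a) for a in (0.3, -0.3, 2.5))
```

Note that the oracle never needs the expected value of p(a) itself; it only checks a relation between two executions, which is precisely what makes metamorphic relations useful for "non-testable" programs.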

It is often thought that metamorphic relations need to concern numerical properties that can be captured by arithmetic equations, but metamorphic testing is, in fact, more general. For example, Zhou et al. [209] used metamorphic testing to test search engines such as Google and Yahoo!, where the relations considered are clearly non-numeric. Zhou et al. build metamorphic relations in terms of the consistency of search results. A motivating example they give is of searching for a paper in the ACM digital library: two attempts, the second quoted, using advanced search fail, but a general search identical to the first succeeds. Using this insight, the authors build metamorphic relations, like R_OR : A1 = (A2 ∪ A3) ⇒ |A2| ≤ |A1|, where the Ai are sets of web pages returned by queries. Metamorphic testing is also a means of testing Weyuker's "non-testable programs", introduced in the last section.

When the SUT is nondeterministic, such as a classifier whose exact output varies from run to run, defining metamorphic relations solely in terms of output equality is usually insufficient. Murphy et al. [139], [140] investigate relations other than equality, like set intersection, to relate the outputs of stochastic machine learning algorithms, such as classifiers. Guderlei and Mayer introduced statistical metamorphic testing, where the relations for test outputs are checked using statistical analysis [80], a technique later exploited to apply metamorphic testing to stochastic optimisation algorithms [203].

The biggest challenge in metamorphic testing is automating the discovery of metamorphic relations. Some of those in the literature are mathematical [36], [37], [42] or combinatorial [139], [140], [161], [203]. Work on the discovery of algebraic specifications [88], and JWalk's lazy systematic unit testing, in which the specification is lazily, and incrementally, learned through interactions between JWalk and the developer [170], might be suitable for adaptation to the discovery of metamorphic relations. For instance, the programmer's development environment might track relationships among the outputs of test cases run during development, and propose those that hold across many runs to the developer as possible metamorphic relations. Work has already begun that exploits domain knowledge to formulate metamorphic relations [38], but it is still at an early stage and not yet automated.

5.3 Regression Test Suites
Regression testing aims to detect whether the modifications made to a new version of a SUT have disrupted existing functionality [204]. It rests on the implicit assumption that the previous version can serve as an oracle for existing functionality. For corrective modifications, desired functionality remains the same, so the test oracle for version i, Di, can serve as the next version's test oracle, Di+1.
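The idea that the previous version Di can serve as the oracle for version i + 1 under corrective modification can be sketched as follows (illustrative Python; both sorting versions are hypothetical stand-ins for successive releases of a component):

```python
def sort_v1(xs):
    """Previous version: its behaviour defines the oracle Di."""
    return sorted(xs)

def sort_v2(xs):
    """New version under test (stand-in for a reimplementation)."""
    return sorted(xs)

def regression_oracle(test_inputs):
    """Pass iff the new version agrees with the old on every input."""
    return all(sort_v2(xs) == sort_v1(xs) for xs in test_inputs)

assert regression_oracle([[3, 1, 2], [], [5, 5, 0]])
```

Any disagreement flags either a regression in the new version or, for inputs inside the corrected region, intentionally changed behaviour that the oracle must be updated to accept.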
Corrective modifications may fail to correct the problem they seek to address, or may disrupt existing functionality; test oracles may be constructed for these issues by symbolically comparing the execution of the faulty version against the newer, allegedly fixed version [79]. Orstra generates assertion-based test oracles by observing the program states of the previous version while executing the regression test suite [199]. The regression test suite, now augmented with assertions, is then applied to the newer version. Similarly, spectra-based approaches use the program and value spectra obtained from the original version to detect regression faults in the newer versions [86], [200].

For perfective modifications, those that add new features to the SUT, Di must be modified to cater for newly added behaviours, i.e. Di+1 = Di ∪ ∆D. Test suite augmentation techniques specialise in identifying and generating ∆D [6], [132], [202].
However, more work is required to develop these augmentation techniques so that they augment not merely the test inputs, but also the expected outputs. In this way, test suite augmentation could be extended to augment the existing oracles as well as the test data.

Changes in the specification, which is deemed to fail to meet requirements, perhaps because the requirements have themselves changed, drive another class of modifications. These changes are generally regarded as "perfective" maintenance in the literature, but no distinction is made between perfections that add new functionality to code (without changing requirements) and those changes that arise due to changed requirements (or incorrect specifications). Our formalisation of test oracles in Section 2 forces a distinction between these two categories of perfective maintenance, since the two have profoundly different consequences for test oracles. We therefore refer to this new category of perfective maintenance as "changed requirements". Recall that, for the function f : X → Y, dom(f) = X. For changed requirements:

    ∃α · Di+1(α) ≠ Di(α),

which implies, of course, that dom(Di+1) ∩ dom(Di) ≠ ∅, and that the new test oracle cannot simply union the new behavior with the old test oracle. Instead, we have

    Di+1(α) = ∆D(α) if α ∈ dom(∆D), and Di(α) otherwise.

5.4 System Executions
A system execution trace can be exploited to derive test oracles, or to reduce the cost of a human test oracle by aligning an incorrect execution against the expected execution, as expressed in temporal logic [51]. This section discusses the two main techniques for deriving test oracles from traces: invariant detection and specification mining.

5.4.1 Invariant Detection
Program behaviours can be automatically checked against invariants. Thus, invariants can serve as test oracles to help distinguish correct from incorrect outputs. When invariants are not available for a program in advance, they can be learned from the program (semi-)automatically. A well-known technique, proposed by Ernst et al. [56] and implemented in the Daikon tool [55], is to execute a program on a collection of inputs (test cases) against a collection of potential invariants. The invariants are instantiated by binding their variables to the program's variables. Daikon then dynamically infers likely invariants: those invariants not violated during the program executions over the inputs. The inferred invariants capture program behaviours, and thus can be used to check program correctness. For example, in regression testing, invariants inferred from the previous version can be checked as to whether they still hold in the new version.

In our formalism, Daikon invariant detection can define an unsound test oracle that gathers likely invariants from the prefix of a testing activity sequence, then enforces those invariants over its suffix. Let Ij be the set of likely invariants at observation j, with I0 the initial invariants; for the test activity sequence r1 r2 ··· rn, In = {x ∈ I0 | ∀i ∈ [1..n], ri ⊨ x}, where ⊨ is logical entailment. Thus, we take an observation to define a binding of the variables in the world under which a likely invariant either holds or does not: only those likely invariants remain that no observation invalidates. In the suffix rn+1 rn+2 ··· rm, the test oracle then changes gear and accepts only those activities whose response observations obey In, i.e. ri : [ri ⊨ In], for i > n.
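This prefix/suffix scheme can be sketched in miniature (illustrative Python only; the candidate invariants and the square program are hypothetical, and a real tool such as Daikon instantiates a far richer invariant grammar over many program points):

```python
# Candidate invariants over a program's observed variable bindings.
CANDIDATES = [
    ("result >= 0", lambda obs: obs["result"] >= 0),
    ("result >= x", lambda obs: obs["result"] >= obs["x"]),
    ("result == x", lambda obs: obs["result"] == obs["x"]),
]

def square(x):
    """Program under observation; returns its variable bindings."""
    return {"x": x, "result": x * x}

# Prefix r1..rn: keep only candidates that no observation invalidates.
training = [square(x) for x in (0, 1, 2, -3)]
likely = [(name, f) for name, f in CANDIDATES
          if all(f(obs) for obs in training)]

def invariant_oracle(obs):
    """Suffix: accept an observation iff every surviving invariant holds."""
    return all(f(obs) for _, f in likely)

assert [name for name, _ in likely] == ["result >= 0", "result >= x"]
assert invariant_oracle(square(5))
```

The sketch also shows why the resulting oracle is unsound: "result == x" is falsified only because the prefix happened to include x = 2, and a weaker prefix would have let a false invariant survive into the checking phase.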
Derived test oracles can be built on both techniques to automatically check expected behaviour, similar to assertion-based specification, discussed in Section 4.2. Invariant detection can be computationally expensive, so incremental [22], [171] and lightweight static [39], [63] analyses have been brought to bear. A technical report summarises various dynamic analysis techniques [158]. Model inference [90], [187] could also be regarded as a form of invariant generation in which the invariant is expressed as a model (typically an FSM). Ratcliff et al. used Search-Based Software Engineering (SBSE) [84] to search for invariants, guided by mutation testing [154].

The accuracy of inferred invariants depends in part on the quality and completeness of the test cases; additional test cases might provide new data from which more accurate invariants can be inferred [56]. Nevertheless, inferring "perfect" invariants is almost impossible with the current state of the art, which tends to infer incorrect or irrelevant invariants frequently [152]. Wei et al. recently leveraged existing contracts in Eiffel code to infer postconditions on commands (as opposed to queries) involving quantification, or implications whose premises are conjunctions of formulae [192], [193]. Human intervention can, of course, be used to filter the resulting invariants, i.e., retaining the correct ones and discarding the rest. However, manual filtering is error-prone, and the misclassification of invariants is frequent. In a recent empirical study, Staats et al. found that half of the incorrect invariants Daikon inferred from a set of Java programs were misclassified [175]. Despite these issues, research on the dynamic inference of program invariants has exhibited strong momentum in the recent past, with a primary focus on its application to test generation [10], [142], [207].

5.4.2 Specification Mining
Specification mining, or inference, infers a formal model of program behaviour from a set of observations. In terms of our formalism, a test oracle can enforce these formal models over test activities. In her seminal work on using inference to assess test data adequacy, Weyuker connected inference and testing as inverse processes [194]: the testing process starts with a program and looks for I/O pairs that characterise every aspect of both the intended and actual behaviours, while inference starts with a set of I/O pairs and derives a program to fit the given behaviour. Weyuker defined this relation for assessing test adequacy; it can be stated informally as follows. A set of I/O pairs T is an inference-adequate test set for the program P intended to fulfil specification S iff the program IT inferred from T (using some inference procedure) is equivalent to both P and S. Any difference would imply that the inferred program is not equivalent to the actual program and, therefore, that the test set T used to infer the program P is inadequate.

This inference procedure mainly depends upon the set of I/O pairs used to infer behaviours. These pairs can be obtained from system executions either passively, e.g., by run-time monitoring, or actively, e.g., by querying the system [106]. However, equivalence checking is undecidable in general, and therefore inference is only possible for programs in a restricted class, such as those whose behaviour can be modelled by finite state machines [194]. With this restriction, equivalence can be established by experiment [89]. Nevertheless, serious practical limitations are associated with such experiments (see the survey by Lee and Yannakakis [112] for a complete discussion).

The marriage of inference and testing has produced a wealth of techniques, especially in the context of "black-box" systems, where source code and behavioural models are unavailable. Most work has applied L∗, a well-known learning algorithm, to learn a black-box system B as a finite state machine (FSM) with n states [7]. The algorithm infers an FSM by iteratively querying B and observing the corresponding outputs. A string distinguishes two FSMs when only one of the two machines ends in a final state upon consuming the string. At each iteration, an inferred model Mi with i < n states is given. Then the model is refined with the help of a string that distinguishes B and Mi to produce a new model, until the number of states reaches n.

Lee and Yannakakis [112] showed how to use L∗ for conformance testing of B against a specification S. Suppose L∗ starts by inferring a model Mi; then we compute a string that distinguishes Mi from S and refine Mi through the algorithm. If, for i = n, Mn is S, then we declare B to be correct, and otherwise faulty.

Apart from conformance testing, inference techniques have been used to guide test generation to focus on particular system behavior and to reduce the scope of analysis. For example, Li et al. applied L∗ to the integration testing of a system of black-box components [114]. Their analysis architecture derives a test oracle from a test suite by using L∗ to infer a model of the system from dynamic observation of the system's behavior; this model is then searched to find incorrect behaviors, such as deadlocks, and used to verify the system's behaviour under fuzz testing (Section 6). To find concurrency issues in asynchronous black-box systems, Groz et al. proposed an approach that extracts behavioural models from systems through active learning techniques [78] and then performs reachability analysis on the models [27] to detect issues, notably races.

Further work in this context has been compiled by Shahbaz [166], with industrial applications. Similar applications of inference can be found in system analysis [21], [78], [135], [188], [189], component interaction testing [115], [122], regression testing [200], security testing [168] and verification [53], [77], [148].

Zheng et al. [208] extract item sets from web search queries and their results, then apply association rule mining to infer rules. From these rules, they construct derived test oracles for web search engines, which had been thought to be untestable. Image segmentation delineates objects of interest in an image; implementing segmentation programs is a tedious, iterative process. Frouchni et al. successfully apply semi-supervised machine learning to create test oracles for image segmentation programs [67]. Memon et al. [133], [134], [198] introduced and developed the GUITAR tool, which has been evaluated by treating the current version of the SUT as correct, inferring the specification, and then executing the generated test inputs. Artificial Neural Networks have also been applied to learn system behaviour and detect deviations from it [163], [164].

The majority of specification mining techniques adopt Finite State Machines as the output format to capture the functional behaviour of the SUT [21], [27], [53], [77], [78], [89], [112], [114], [135], [148], [166], [168], [189], sometimes extended with temporal constraints [188] or data constraints [115], [122], which are, in turn, inferred by Daikon [56]. Büchi automata have been used to check properties against black-box systems [148]. Annotated call trees have been used to represent the program behaviour of different versions in the regression testing context [200]. GUI widgets have been directly modelled with objects and properties for testing [133], [134], [198]. Artificial Neural Nets and machine learning classifiers have been used to learn the expected behaviour of the SUT [67], [163], [164]. For dynamic and fuzzy behaviours, such as the results of web search engine queries, association rules between input (query) and output (search result strings) have been used as the format of an inferred oracle [208].

5.5 Textual Documentation
Textual documentation ranges from natural language descriptions of requirements to structured documents detailing the functionalities of APIs. These documents describe the functionalities expected of the SUT to varying degrees, and can therefore serve as a basis for generating test oracles. They are usually informal, intended for other humans, not to support formal logical or mathematical reasoning. Thus, they are often partial and ambiguous, in contrast to specification languages. Their importance for test oracle construction rests on the fact that developers are more likely to write them than formal specifications. In other words, the documentation defines the constraints that the test oracle D, as defined in Section 2, enforces over testing activities.

At first sight, it may seem impossible to derive test oracles automatically, because natural languages are inherently ambiguous and textual documentation is often imprecise and inconsistent. The use of textual documentation has often been limited to humans in practical testing applications [144]. However, some partial automation can assist the human in testing using documentation as a source of test oracle information. Two approaches have been explored. The first builds techniques to construct a formal specification out of an informal, textual artefact, such as an informal textual specification, user and developer documentation, and even source code comments. The second restricts a natural language to a semi-formal fragment amenable to automatic processing. Next, we present representative examples of each approach.

5.5.1 Converting Text into Specifications
Prowell and Poore [153] introduced a sequential enumeration method for developing a formal specification from an informal one. The method systematically enumerates all sequences from the input domain and maps the corresponding outputs to produce an arguably complete, consistent, and correct specification. However, it can suffer from an exponential explosion in the number of input/output sequences; Prowell and Poore employ abstraction techniques to control this explosion. The end result is a formal specification that can be transferred into a number of notations, e.g., state transition systems.

Restricting a natural language to a (semi-)formal subset grounds the meaning of terms in a vocabulary with minimal ambiguity. This, in turn, eases the interpretation of documents and makes the automatic derivation of test oracles possible. The researchers who have proposed specification languages based on such subsets of a natural language are motivated by the fact that model-based specification languages have not seen wide-spread adoption, and believe the reason to be the inaccessibility of their formalism and set-theoretic underpinnings to the average programmer. Schwitter introduced a computer-processable, restricted natural language called PENG [160]. It covers a strict subset of standard English with a restricted grammar and a domain-specific lexicon for content words and predefined function words. Documents written in PENG can be translated deterministically into first-order predicate logic. Schwitter et al. [30] provided guidelines for writing test scenarios in PENG that can automatically judge the correctness of program behaviours.

6 IMPLICIT TEST ORACLES
An implicit test oracle is one that relies on general, implicit knowledge to distinguish between a system's correct and incorrect behaviour. This generally true implicit knowledge includes such facts as "buffer overflows and segfaults are nearly always errors". The critical aspect of an implicit test oracle is that it requires neither domain knowledge nor a formal specification to implement, and it applies to nearly all programs. Implicit test oracles can be built on any proce-
A notable benefit of dure that detects anomalies such as abnormal this approach is that it tends to discover many termination due to a crash or an execution fail- inconsistent and missing requirements, making ure [34], [167]. This is because such anomalies the specification more complete and precise. are blatant faults; that is, no more information is required to ascertain whether the program 5.5.2 Restricting Natural Language behaved correctly or not. Under our formalism, Restrictions on a natural language reduce com- an implicit oracle defines a subset of stimulus plexities in its grammar and lexicon and allow and response relations as guaranteed failures, the expression of requirements in a concise in some context. 20
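The simplest implicit oracles can be realised directly as executable checks for such guaranteed failures. The sketch below is illustrative rather than drawn from any cited tool: the function name and verdict strings are our own, and it assumes a POSIX platform, where a negative return code from a subprocess indicates termination by a signal (e.g. a segfault).

```python
import subprocess
import sys

def implicit_oracle(cmd, test_input, timeout=5):
    """Implicit test oracle: flag only 'guaranteed failure' behaviours
    (crashes, hangs), without judging functional correctness."""
    try:
        result = subprocess.run(cmd, input=test_input,
                                capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "fail: hang"  # non-termination treated as anomalous
    if result.returncode < 0:  # POSIX: killed by a signal, e.g. SIGSEGV
        return "fail: killed by signal %d" % -result.returncode
    return "pass"  # no verdict on whether the output is *correct*

# A well-behaved run passes; a run killed by a signal fails.
ok = implicit_oracle([sys.executable, "-c", "print('ok')"], b"")
crash = implicit_oracle(
    [sys.executable, "-c",
     "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"], b"")
```

In keeping with the definition above, this check needs neither a specification nor domain knowledge: it partitions behaviours only into "guaranteed failure" and "no verdict".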

Implicit test oracles are not universal. Behaviours abnormal for one system in one context may be normal for that system in a different context, or normal for a different system. Even crashing may be considered acceptable, or even desired, behaviour, as in systems designed to find crashes.

Research on implicit oracles is evident from early work in software engineering. The very first work in this context was related to deadlock, livelock and race detection to counter system concurrency issues [24], [107], [185], [16], [169]. Similarly, research on testing non-functional attributes has garnered much attention since the advent of the object-oriented paradigm. In performance testing, system throughput metrics can highlight degradation errors [121], [124], as when a server fails to respond when a number of requests are sent simultaneously. A case study by Weyuker and Vokolos showed how a process with excessive CPU usage caused service delays and disruptions [195]. Similarly, test oracles for memory leaks can be built on a profiling technique that detects dangling references during the run of a program [12], [57], [87], [211]. For example, Xie and Aiken proposed a boolean constraint system to represent the dynamically allocated objects in a program [201]. Their system raises an alarm when an object becomes unreachable but has not yet been deallocated.

Fuzzing is an effective way to find implicit anomalies, such as crashes [137]. The main idea is to generate random, or "fuzz", inputs and feed them to the system to find anomalies. This works because the implicit specification usually holds over all inputs, unlike explicit specifications, which tend to relate subsets of inputs to outputs. If an anomaly is detected, the fuzz tester reports it along with the input that triggers it. Fuzzing is commonly used to detect security vulnerabilities, such as buffer overflows, memory leaks, unhandled exceptions and denial of service [18], [177].

Other work has focused on developing patterns to detect anomalies. For instance, Ricca and Tonella [155] considered a subset of the anomalies that Web applications can harbor, such as navigation problems and hyperlink inconsistencies. In their empirical study, 60% of the Web applications exhibited anomalies and execution failures.

7 THE HUMAN ORACLE PROBLEM

The above sections give solutions to the test oracle problem when some artefact exists that can serve as the foundation for either a full or partial test oracle. In many cases, however, no such artefact exists, so a human tester must verify whether software behaviour is correct given some stimuli. Despite the lack of an automated test oracle, software engineering research can still play a key role: finding ways to reduce the effort that the human tester has to expend in directly creating, or in being, the test oracle. This effort is referred to as the Human Oracle Cost [126]. Research in this area aims to reduce the cost of human involvement along two dimensions: 1) writing test oracles and 2) evaluating test outcomes.

Concerning the first dimension, the work of Staats et al. is representative. They seek to reduce the human oracle cost by guiding human testers to those parts of the code they need to focus on when writing test oracles [173]. This reduces the cost of test oracle construction, rather than the cost of human involvement in testing in the absence of an automated test oracle. Additional recent work on test oracle construction includes Dodona, a tool that suggests oracle data to a human, who then decides whether to use it to define a test oracle realized as a Java unit test [116]. Dodona infers relations among program variables during execution, using network centrality analysis and data flow.

Research that seeks to reduce the human oracle cost broadly focuses on finding a quantitative reduction in the amount of work the tester has to do for the same amount of test coverage, or a qualitative reduction in the work needed to understand and evaluate test cases.
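The quantitative notion can be made concrete with a small sketch. Assuming we know which test goals (e.g. branches) each test case covers, a greedy pass, in the spirit of classical test suite reduction, keeps a test case only when it adds new coverage, so the human oracle has fewer outputs to check for the same coverage. The suite and goal names below are illustrative, not taken from any cited study.

```python
def greedy_reduce(suite):
    """Keep only test cases covering at least one goal not already covered
    by previously kept cases. `suite` maps test-case name -> set of goals.
    Returns the kept names and the set of goals they cover."""
    kept, covered = [], set()
    # Greedy heuristic: consider test cases with the most goals first
    # (name as a tie-break, for determinism).
    for name, goals in sorted(suite.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        if goals - covered:          # contributes something new?
            kept.append(name)
            covered |= goals
    return kept, covered

suite = {
    "t1": {"b1", "b2", "b3"},
    "t2": {"b2"},             # adds nothing beyond t1: dropped
    "t3": {"b3", "b4"},       # adds b4: kept
}
kept, covered = greedy_reduce(suite)
```

The human now inspects two outcomes instead of three for the same goal coverage; the classical minimisation algorithms cited in Section 7.1.1 refine this greedy idea.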

7.1 Quantitative Human Oracle Cost

Test suites can be unnecessarily large, covering few test goals in each individual test case. Additionally, the test cases themselves may be unnecessarily long, for example containing large numbers of method calls, many of which do not contribute to the overall test case. The goal of quantitative human oracle cost reduction is to reduce test suite and test case size so as to maximise the benefit of each test case and each component of that test case. This consequently reduces the amount of manual checking effort required of a human tester performing the role of a test oracle. Cast in terms of our formalism, quantitative reduction aims to partition the set of test activity sequences so that the human need only consider representative sequences, while test case reduction aims to shorten test activity sequences.

7.1.1 Test Suite Reduction

Traditionally, test suite reduction has been applied as a post-processing step to an existing test suite, e.g., in the work of Harrold et al. [85], Offutt et al. [141] and Rothermel et al. [157]. Recent work in the search-based testing literature has sought to combine test input generation and test suite reduction into one phase, to produce smaller test suites.

Harman et al. proposed a technique for generating test cases that penetrate the deepest levels of the control dependence graph of the program, in order to create test cases that exercise as many elements of the program as possible [82]. Ferrer et al. [61] attack a multi-objective version of the problem, in which they sought to simultaneously maximize branch coverage and minimize test suite size; their focus was not this problem per se, but its use to compare a number of multi-objective optimisation algorithms, including the well-known Non-dominated Sorting Genetic Algorithm II (NSGA-II), Strength Pareto EA 2 (SPEA2) and MOCell. On a series of randomly-generated programs and small benchmarks, they found MOCell performed best.

Taylor et al. [178] use an inferred model as a semantic test oracle to shrink a test suite. Fraser and Arcuri [65] generate test suites for Java using their EvoSuite tool. By generating the entire suite at once, they are able to simultaneously maximize coverage and minimize test suite size, thereby aiding human oracles and alleviating the human oracle cost problem.

7.1.2 Test Case Reduction

When using randomised algorithms to generate test cases for object-oriented systems, individual test cases can very quickly generate very long traces, consisting of a large number of method calls that do not actually contribute to a specific test goal (e.g. the coverage of a particular branch). Such method calls unnecessarily increase test oracle cost, so Leitner et al. remove them [113] using Zeller's and Hildebrandt's Delta Debugging [206]. JWalk simplifies test sequences by removing side-effect free functions from them, thereby reducing test oracle costs where the human is the test oracle [170]. Quick tests seek to efficiently spend a small test budget by building test suites whose execution is fast enough to be run after each compilation [76]. These quick tests must be likely to trigger bugs and therefore generate short traces, which, as a result, are easier for humans to comprehend.

7.2 Qualitative Human Oracle Cost

Human oracle costs may also be minimised from a qualitative perspective: that is, the extent to which test cases, and more generally testing activities, may be easily understood and processed by a human. The input profile of a SUT is the distribution of inputs it actually processes when running in its operational environment. Learning an input profile requires domain knowledge. If such domain knowledge is not built into the test data generation process, machine-generated test data tend to be drawn from a different distribution over the SUT's inputs than its input profile. While this may be beneficial for trapping certain types of faults, the utility of the approach decreases when test oracle costs are taken into account, since the tester must invest time comprehending the scenario represented by test data in order to correctly evaluate the corresponding program output. Arbitrary inputs are much harder to understand than recognisable pieces of data, thus adding time to the checking process.

All approaches to qualitatively alleviating the human oracle cost problem incorporate human knowledge to improve the understandability of test cases. The three approaches we cover are 1) augmenting test suites designed by the developers; 2) computing "realistic" inputs from web pages, web services and natural language; and 3) mining usage patterns and replicating them in the test cases.

In order to improve the readability of automatically-generated test cases, McMinn et al. propose the incorporation of human knowledge into the test data generation process [126]. With search-based approaches, they proposed injecting this knowledge by "seeding" the algorithm with test cases that may have originated from a human source, such as a "sanity check" performed by the programmer, an already existing, partial test suite, or input–output examples generated by programming paradigms that involve the developer in computation, like prorogued programming [2].

The generation of string test data is particularly problematic for automatic test data generators, which tend to generate nonsensical strings. These nonsensical strings are, of course, a form of fuzz testing (Section 6), and good for exploring uncommon, shallow code paths and finding corner cases, but they are unlikely to exercise functionality deeper in a program's control flow. This is because string comparisons in control expressions are usually stronger than numerical comparisons, making one of a control point's branches much less likely to be traversed via uniform fuzzing. We see here the seminal trade-off between breadth-first and depth-first search in the choice between fuzz testing with nonsensical inputs and testing with realistic inputs.

Bozkurt and Harman introduced the idea of mining web services for realistic test inputs, using the outputs of known and trusted test services as more realistic inputs to the service under test [32]. The idea is that realistic test cases are more likely to reveal faults that developers care about and to yield test cases that are more readily understood. McMinn et al. also mine the web for realistic test cases: they proposed mining strings from the web to assist in the test generation process [130]. Since web page content is generally the result of human effort, the strings contained therein tend to be real words or phrases with high degrees of semantic and domain-relevant context, and can thus be used as sources of realistic test data.

Afshan et al. [1] combine a natural language model and metaheuristics, strategies that guide a search process [25], to help generate readable strings. The language model scores how likely a string is to belong to a language, based on its character combinations. By incorporating this probability score into a fitness function, a metaheuristic search can not only cover test goals but also generate string inputs that are more comprehensible than the arbitrary strings generated by the previous state of the art. Over a number of case studies, Afshan et al. found that human oracles evaluated their test strings more accurately and more quickly.

Fraser and Zeller [66] improve the familiarity of test cases by mining the software under test for common usage patterns of APIs. They then seek to replicate these patterns in generated test cases. In this way, the scenarios generated are more likely to be realistic and to represent actual usages of the software under test.

7.3 Crowdsourcing the Test Oracle

A recent approach to handling the lack of a test oracle is to outsource the problem to an online service through which large numbers of people can provide answers, i.e., through crowdsourcing. Pastore et al. [147] demonstrated the feasibility of the approach, but noted problems in presenting the test problem to the crowd such that it could be easily understood, and the need to provide sufficient code documentation so that the crowd could distinguish correct outputs from incorrect ones. In these experiments, crowdsourcing was performed by submitting tasks to a generic crowdsourcing platform, Amazon's Mechanical Turk (http://www.mturk.com). However, some dedicated crowdsourcing services now exist for the testing of mobile applications. They specifically address the problem of the exploding number of devices on which a mobile application may run, which the developer or tester may not own, but which may be possessed by the crowd at large. Examples of these services include Mob4Hire (http://www.mob4hire.com), MobTest (http://www.mobtest.com) and uTest (http://www.utest.com/).

8 FUTURE DIRECTIONS AND CONCLUSION

This paper has provided a comprehensive survey of test oracles, covering specified, derived and implicit oracles, and techniques that cater for the absence of test oracles. The paper has also analyzed publication trends in the test oracle domain. It has necessarily focused on the traditional approaches to the test oracle problem. Much work on test oracles remains to be done. In addition to research deepening and interconnecting these approaches, the test oracle problem is open to new research directions. We close with a discussion of two of these that we find noteworthy and promising: test oracle reuse and test oracle metrics.

As this survey has shown, test oracles are difficult to construct. Oracle reuse is therefore an important problem that merits attention. Two promising approaches to oracle reuse are generalizations of reliable reset and the sharing of oracular data across software product lines (SPLs). Generalizing reliable reset to arbitrary states allows the interconnection of different versions of a program, so we can build test oracles based on older versions of a program, using generalized reliable reset to ignore or handle new inputs and functionality. SPLs are sets of related versions of a system [47]. A product line can be thought of as a tree of related software products in which branches contain new alternative versions of the system, each of which shares some core functionality enjoyed by a base version. Research on test oracles should seek to leverage these SPL trees to define trees of test oracles that share oracular data where possible.

Work has already begun on using test oracles as a measure of how well the program has been tested (a kind of test oracle coverage) [104], [176], [186], and on measures of oracles themselves, such as assessing the quality of assertions [159]. More work is needed. "Oracle metrics" is a challenge to, and an opportunity for, the "software metrics" community. In a world in which test oracles become more prevalent, it will be important for testers to be able to assess the features offered by alternative test oracles.

A repository of papers on test oracles accompanies this paper at http://crestweb.cs.ucl.ac.uk/resources/oracle_repository.

9 ACKNOWLEDGEMENTS

We would like to thank Bob Binder for helpful information and discussions when we began work on this paper. We would also like to thank all who attended the CREST Open Workshop on the Test Oracle Problem (21–22 May 2012) at University College London and gave feedback on an early presentation of this work. We are further indebted to the many authors cited in this survey who responded to our emails and provided useful comments on an earlier draft of the paper.

REFERENCES

[1] Sheeva Afshan, Phil McMinn, and Mark Stevenson. Evolving readable string test inputs using a natural language model to reduce human oracle cost. In International Conference on Software Testing, Verification and Validation (ICST 2013). IEEE, March 2013.
[2] Mehrdad Afshari, Earl T. Barr, and Zhendong Su. Liberating the programmer with prorogued programming. In Proceedings of the ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, Onward! '12, pages 11–26, New York, NY, USA, 2012. ACM. Track at OOPSLA/SPLASH '12.
[3] Wasif Afzal, Richard Torkar, and Robert Feldt. A systematic review of search-based testing for non-functional system properties. Information and Software Technology, 51(6):957–976, 2009.
[4] Bernhard K. Aichernig. Automated black-box testing with abstract VDM oracles. In SAFECOMP, pages 250–259. Springer-Verlag, 1999.
[5] Shaukat Ali, Lionel C. Briand, Hadi Hemmati, and Rajwinder Kaur Panesar-Walawege. A systematic review of the application and empirical investigation of search-based test-case generation. IEEE Transactions on Software Engineering, pages 742–762, 2010.
[6] Nadia Alshahwan and Mark Harman. Automated session data repair for web application regression testing. In Proceedings of the 2008 International Conference on Software Testing, Verification, and Validation, pages 298–307. IEEE Computer Society, 2008.
[7] Dana Angluin. Learning regular sets from queries and counterexamples. Inf. Comput., 75(2):87–106, 1987.
[8] W. Araujo, L. C. Briand, and Y. Labiche. Enabling the runtime assertion checking of concurrent contracts for the Java Modeling Language. In Software Engineering (ICSE), 2011 33rd International Conference on, pages 786–795, 2011.
[9] W. Araujo, L. C. Briand, and Y. Labiche. On the effectiveness of contracts as test oracles in the detection and diagnosis of race conditions and deadlocks in concurrent object-oriented software. In Empirical Software Engineering and Measurement (ESEM), 2011 International Symposium on, pages 10–19, 2011.
[10] Shay Artzi, Michael D. Ernst, Adam Kieżun, Carlos Pacheco, and Jeff H. Perkins. Finding the needles in the haystack: Generating legal test inputs for object-oriented programs. In 1st Workshop on Model-Based Testing and Object-Oriented Systems (M-TOOS), Portland, OR, October 23, 2006.
[11] Egidio Astesiano, Michel Bidoit, Hélène Kirchner, Bernd Krieg-Brückner, Peter D. Mosses, Donald Sannella, and Andrzej Tarlecki. CASL: the Common Algebraic Specification Language. Theor. Comput. Sci., 286(2):153–196, 2002.
[12] Todd M. Austin, Scott E. Breach, and Gurindar S. Sohi. Efficient detection of all pointer and array access errors. In PLDI, pages 290–301. ACM, 1994.
[13] A. Avizienis. The N-version approach to fault-tolerant software. IEEE Transactions on Software Engineering, 11:1491–1501, 1985.
[14] A. Avizienis and L. Chen. On the implementation of N-version programming for software fault-tolerance during execution. In Proceedings of the First International Computer Software and Application Conference (COMPSAC '77), pages 149–155, 1977.
[15] Franz Baader and Tobias Nipkow. Term Rewriting and All That. Cambridge University Press, New York, NY, USA, 1998.
[16] A. F. Babich. Proving total correctness of parallel programs. IEEE Trans. Softw. Eng., 5(6):558–574, November 1979.
[17] Luciano Baresi and Michal Young. Test oracles. Technical Report CIS-TR-01-02, University of Oregon, Dept. of Computer and Information Science, August 2001. http://www.cs.uoregon.edu/~michal/pubs/oracles.html.
[18] Sofia Bekrar, Chaouki Bekrar, Roland Groz, and Laurent Mounier. Finding software vulnerabilities by smart fuzzing. In ICST, pages 427–430, 2011.
[19] Gilles Bernot. Testing against formal specifications: a theoretical view. In Proceedings of the International Joint Conference on Theory and Practice of Software Development on Advances in Distributed Computing (ADC) and Colloquium on Combining Paradigms for Software Development (CCPSD): Vol. 2, TAPSOFT '91, pages 99–119, New York, NY, USA, 1991. Springer-Verlag New York, Inc.
[20] Gilles Bernot, Marie Claude Gaudel, and Bruno Marre. Software testing based on formal specifications: a theory and a tool. Softw. Eng. J., 6(6):387–405, November 1991.
[21] Antonia Bertolino, Paola Inverardi, Patrizio Pelliccione, and Massimo Tivoli. Automatic synthesis of behavior protocols for composable web-services. In Proceedings of ESEC/SIGSOFT FSE, ESEC/FSE 2009, pages 141–150, 2009.
[22] D. Beyer, T. Henzinger, R. Jhala, and R. Majumdar. Checking memory safety with Blast. In M. Cerioli, editor, Fundamental Approaches to Software Engineering, 8th International Conference, FASE 2005, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2005, Edinburgh, UK, April 4-8, 2005, Proceedings, volume 3442 of Lecture Notes in Computer Science, pages 2–18. Springer, 2005.
[23] R. Binder. Testing Object-Oriented Systems: Models, Patterns, and Tools. Addison-Wesley, 2000.
[24] A. Blikle. Proving programs by sets of computations. In Mathematical Foundations of Computer Science, pages 333–358. Springer, 1975.
[25] Christian Blum and Andrea Roli. Metaheuristics in combinatorial optimization: Overview and conceptual comparison. ACM Computing Surveys (CSUR), 35(3):268–308, 2003.
[26] Gregor V. Bochmann and Alexandre Petrenko. Protocol testing: review of methods and relevance for software testing. In Proceedings of the 1994 ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA '94, pages 109–124. ACM, 1994.
[27] G. V. Bochmann. Finite state description of communication protocols. Computer Networks, 2(4):361–372, 1978.
[28] E. Börger, A. Cavarra, and E. Riccobene. Modeling the dynamics of UML state machines. In Abstract State Machines - Theory and Applications, pages 167–186. Springer, 2000.
[29] Egon Börger. High level system design and analysis using abstract state machines. In Proceedings of the International Workshop on Current Trends in Applied Formal Methods (FM-Trends 98), pages 1–43, London, UK, 1999. Springer-Verlag.
[30] Kathrin Böttger, Rolf Schwitter, Diego Mollá, and Debbie Richards. Towards reconciling use cases via controlled language and graphical models. In INAP, pages 115–128, Berlin, Heidelberg, 2003. Springer-Verlag.
[31] F. Bouquet, C. Grandpierre, B. Legeard, F. Peureux, N. Vacelet, and M. Utting. A subset of precise UML for model-based testing. In Proceedings of the 3rd International Workshop on Advances in Model-Based Testing, A-MOST '07, pages 95–104, New York, NY, USA, 2007. ACM.
[32] Mustafa Bozkurt and Mark Harman. Automatically generating realistic test input from web services. In IEEE 6th International Symposium on Service Oriented System Engineering (SOSE), pages 13–24, 2011.
[33] L. C. Briand, Y. Labiche, and H. Sun. Investigating the use of analysis contracts to improve the testability of object-oriented code. Softw. Pract. Exper., 33(7):637–672, June 2003.
[34] Cristian Cadar, Vijay Ganesh, Peter M. Pawlowski, David L. Dill, and Dawson R. Engler. EXE: Automatically generating inputs of death. ACM Trans. Inf. Syst. Secur., 12(2):10:1–10:38, December 2008.
[35] J. Callahan, F. Schneider, S. Easterbrook, et al. Automated software testing using model-checking. In Proceedings 1996 SPIN Workshop, volume 353. Citeseer, 1996.
[36] F. T. Chan, T. Y. Chen, S. C. Cheung, M. F. Lau, and S. M. Yiu. Application of metamorphic testing in numerical analysis. In Proceedings of the IASTED International Conference on Software Engineering, pages 191–197, 1998.
[37] W. K. Chan, S. C. Cheung, and Karl R. P. H. Leung. A metamorphic testing approach for online testing of service-oriented software applications, chapter 7, pages 2894–2914. IGI Global, 2009.
[38] W. K. Chan, S. C. Cheung, and K. R. P. H. Leung. Towards a metamorphic testing methodology for service-oriented software applications. In QSIC, pages 470–476, September 2005.
[39] F. Chen, N. Tillmann, and W. Schulte. Discovering specifications. Technical Report MSR-TR-2005-146, Microsoft Research, October 2005.
[40] Huo Yan Chen, T. H. Tse, F. T. Chan, and T. Y. Chen. In black and white: an integrated approach to class-level testing of object-oriented programs. ACM Trans. Softw. Eng. Methodol., 7:250–295, July 1998.
[41] Huo Yan Chen, T. H. Tse, and T. Y. Chen. TACCLE: a methodology for object-oriented software testing at the class and cluster levels. ACM Trans. Softw. Eng. Methodol., 10(1):56–109, January 2001.
[42] T. Y. Chen, F.-C. Kuo, T. H. Tse, and Zhi Quan Zhou. Metamorphic testing and beyond. In Proceedings of the International Workshop on Software Technology and Engineering Practice (STEP 2003), pages 94–100, September 2004.
[43] Tsong Chen, Dehao Huang, Haito Huang, Tsun-Him Tse, Zong Yang, and Zhi Zhou. Metamorphic testing and its applications. In Proceedings of the 8th International Symposium on Future Software Technology, ISFST 2004, pages 310–319, 2004.
[44] Yoonsik Cheon. Abstraction in assertion-based test oracles. In Proceedings of the Seventh International Conference on Quality Software, pages 410–414, Washington, DC, USA, 2007. IEEE Computer Society.
[45] Yoonsik Cheon and Gary T. Leavens. A simple and practical approach to unit testing: The JML and JUnit way. In Proceedings of the 16th European Conference on Object-Oriented Programming, ECOOP '02, pages 231–255, London, UK, 2002. Springer-Verlag.
[46] L. A. Clarke. A system to generate test data and symbolically execute programs. Software Engineering, IEEE Transactions on, SE-2(3):215–222, September 1976.
[47] Paul C. Clements. Managing variability for software product lines: Working with variability mechanisms. In 10th International Conference on Software Product Lines (SPLC 2006), pages 207–208, Baltimore, Maryland, USA, 2006. IEEE Computer Society.
[48] Markus Clermont and David Parnas. Using information about functions in selecting test cases. In Proceedings of the 1st International Workshop on Advances in Model-Based Testing, A-MOST '05, pages 1–7, New York, NY, USA, 2005. ACM.
[49] David Coppit and Jennifer M. Haddox-Schatz. On the use of specification-based assertions as test oracles. In Proceedings of the 29th Annual IEEE/NASA Software Engineering Workshop, pages 305–314, Washington, DC, USA, 2005. IEEE Computer Society.
[50] M. Davies and E. Weyuker. Pseudo-oracles for non-testable programs. In Proceedings of the ACM '81 Conference, pages 254–257, 1981.
[51] Laura K. Dillon. Automated support for testing and debugging of real-time programs using oracles. SIGSOFT Softw. Eng. Notes, 25(1):45–46, January 2000.
[52] Roong-Ko Doong and Phyllis G. Frankl. The ASTOOT approach to testing object-oriented programs. ACM Trans. Softw. Eng. Methodol., 3:101–130, April 1994.
[53] Edith Elkind, Blaise Genest, Doron Peled, and Hongyang Qu. Grey-box checking. In FORTE, pages 420–435, 2006.
[54] J. Ellsberger, D. Hogrefe, and A. Sarma. SDL: Formal Object-Oriented Language for Communicating Systems. Prentice Hall, 1997.
[55] M. D. Ernst, J. H. Perkins, P. J. Guo, S. McCamant, C. Pacheco, M. S. Tschantz, and C. Xiao. The Daikon system for dynamic detection of likely invariants. Science of Computer Programming, 69(1):35–45, 2007.
[56] Michael D. Ernst, Jake Cockrell, William G. Griswold, and David Notkin. Dynamically discovering likely program invariants to support program evolution. IEEE Trans. Software Eng., 27(2):99–123, 2001.
[57] R. A. Eyre-Todd. The detection of dangling references in C++ programs. ACM Letters on Programming Languages and Systems (LOPLAS), 2(1-4):127–134, 1993.
[58] R. Feldt. Generating diverse software versions with genetic programming: an experimental study. Software, IEE Proceedings, 145, December 1998.
[59] Xin Feng, David Lorge Parnas, T. H. Tse, and Tony O'Callaghan. A comparison of tabular expression-based testing strategies. IEEE Trans. Softw. Eng., 37(5):616–634, September 2011.
[60] Xin Feng, David Lorge Parnas, and T. H. Tse. Tabular expression-based testing strategies: A comparison. Testing: Academic and Industrial Conference Practice and Research Techniques - MUTATION, 0:134, 2007.
[61] J. Ferrer, F. Chicano, and E. Alba. Evolutionary algorithms for the multi-objective test data generation problem. Software: Practice and Experience, 42(11):1331–1362, 2011.
[62] John S. Fitzgerald and Peter Gorm Larsen. Modelling Systems: Practical Tools and Techniques in Software Development (2nd ed.). Cambridge University Press, 2009.
[63] C. Flanagan and K. R. M. Leino. Houdini, an annotation assistant for ESC/Java. Lecture Notes in Computer Science, 2021:500–517, 2001.
[64] Robert W. Floyd. Assigning meanings to programs. In J. T. Schwartz, editor, Mathematical Aspects of Computer Science, volume 19 of Symposia in Applied Mathematics, pages 19–32. American Mathematical Society, Providence, RI, 1967.
[65] Gordon Fraser and Andrea Arcuri. Whole test suite generation. IEEE Transactions on Software Engineering, 39(2):276–291, 2013.
[66] Gordon Fraser and Andreas Zeller. Exploiting common object usage in test case generation. In Proceedings of the 2011 Fourth IEEE International Conference on Software Testing, Verification and Validation, ICST '11, pages 80–89. IEEE Computer Society, 2011.
[67] Kambiz Frounchi, Lionel C. Briand, Leo Grady, Yvan Labiche, and Rajesh Subramanyan. Automating image segmentation verification and validation by learning test oracles. Information and Software Technology, 53(12):1337–1348, 2011.
[68] Pascale Le Gall and Agnès Arnould. Formal specifications and test: Correctness and oracle. In Magne Haveraaen, Olaf Owe, and Ole-Johan Dahl, editors, Recent Trends in Data Type Specification, volume 1130 of Lecture Notes in Computer Science, pages 342–358. Springer Berlin Heidelberg, 1996.
[69] J. Gannon, P. McMullin, and R. Hamlet. Data abstraction, implementation, specification, and testing. ACM Transactions on Programming Languages and Systems (TOPLAS), 3(3):211–223, 1981.
[70] Angelo Gargantini and Elvinia Riccobene. ASM-based testing: Coverage criteria and automatic test sequence generation. Journal of Universal Computer Science, 7(11):1050–1067, November 2001.
[71] Stephen J. Garland, John V. Guttag, and James J. Horning. An overview of Larch. In Functional Programming, Concurrency, Simulation and Automated Reasoning, pages 329–348, 1993.
[72] Marie-Claude Gaudel. Testing from formal specifications, a generic approach. In Proceedings of the 6th Ada-Europe International Conference Leuven on Reliable Software Technologies, Ada-Europe '01, pages 35–48, London, UK, 2001. Springer-Verlag.
[73] Marie-Claude Gaudel and Perry R. James. Testing algebraic data types and processes: A unifying theory. Formal Asp. Comput., 10(5-6):436–451, 1998.
[74] Gregory Gay, Sanjai Rayadurgam, and Mats Heimdahl. Improving the accuracy of oracle verdicts through automated model steering. In Automated Software Engineering (ASE 2014). ACM Press, 2014.
[75] Patrice Godefroid, Nils Klarlund, and Koushik Sen. DART: directed automated random testing. In PLDI, pages 213–223. ACM, 2005.
[76] Alex Groce, Mohammad Amin Alipour, Chaoqiang Zhang, Yang Chen, and John Regehr. Cause reduction for quick testing. In ICST, pages 243–252, 2014.
[77] Alex Groce, Doron Peled, and Mihalis Yannakakis. AMC: An adaptive model checker. In CAV, pages 521–525, 2002.
[78] Roland Groz, Keqin Li, Alexandre Petrenko, and Muzammil Shahbaz. Modular system verification by inference, testing and reachability analysis. In TestCom/FATES, pages 216–233, 2008.
[79] Zhongxian Gu, Earl T. Barr, David J. Hamilton, and Zhendong Su. Has the bug really been fixed? In Proceedings of the 2010 International Conference on Software Engineering (ICSE '10). IEEE Computer Society, 2010.
[80] R. Guderlei and J. Mayer. Statistical metamorphic testing: testing programs with random output by means of statistical hypothesis tests and metamorphic testing. In QSIC, pages 404–409, October 2007.
[81] D. Harel. Statecharts: A visual formalism for complex systems. Science of Computer Programming, 8(3):231–274, 1987.
[82] Mark Harman, Sung Gon Kim, Kiran Lakhotia, Phil McMinn, and Shin Yoo. Optimizing for the number of tests generated in search based test data generation with an application to the oracle cost problem. In International Workshop on Search-Based Software Testing (SBST 2010), pages 182–191. IEEE, 6 April 2010.
[83] Mark Harman, Afshin Mansouri, and Yuanyuan Zhang. Search based software engineering: A comprehensive analysis and review of trends, techniques and applications. Technical Report TR-09-03, Department of Computer Science, King's College London, April 2009.
[84] Mark Harman, Phil McMinn, Jerffeson Teixeira de Souza, and Shin Yoo. Search based software engineering: Techniques, taxonomy, tutorial. In Bertrand Meyer and Martin Nordio, editors, Empirical Software Engineering and Verification: LASER 2009-2010, pages 1–59. Springer, 2012. LNCS 7007.
[85] M. Jean Harrold, Rajiv Gupta, and Mary Lou Soffa. A methodology for controlling the size of a test suite. ACM Trans. Softw. Eng. Methodol., 2(3):270–285, July 1993.

[86] Mary Jean Harrold, Gregg Rothermel, Kent Sayre, cations Conference, 2006. COMPSAC ’06. 30th Annual Rui Wu, and Liu Yi. An empirical investigation International, volume 1, pages 411–420, 2006. of the relationship between spectra differences and [105] Ying Jin and David Lorge Parnas. Defining the regression faults. Software Testing, Verification, and meaning of tabular mathematical expressions. Sci. Reliability, 10(3):171–194, 2000. Comput. Program., 75(11):980–1000, November 2010. [87] David L. Heine and Monica S. Lam. A practical flow- [106] M.J. Kearns and U.V. Vazirani. An Introduction to sensitive and context-sensitive C and C++ memory Computational Learning Theory. MIT Press, 1994. leak detector. In PLDI, pages 168–181. ACM, 2003. [107] R.M. Keller. Formal verification of parallel programs. [88] J. Henkel and A. Diwan. Discovering algebraic Communications of the ACM, 19(7):371–384, 1976. specifications from Java classes. Lecture Notes in [108] James C. King. Symbolic execution and program Computer Science, 2743:431–456, 2003. testing. Communications of the ACM, 19(7):385–394, [89] F.C. Hennie. Finite-state models for logical machines. July 1976. Wiley, 1968. [109] Kiran Lakhotia, Phil McMinn, and Mark Harman. [90] MarijnJ.H. Heule and Sicco Verwer. Software model An empirical investigation into branch coverage for synthesis using satisfiability solvers. Empirical Soft- C programs using CUTE and AUSTIN. Journal of ware Engineering, pages 1–32, 2012. Systems and Software, 83(12):2379–2391, 2010. [91] R.M. Hierons. Oracles for distributed testing. Soft- [110] Axel van Lamsweerde. Formal specification: a ware Engineering, IEEE Transactions on, 38(3):629–641, roadmap. In Proceedings of the Conference on The 2012. Future of Software Engineering, ICSE ’00, pages 147– [92] Robert M. Hierons. Verdict functions in testing with 159, New York, NY, USA, 2000. ACM. a fault domain or test hypotheses. ACM Trans. Softw. [111] K. Lano and H. Haughton. 
Specification in B: An Eng. Methodol., 18(4):14:1–14:19, July 2009. Introduction Using the B Toolkit. Imperial College [93] Robert M. Hierons, Kirill Bogdanov, Jonathan P. Press, 1996. Bowen, Rance Cleaveland, John Derrick, Jeremy [112] D. Lee and M. Yannakakis. Principles and methods Dick, Marian Gheorghe, Mark Harman, Kalpesh of testing finite state machines—a survey. Proceedings Kapoor, Paul Krause, Gerald Luttgen,¨ Anthony J. H. of the IEEE, 84(8):1090–1123, aug 1996. Simons, , Martin R. Woodward, and [113] A. Leitner, M. Oriol, A Zeller, I. Ciupa, and B. Meyer. Hussein Zedan. Using formal specifications to Efficient unit test case minimization. In Automated support testing. ACM Comput. Surv., 41:9:1–9:76, Software Engineering (ASE 2007), pages 417–420, At- February 2009. lanta, Georgia, USA, 2007. ACM Press. [94] Charles Anthony Richard Hoare. An Axiomatic [114] Keqin Li, Roland Groz, and Muzammil Shahbaz. Basis of Computer Programming. Communications Integration testing of components guided by incre- of the ACM, 12:576–580, 1969. mental state machine learning. In TAIC PART, pages [95] M. Holcombe. X-machines as a basis for dynamic 59–70, 2006. system specification. Software Engineering Journal, [115] D. Lorenzoli, L. Mariani, and M. Pezze.` Automatic 3(2):69–76, 1988. generation of software behavioral models. In pro- [96] Mike Holcombe and Florentin Ipate. Correct sys- ceedings of the 30th International Conference on Software tems: building a business process solution. Software Engineering (ICSE), 2008. Testing Verification and Reliability, 9(1):76–77, 1999. [116] Pablo Loyola, Matthew Staats, In-Young Ko, and [97] Gerard J. Holzmann. The model checker SPIN. IEEE Gregg Rothermel. Dodona: Automated oracle data Trans. Softw. Eng., 23(5):279–295, May 1997. set selection. In International Symposium on Software [98] W E Howden. A functional approach to pro- Testing and Analysis 2014, ISSTA’04, 2014. gram testing and analysis. IEEE Trans. Softw. Eng., [117] D. 
Luckham and F.W. Henke. An overview of 12(10):997–1005, October 1986. ANNA-a specification language for ADA. Technical [99] W.E. Howden. Theoretical and empirical studies report, Stanford University, 1984. of program testing. IEEE Transactions on Software [118] Nancy A. Lynch and Mark R. Tuttle. An introduction Engineering, 4(4):293–298, July 1978. to input/output automata. CWI Quarterly, 2:219–246, [100] Merlin Hughes and David Stotts. Daistish: sys- 1989. tematic algebraic testing for OO programs in the [119] Patr´ıcia D. L. Machado. On oracles for interpret- presence of side-effects. SIGSOFT Softw. Eng. Notes, ing test results against algebraic specifications. In 21(3):53–61, May 1996. AMAST, pages 502–518. Springer-Verlag, 1999. [101] Daniel Jackson. Software Abstractions: Logic, Language, [120] Phil Maker. GNU Nana: improved support for and Analysis. The MIT Press, 2006. assertion checking and logging in GNU C/C++, [102] Daniel Jackson. Software Abstractions: Logic, Language, 1998. http://gnu.cs.pu.edu.tw/software/nana/. and Analysis. The MIT Press, 2006. [121] Haroon Malik, Hadi Hemmati, and Ahmed E. Has- [103] Claude Jard and Gregor v. Bochmann. An approach san. Automatic detection of performance deviations to testing specifications. Journal of Systems and Soft- in the load testing of large scale systems. In Proceed- ware, 3(4):315 – 323, 1983. ings of International Conference on Software Engineering [104] D. Jeffrey and R. Gupta. Test case prioritization - Software Engineering in Practice Track, ICSE 2013, using relevant slices. In Computer Software and Appli- page to appear, 2013. 28

[122] L. Mariani, F. Pastore, and M. Pezzè. Dynamic analysis for diagnosing integration faults. IEEE Transactions on Software Engineering, 37(4):486–508, 2011.
[123] Bruno Marre. Loft: A tool for assisting selection of test data sets from algebraic specifications. In TAPSOFT, pages 799–800. Springer-Verlag, 1995.
[124] A. P. Mathur. Performance, effectiveness, and reliability issues in software testing. In COMPSAC, pages 604–605. IEEE, 1991.
[125] Johannes Mayer and Ralph Guderlei. Test oracles using statistical methods. In Proceedings of the First International Workshop on Software Quality, Lecture Notes in Informatics P-58, pages 179–189. Köllen Druck+Verlag GmbH, 2004.
[126] P. McMinn, M. Stevenson, and M. Harman. Reducing qualitative human oracle costs associated with automatically generated test data. In 1st International Workshop on Software Test Output Validation (STOV 2010), Trento, Italy, pages 1–4, 13 July 2010.
[127] Phil McMinn. Search-based software test data generation: a survey. Softw. Test. Verif. Reliab., 14(2):105–156, June 2004.
[128] Phil McMinn. Search-based failure discovery using testability transformations to generate pseudo-oracles. In Genetic and Evolutionary Computation Conference (GECCO 2009), pages 1689–1696. ACM Press, 8-12 July 2009.
[129] Phil McMinn. Search-based software testing: Past, present and future. In International Workshop on Search-Based Software Testing (SBST 2011), pages 153–163. IEEE, 21 March 2011.
[130] Phil McMinn, Muzammil Shahbaz, and Mark Stevenson. Search-based test input generation for string data types using the results of web queries. In ICST, pages 141–150, 2012.
[131] Phil McMinn, Mark Stevenson, and Mark Harman. Reducing qualitative human oracle costs associated with automatically generated test data. In International Workshop on Software Test Output Validation (STOV 2010), pages 1–4. ACM, 13 July 2010.
[132] Atif M. Memon. Automatically repairing event sequence-based GUI test suites for regression testing. ACM Transactions on Software Engineering and Methodology, 18(2):1–36, 2008.
[133] Atif M. Memon, Martha E. Pollack, and Mary Lou Soffa. Automated test oracles for GUIs. In SIGSOFT '00/FSE-8: Proceedings of the 8th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 30–39, New York, NY, USA, 2000. ACM Press.
[134] Atif M. Memon and Qing Xie. Using transient/persistent errors to develop automated test oracles for event-driven software. In ASE '04: Proceedings of the 19th IEEE International Conference on Automated Software Engineering, pages 186–195, Washington, DC, USA, 2004. IEEE Computer Society.
[135] Maik Merten, Falk Howar, Bernhard Steffen, Patrizio Pellicione, and Massimo Tivoli. Automated inference of models for black box systems based on interface descriptions. In Proceedings of the 5th International Conference on Leveraging Applications of Formal Methods, Verification and Validation: Technologies for Mastering Change - Volume Part I, ISoLA '12, pages 79–96. Springer-Verlag, 2012.
[136] Bertrand Meyer. Eiffel: A language and environment for software engineering. The Journal of Systems and Software, 8(3):199–246, June 1988.
[137] B. P. Miller, L. Fredriksen, and B. So. An empirical study of the reliability of UNIX utilities. Communications of the ACM, 33(12):32–44, 1990.
[138] S. Mouchawrab, L. C. Briand, Y. Labiche, and M. Di Penta. Assessing, comparing, and combining state machine-based testing and structural testing: A series of experiments. IEEE Transactions on Software Engineering, 37(2):161–187, 2011.
[139] Christian Murphy, Kuang Shen, and Gail Kaiser. Automatic system testing of programs without test oracles. In ISSTA, pages 189–200. ACM Press, 2009.
[140] Christian Murphy, Kuang Shen, and Gail Kaiser. Using JML runtime assertion checking to automate metamorphic testing in applications without test oracles. In 2009 International Conference on Software Testing, Verification and Validation, pages 436–445, 2009.
[141] A. J. Offutt, J. Pan, and J. M. Voas. Procedures for reducing the size of coverage-based test sets. In International Conference on Testing Computer Software, pages 111–123, 1995.
[142] C. Pacheco and M. Ernst. Eclat: Automatic generation and classification of test inputs. In ECOOP 2005 - Object-Oriented Programming, pages 734–734, 2005.
[143] D. L. Parnas, J. Madey, and M. Iglewski. Precise documentation of well-structured programs. IEEE Transactions on Software Engineering, 20(12):948–976, 1994.
[144] David Lorge Parnas. Document based rational software development. Knowledge-Based Systems, 22:132–141, April 2009.
[145] David Lorge Parnas. Precise documentation: The key to better software. In Sebastian Nanz, editor, The Future of Software Engineering, pages 125–148. Springer Berlin Heidelberg, 2011.
[146] David Lorge Parnas and Jan Madey. Functional documents for computer systems. Sci. Comput. Program., 25(1):41–61, October 1995.
[147] F. Pastore, L. Mariani, and G. Fraser. CrowdOracles: Can the crowd solve the oracle problem? In Proceedings of the International Conference on Software Testing, Verification and Validation (ICST), 2013.
[148] Doron Peled, Moshe Y. Vardi, and Mihalis Yannakakis. Black box checking. Journal of Automata, Languages and Combinatorics, 7(2), 2002.
[149] D. K. Peters and D. L. Parnas. Using test oracles generated from program documentation. IEEE Transactions on Software Engineering, 24(3):161–173, 1998.
[150] Dennis K. Peters and David Lorge Parnas. Requirements-based monitors for real-time systems. IEEE Trans. Softw. Eng., 28(2):146–158, February 2002.

[151] Mauro Pezzè and Cheng Zhang. Automated test oracles: A survey. In Ali Hurson and Atif Memon, editors, Advances in Computers, volume 95, pages 1–48. Elsevier Ltd., 2014.
[152] Nadia Polikarpova, Ilinca Ciupa, and Bertrand Meyer. A comparative study of programmer-written and automatically inferred contracts. In ISSTA, pages 93–104. ACM, 2009.
[153] S. J. Prowell and J. H. Poore. Foundations of sequence-based software specification. IEEE Transactions on Software Engineering, 29(5):417–429, 2003.
[154] Sam Ratcliff, David R. White, and John A. Clark. Searching for invariants using genetic programming and mutation testing. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, GECCO '11, pages 1907–1914, New York, NY, USA, 2011. ACM.
[155] F. Ricca and P. Tonella. Detecting anomaly and failure in web applications. IEEE MultiMedia, 13(2):44–51, April-June 2006.
[156] D. S. Rosenblum. A practical approach to programming with assertions. IEEE Transactions on Software Engineering, 21(1):19–31, 1995.
[157] G. Rothermel, M. J. Harrold, J. von Ronne, and C. Hong. Empirical studies of test-suite reduction. Software Testing, Verification and Reliability, 12:219–249, 2002.
[158] S. Ali, K. Bogdanov, and N. Walkinshaw. A comparative study of methods for dynamic reverse-engineering of state models. Technical Report CS-07-16, The University of Sheffield, Department of Computer Science, October 2007. http://www.dcs.shef.ac.uk/intranet/research/resmes/CS0716.pdf.
[159] David Schuler and Andreas Zeller. Assessing oracle quality with checked coverage. In Proceedings of the 2011 Fourth IEEE International Conference on Software Testing, Verification and Validation, ICST '11, pages 90–99, Washington, DC, USA, 2011. IEEE Computer Society.
[160] R. Schwitter. English as a formal specification language. In Proceedings of the 13th International Workshop on Database and Expert Systems Applications, pages 228–232, September 2002.
[161] Sergio Segura, Robert M. Hierons, David Benavides, and Antonio Ruiz-Cortés. Automated metamorphic testing on the analyses of feature models. Information and Software Technology, 53(3):245–258, 2011.
[162] Koushik Sen, Darko Marinov, and Gul Agha. CUTE: a concolic unit testing engine for C. In ESEC/FSE, pages 263–272. ACM, 2005.
[163] Seyed Shahamiri, Wan Wan-Kadir, Suhaimi Ibrahim, and Siti Hashim. Artificial neural networks as multi-networks automated test oracle. Automated Software Engineering, 19(3):303–334, 2012.
[164] Seyed Reza Shahamiri, Wan Mohd Nasir Wan Kadir, Suhaimi Ibrahim, and Siti Zaiton Mohd Hashim. An automated framework for software test oracle. Information and Software Technology, 53(7):774–788, 2011.
[165] Seyed Reza Shahamiri, Wan Mohd Nasir Wan-Kadir, and Siti Zaiton Mohd Hashim. A comparative study on automated software test oracle methods. In ICSEA, pages 140–145, 2009.
[166] M. Shahbaz. Reverse Engineering and Testing of Black-Box Software Components. LAP Lambert Academic Publishing, 2012.
[167] K. Shrestha and M. J. Rutherford. An empirical evaluation of assertions as oracles. In ICST, pages 110–119. IEEE, 2011.
[168] Guoqiang Shu, Yating Hsu, and David Lee. Detecting communication protocol security flaws by formal fuzz testing and machine learning. In FORTE, pages 299–304, 2008.
[169] J. Sifakis. Deadlocks and livelocks in transition systems. In Mathematical Foundations of Computer Science 1980, pages 587–600, 1980.
[170] A. J. H. Simons. JWalk: a tool for lazy systematic testing of Java classes by design introspection and user interaction. Automated Software Engineering, 14(4):369–418, December 2007.
[171] Rishabh Singh, Dimitra Giannakopoulou, and Corina S. Pasareanu. Learning component interfaces with may and must abstractions. In Tayssir Touili, Byron Cook, and Paul Jackson, editors, Computer Aided Verification, 22nd International Conference, CAV 2010, Edinburgh, UK, July 15-19, 2010. Proceedings, volume 6174 of Lecture Notes in Computer Science, pages 527–542. Springer, 2010.
[172] J. Michael Spivey. Z Notation - A Reference Manual (2nd ed.). Prentice Hall International Series in Computer Science. Prentice Hall, 1992.
[173] M. Staats, G. Gay, and M. P. E. Heimdahl. Automated oracle creation support, or: How I learned to stop worrying about fault propagation and love mutation testing. In Proceedings of the 34th International Conference on Software Engineering, ICSE 2012, pages 870–880, 2012.
[174] M. Staats, M. W. Whalen, and M. P. E. Heimdahl. Programs, tests, and oracles: the foundations of testing revisited. In ICSE, pages 391–400. IEEE, 2011.
[175] Matt Staats, Shin Hong, Moonzoo Kim, and Gregg Rothermel. Understanding user understanding: determining correctness of generated program invariants. In ISSTA, pages 188–198. ACM, 2012.
[176] Matt Staats, Pablo Loyola, and Gregg Rothermel. Oracle-centric test case prioritization. In Proceedings of the 23rd International Symposium on Software Reliability Engineering, ISSRE 2012, pages 311–320, 2012.
[177] Ari Takanen, Jared DeMott, and Charlie Miller. Fuzzing for Software Security Testing and Quality Assurance. Artech House, Inc., Norwood, MA, USA, 1st edition, 2008.
[178] Ramsay Taylor, Mathew Hall, Kirill Bogdanov, and John Derrick. Using behaviour inference to optimise regression test sets. In Brian Nielsen and Carsten Weise, editors, Testing Software and Systems - 24th IFIP WG 6.1 International Conference, ICTSS 2012, Aalborg, Denmark, November 19-21, 2012. Proceedings, volume 7641 of Lecture Notes in Computer Science, pages 184–199. Springer, 2012.
[179] A. Tiwari. Formal semantics and analysis methods for Simulink Stateflow models. Technical report, SRI International, 2002. http://www.csl.sri.com/users/tiwari/html/stateflow.html.
[180] Jan Tretmans. Test generation with inputs, outputs and repetitive quiescence. Software - Concepts and Tools, 17(3):103–120, 1996.
[181] Alan M. Turing. Checking a large routine. In Report of a Conference on High Speed Automatic Calculating Machines, pages 67–69, Cambridge, England, June 1949. University Mathematical Laboratory.
[182] Mark Utting and Bruno Legeard. Practical Model-Based Testing: A Tools Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2007.
[183] Mark Utting, Alexander Pretschner, and Bruno Legeard. A taxonomy of model-based testing approaches. Softw. Test. Verif. Reliab., 22(5):297–312, August 2012.
[184] G. v. Bochmann, C. He, and D. Ouimet. Protocol testing using automatic trace analysis. In Proceedings of the Canadian Conference on Electrical and Computer Engineering, pages 814–820, 1989.
[185] A. van Lamsweerde and M. Sintzoff. Formal derivation of strongly correct concurrent programs. Acta Informatica, 12(1):1–31, 1979.
[186] J. M. Voas. PIE: a dynamic failure-based technique. IEEE Transactions on Software Engineering, 18(8):717–727, 1992.
[187] N. Walkinshaw, K. Bogdanov, J. Derrick, and J. Paris. Increasing functional coverage by inductive testing: A case study. In Alexandre Petrenko, Adenilso da Silva Simão, and José Carlos Maldonado, editors, Testing Software and Systems - 22nd IFIP WG 6.1 International Conference, ICTSS 2010, Natal, Brazil, November 8-10, 2010. Proceedings, volume 6435 of Lecture Notes in Computer Science, pages 126–141. Springer, 2010.
[188] Neil Walkinshaw and Kirill Bogdanov. Inferring finite-state models with temporal constraints. In ASE, pages 248–257, 2008.
[189] Neil Walkinshaw, John Derrick, and Qiang Guo. Iterative refinement of reverse-engineered models by model-based testing. In FM, pages 305–320, 2009.
[190] Yabo Wang and David Lorge Parnas. Trace rewriting systems. In Michaël Rusinowitch and Jean-Luc Rémy, editors, Conditional Term Rewriting Systems, volume 656 of Lecture Notes in Computer Science, pages 343–356. Springer Berlin Heidelberg, 1993.
[191] Yabo Wang and D. L. Parnas. Simulating the behaviour of software modules by trace rewriting. In Proceedings of the 15th International Conference on Software Engineering, pages 14–23, 1993.
[192] Yi Wei, Carlo A. Furia, Nikolay Kazmin, and Bertrand Meyer. Inferring better contracts. In Proceedings of the 33rd International Conference on Software Engineering, ICSE '11, pages 191–200, New York, NY, USA, 2011. ACM.
[193] Yi Wei, H. Roth, C. A. Furia, Yu Pei, A. Horton, M. Steindorfer, M. Nordio, and B. Meyer. Stateful testing: Finding more errors in code and contracts. In 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011), pages 440–443, 2011.
[194] E. J. Weyuker. Assessing test data adequacy through program inference. ACM Transactions on Programming Languages and Systems (TOPLAS), 5(4):641–655, 1983.
[195] E. J. Weyuker and F. I. Vokolos. Experience with performance testing of software systems: issues, an approach, and case study. IEEE Transactions on Software Engineering, 26(12):1147–1156, 2000.
[196] Elaine J. Weyuker. On testing non-testable programs. The Computer Journal, 25(4):465–470, November 1982.
[197] Jeannette M. Wing. A specifier's introduction to formal methods. IEEE Computer, 23(9):8–24, 1990.
[198] Qing Xie and Atif M. Memon. Designing and comparing automated test oracles for GUI-based software applications. ACM Transactions on Software Engineering and Methodology, 16(1):4, 2007.
[199] Tao Xie. Augmenting automatically generated unit-test suites with regression oracle checking. In Proceedings of the 20th European Conference on Object-Oriented Programming (ECOOP 2006), pages 380–403, July 2006.
[200] Tao Xie and David Notkin. Checking inside the black box: Regression testing by comparing value spectra. IEEE Transactions on Software Engineering, 31(10):869–883, October 2005.
[201] Yichen Xie and Alex Aiken. Context- and path-sensitive memory leak detection. ACM SIGSOFT Software Engineering Notes, 30(5):115–125, 2005.
[202] Zhihong Xu, Yunho Kim, Moonzoo Kim, Gregg Rothermel, and Myra B. Cohen. Directed test suite augmentation: techniques and tradeoffs. In Proceedings of the Eighteenth ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE '10, pages 257–266, New York, NY, USA, 2010. ACM.
[203] Shin Yoo. Metamorphic testing of stochastic optimisation. In Proceedings of the 2010 Third International Conference on Software Testing, Verification, and Validation Workshops, ICSTW '10, pages 192–201. IEEE Computer Society, 2010.
[204] Shin Yoo and Mark Harman. Regression testing minimisation, selection and prioritisation: A survey. Software Testing, Verification, and Reliability, 22(2):67–120, March 2012.
[205] Bo Yu, Liang Kong, Yufeng Zhang, and Hong Zhu. Testing Java components based on algebraic specifications. In 2008 International Conference on Software Testing, Verification, and Validation, pages 190–199, 2008.
[206] A. Zeller and R. Hildebrandt. Simplifying and isolating failure-inducing input. IEEE Transactions on Software Engineering, 28(2):183–200, 2002.
[207] S. Zhang, D. Saff, Y. Bu, and M. D. Ernst. Combined static and dynamic automated test generation. In ISSTA, pages 353–363, 2011.
[208] Wujie Zheng, Hao Ma, Michael R. Lyu, Tao Xie, and Irwin King. Mining test oracles of web search engines. In Proceedings of the 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011), Short Paper, pages 408–411, 2011.
[209] Zhi Quan Zhou, ShuJia Zhang, Markus Hagenbuchner, T. H. Tse, Fei-Ching Kuo, and T. Y. Chen. Automated functional testing of online search services. Software Testing, Verification and Reliability, 22(4):221–243, 2012.
[210] Hong Zhu. A note on test oracles and semantics of algebraic specifications. In Proceedings of the 3rd International Conference on Quality Software, QSIC 2003, pages 91–98, 2003.
[211] B. Zorn and P. Hilfinger. A memory allocation profiler for C and Lisp programs. In Proceedings of the Summer USENIX Conference, pages 223–237, 1988.