Data Flow Analysis In Software Reliability*

LLOYD D. FOSDICK and LEON J. OSTERWEIL Department of Computer ~cience, University of Colorado, Boulder, Colorado 80809

The ways that the methods of data flow analysis can be applied to improve software reliability are described. There is also a review of the basic terminology from graph theory and from data flow analysis in global program optimization. The notation of regular expressions is used to describe actions on data for sets of paths. These expressions provide the basis of a classification scheme for data flow which represents patterns of data flow along paths within subprograms and along paths which cross subprogram boundaries. Fast algorithms, originally introduced for global optimization, are described and it is shown how they can be used to implement the classification scheme. It is then shown how these same algorithms can also be used to detect the presence of data flow anomalies which are symptomatic of programming errors. Finally, some characteristics of and experience with DAVE, a data flow analysis system embodying some of these ideas, are described. Keywords and Phrases: automatic documentation, automatic error detection, data flow analysis, software reliability CR Categories: 4.40, 5.24

INTRODUCTION over, during its construction advances were made in global optimization algorithms For some time we have believed that a that are useful to us, which for the same careful analysis of the use of data in a reasons could not be incorporated in the program, such as that done in global opti- system. Our purpose in writing this paper mization, could be a powerful means for is to draw these various ideas together and detecting errors in software and otherwise present them for the instruction and stimu- improving its quality. Our recent experience lation of others who are interested in the [27, 28] with a system constructed for this problem of software reliability. purpose confirms this belief. As so often The phrase "data flow analysis" became happens on such projects, our knowledge firmly established in the literature of global and understanding of this approach were program optimization several years ago deepened considerably by the experience through the work of Cocke and Allen [2, 3, gained in constructing this system, although the pressures of meeting various deadlines 4, 5, 6]. Considerable attention has also made it impossible to incorporate all of our been given to data flow by Dennis and his developing ideas into the system. More- co-workers [9, 29] in a different context, * This work supported by NSF Grant DCR advanced computer architecture. Our own 754)9972. interpretation of data flow analysis is simi-

Copyright © 1976, Association for Computing Machinery, Inc. General permission to republish, but not for profit, all or part of this material is granted provided that ACM's copyright notice is given and that reference is made to the publication, to its date of issue, and to the fact that reprinting privileges were granted by permission of the Association for Computing Machinery.

Computing Surveys, Vol. 8, No. 3, September 1976 306 • Data Flow Analysis In Software Reliability

CONTENTS after some designated computation step. If it is not to be used, space for that vari- able may be reallocated or an unnecessary assignment of a value can be deleted. To make this determination it is necessary to look in effect at all possible execution se- quences starting at the designated execution INTRODUCTION step to see if the variable under considera- BASIC DEFINITIONS--GRAPHS tion is ever used again in a computation. BASIC DEFINITIONS--PATH EXPRESSIONS TO REPRESENT DATA FLOW This is a difficult problem in any practical ALGORITHMS TO SOLVE THE LIVE VARIABLE situation because of the complexity of exe- PROBLEM AND THE AVAILABILITYPROBLEM cution sequences, the aliasing of variables, SEGMENTATION OF DATA FLOW DETECTING ANOMOLOUS PATH EXPRESSIONS the use of external procedures, and other CONCLUSION factors. Thus a brute force attack on this ACKNOWLEDGMENTS REFERENCES problem is doomed to failure. Clever al- gorithms have been developed for dealing with this and related problems. They do not require explicit consideration of all execution sequences in the program in order to draw correct conclusions about the use of variables. Indeed, the effort expended in T scanning through the program to gather information is remarkably small. We dis- lar to that found in the literature of global cuss some of these algorithms in detail, program optimization, but our emphasis because they can be adapted to deal with and objectives are different. Specifically, our own set of problems in software re- execution of a computer program normally liability, and turn to these problems now. implies input of data, operations on it, and Data flow in a program is expected to be output of the results of these operations in consistent in various ways. If the value of a sequence determined by the program and a variable is needed at some computation the data. We view this sequence of events step, say the variable a in the step as a flow of data from input to output in which input values contribute to inter- 'y ~--- a-t- 1, mediate results, these in turn contribute to other intermediate results, and so forth then it is normally assumed that at an until the final results, which presumably earlier computation step a value was are output, are obtained. It is the ordered assigned to a. If a value is assigned to a use of data implicit in this process that is variable in a computation step, for example to ~, then it is normally assumed that the central object of study in data flow analysis. that value will be used in a later computa- Data flow analysis does not imply execu- tion step. When the pattern of use of vari- tion of the program being analyzed. In- ables is abnormal, so that our expectations stead, the program is scanned in a syste- of how variables are to be used in a compu- matic way and information about the use of tation are violated, we say there is an variables is collected so that certain in- anomaly in the data flow. Examples of data ferences can be made about the effect of flow anomalies are illustrated in the fol- these uses at other points of the program. lowing FORTRAN constructions. The first An example from the context of global opti- is mization will illustrate the point. This ex- ample, known as the live variable problem, X=A determines whether the value of some X=B variable is to be used in a computation

Computing Surveys, Vol. 8, No. 3, September 1976 L. D. Fosdick and L. J. Osterweil • 307

It is clear that the first assignment to X is similar anomalies could be embedded in a useless. Why is the statement there at all? large body of code in such a way as to be Perhaps the author of the program meant to very obscure. The algorithms we will de- write scribe make it possible to expose the pres- ence of data flow anomalies in large bodies of code where the patterns of data flow are X=A almost arbitrarily complex. The analysis is Y=B not limited to individual procedures, as is often the case in global optimization, but it extends across procedure boundaries to Another data flow anomaly is represented by include entire programs composed of many the FORTRAN construction procedures. The search for data flow anomalies can be- come expensive to the point of being totally SUBROUTINE SUB(X, Y, Z) impractical unless careful attention is Z=Y+W given to the organization of the search. Our experience shows that a practical approach Here W is undefined at the point that a begins with an initial determination of value for it is required in the computation. whether or not any data flow anomalies Did the author mean X instead of W, or are present, leaving aside the question of W instead of X, or was W to be in COM- their specific location. This determination of MON? We do not know the answers to the presence of data flow anomalies is the these questions, but we do know that there main subject of our discussion. We will see is an anomaly in the data flow. that fast and effective algorithms can be As these examples suggest, common constructed for making this determination programming errors cause data flow anoma- and that these algorithms identify the lies. Such errors include misspelling, con- variables involved in the data flow anomalies fusion of names, incorrect parameter usage and provide rough information about loca- in external procedure invocations, omission tion. Moreover, these algorithms use as of statements, and similar errors. The their basic constituents the same algorithms presence of a data flow anomaly does not that are employed in global optimization imply that execution of the program will and require the same information, so they definitely produce incorrect results; it im- could be particularly efficient if included plies only that execution may produce in- within an optimizing . correct results. It may produce incorrect Localizing an anomaly consists in finding results depending on the input data, the a path in the program containing the operating system, or other environmental anomaly; this raises the question of whether factors. It may always produce incorrect the path is executable. For example, con- results regardless of these factors, or it sider Figure 1 and observe that although may never produce incorrect results. The there is a path proceeding sequentially point is that the presence of a data flow through the boxes 1, 2, 3, 4, 5, this path anomaly is at least a cause for concern be- can never be followed in any execution of cause it often is a symptom of an error. the program. An anomaly on such a non- Certainly software containing data flow executable path is of no interest. The de- anomalies is less likely to be reliable than termination of whether or not a path is software which does not contain them. executable is particularly difficult, but often Our primary goal in using data flow analy- can be made with a technique known as sis is the detection of data flow anomalies. [8, 19, 22]. In symbolic The examples above hardly require very execution the value of a variable is repre- sophisticated techniques for their detection. sented as a symbolic expression in terms of However, it can easily be imagined how certain variables designated as inputs,

Computing Surveys, Vol. 8, No. 3, September 1976 308 • Data Flow Analysis In Software Reliability

anomaly on an executable path without resorting to symbolic execution is of con- siderable practical importance. While an anomaly can be detected me- chanically by the techniques we describe, the detection of an underlying error re- quires additional effort. The simple ex- amples of data flow anomalies given earlier make it clear that a knowledge of the in- tent of the programmer is necessary to identify the error. It is unreasonable to assume that the programmer will provide / in advance enough additional information about intent that the errors too can be C mechanically detected. We visualize the actual error detection as being done man- ually by the programmer, provided with information about the anomalies present in his program. Obviously, many tools could be provided to make the task easier, FIOURE 1. The path in this segment of a flow but in the end it must be a human who diagram represented by visiting the boxes in the determines the meaning of an anomaly. sequence 1, 2, 3, 4, 5 is not executable. Note that Y >_ 0 upon leaving box 1 and this condition is We like to think of a system which detects true upon entry to box 4, thus the exit labeled data flow anomalies as a powerful, thorough, T could not be taken. tireless critic, which can inspect a program and say to the programmer: "There is rather than as a number. The symbolic something unusual about the way you expression for a variable carries enough used the variable a in this statement. Per- information that if numerical values were haps you should check it." The critic might assigned to the inputs a numerical value be even more specific and say, "Surely could be obtained for the variable. Symbolic there is something wrong here. You are execution requires the systematic derivation trying to use ~ in the evaluation of this of these expressions. Symbolic execution is expression, but you have not given a value very costly, and although we believe further to a." study will lead to more efficient imple- The data flow analysis required for de- mentations, it seems certain that this will tection of anomalies also provides routine remain relatively expensive. Therefore a but valuable information for the documenta- practical approach to anomaly detection tion of programs. For example, it provides should avoid symbolic execution until it is information about which variables receive really necessary. In particular, with pres- values as a result of a procedure invocation ently known algorithms the least expensive and which variables must supply values to procedure appears to be: 1) determine a procedure. It identifies the aliasing that whether an anomaly is present, 2) find a results from the multiple definition of path containing this anomaly, and then COMMON blocks in FORTRAN programs. 3) attempt to determine whether the path It identifies regions of the program where some variables are not used at all. It recog- is executable. nizes the order in which procedures may We show that the algorithms presented be invoked. This partial list illustrates that here do provide information about the the documentation information provided presence of anomalies on executable paths. by this mechanism can be useful, not only While they do not identify the paths, the to the person responsible for its construction, fact that they can report the presence of an but also to users and maintainers.

Computing Surveys, VoL 8, No 3, September 1976 L. D. Fosdick and L. J. Osterweil • 309

We are ready now to enter into the de- For the graphs that will be of interest to us tails of this discussion. We begin with a it is usually true that ]E I is substantially presentation of certain definitions from graph less than IN 12; in fact it is customary to theory. Graphs are an essential tool in data assume that IEI g kINI where k is a flow analysis, used to represent the execu- small integer constant. tion sequences in a program. We follow For an edge, say (n,, nj), we say that this with a discussion of the expressions the edge goes from n, to n~ ; n, is called a we use to represent the actions performed predecessor of nj, and nj is called a successor on data in a program. The notation in- of n,. The number of predecessors of a troduced here greatly simplifies the later node is called the in-degree of the node, discussion of data flow analysis. Next, we and the number of successors of a node is discuss the basic algorithmic tools required called the out-degree of the node. For the for data flow analysis. Then we describe graph shown in Figure 2, 0 is the predecessor both a technique for segmenting the data of 1 and 2, the out-degree of 0 is two; 0 is flow analysis and the systematic applica- not a successor of any node, it has the in- tion of this technique to detect data flow degree zero. In this figure we also see that anomalies in a program. We conclude with a 4 is both a successor and a predecessor of discussion of the experience we have had with 1, and 2 is a successor and predecessor of a prototype system based on these ideas. itself. A node with no predecessors (i.e., in-degree = 0) is called an enry node, and a node with no successors (i.e., out-degree BASIC DEFINITIONS--GRAPHS = 0) is called an exit node; in Figure 2, 0 is the only entry node and 3 is the only Formally a graph is represented by G(N, E) exit node. where N is a set of nodes {nl, n2 ,... , nk} A path in G is a sequence of nodes nj 1 , and E is a set of ordered pairs of nodes n~,-.., n~k such that every adjacent called the edges, { (n~ , n~), (nj3 , n:4), "'" , pair (n~l, n~,+l) is in E. We say that this (n~_~ , n~m)}, where the n~,s are not neces- path goes from nj~ to n~k. In Figure 2, sarily distinct. For example, for the graph in 0, 2, 3 is a path from 0 to 3; 1, 4, 1, is a Figure 2, path from 1 to 1. There is an infinity of N -- {0, 1, 2, 3,4}, paths from 1 to 1: 1, 4, 1; 1, 4, 1, 4, 1; E = { (0, 1), (0, 2), (2, 2), (2, 3), (4, 2), etc. The length of a path i~ the number of (1, 4), (4, 1)}. nodes in the path, less one (equivalently, the number of edges) ; thus the length of the The number of nodes in the graph is repre- path 0, 1, 4, 1, 4, 2, 3 in Figure 2 is six. sented by J N[ and the number of edges If n~, n~2, ... , n~k is a path p, then any by ]E ]. For the graph in Figure 2, iN_ i1~ ,5 subsequence of the form n~,, nj,+~, ..., and I E ] = 7. For any graph ]E I _ ,, I n~,+~for 1 _~ i ~ kand 1 _~ m _~ k - i since a particular ordered pair of nodes is also a path, p'; we say that p contains the may appear at most once in the set E. path p'. If p is a path from n, to n~ and i -- j, o then p is a cycle. In Figure 2 the paths 1, 4, 1; 1, 4, 1, 4, 1, and 2, 2 are cycles. The path 0, 1, 4, 1, 4, 2, 3 contains a cycle. A path which contains no cycles is acyclic, and a graph in which all paths are acyclic is an acyclic graph. If every node of a connected graph has in- degree one and thus has a unique prede- FIGURE 2. Pictorial representation of a directed cessor, except for one node which has in- graph. The points, labeled here as 0, 1, 2, 3, 4, are called nodes, and the lines joining them are degree zero, the graph is a tree T(N, E). called edges. The graph in Figure 3 is a tree, and if the

Computing Surveys, VoL 8, No. 3, September 1976 310 • Data Flow Analysis In Software Reliability

0 many conference proceedings, especially those of the ACM Special Interest Group on the Theory of Computing. We now introduce some ideas and definitions drawn from this literature pertinent to the subsequent dis- cussion. When a graph is used to represent the flow of control from one statement to an- other in a program, it is called a flow graph. A flow graph must have a single entry node, but may have more than one exit node, and FIGURe. 3. Pictorial representation of a tree there must be a path from the entry node rooted at 0. Each node has a unique predecessor to every node in the flow graph. Formally, except the root which has no predecessor. a flow graph is represented by GF(N, E, no), edges (4, 1), (4, 2), and (2, 2) in Figure 2 where N and E are the node and edge sets, are deleted, then the resulting graph is also respectively, and no, an element of N, is the unique entry node. a tree. The unique entry node is called the Generally the nodes of a flow graph root of the tree and the exit nodes are called represent statements of a program and the the leafs. It will be recognized that there is edges represent control paths from one state- exactly one path from the root to each node ment to the next. In data flow analysis the .in a tree; thus we can speak of a partial order- flow graph is used to guide a search over mg of the nodes in a tree. In particular, if the statements of a program to determine there is a path from n, to n~ in a tree, then certain relationships between the uses of n, comes before nj in the tree; we say that data in various statements. Thus before n, is an ancestor of n~ and n~ is a descendent data flow analysis can begin, a correspond- of n,. In Figure 3 every node except 0 is a ence between the statements of a program descendent of 0, and 0 is the ancestor of all and the nodes of a flow graph must be of these nodes. Similarly 1 is an ancestor of established. Unfortunately, difficulties arise the nodes 2, 3, 4, 5, 6; on the other hand, 7 in trying to establish this correspondence is not an ancestor of these nodes. A tree because of the structure of the language which has been derived from a directed and the requirements of data flow analysis. graph by the deletion of certain edges, but Statements in higher level languages can of no nodes, is called a spanning tree of the graph. consist of more than one part, and not all These elementary definitions are com- monly accepted, but they are not universal. x=x+l.O Graph theory seems to be notorious for its IF(X LT Y) JfJ÷1 nonstandard terminology. Additional in- A=X*X formation on this subject can be found in n i various texts such as Knuth [24], and Harary [13]. nl+l~ The use of flowcharts as pictorial repre- rll+ 2 sentations of the flow of control in a com- I ~ puter program dates back to at least 1947 ni+3 q in the work of Goldstine and yon Neumann [11], and the advantage of the systematic application of graph theory to computer FIGURE 4. Graph representation of a segment programming was pointed out in 1960 by of a FORTRAN program. Node n, represents Karl) [21]. In recent years this approach has the statement X -- X + 1.0, node n,+~ repre- been actively developed with numerous sents the first part of the IF statement IF(X.LT.Y), node n,+z represents the second articles appearing in the SIAM Journal on part of the IF statement J = J -{- 1, and node Camputing, the Journal of the ACM, and n,+a represents the statement A = X,X.

Computing SurveyB. Vol 8, No. 3, September 1976 L. D. Fosdick and L. J. Os~rweil • 311 parts may be executed when the statement is executed. This is the case with the FORTRAN logical IF, as in IF(A .LE. 1.0)J = J -t- 1, where execution of the statement does not necessarily imply fetching a value of J from storage and changing it. For the pur- pose of data flow analysis it is desirable to separate such statements into their con- stituent parts and let each part be repre- sented by a node in Gp as illustrated in Figure 4 for this IF statement. Statements which reference external pro- cedures pose a far more serious problem. Such statements actually represent se- FIGURE 6. Illustration of a transformation which quences of statements. If a node in a flow replaces paths consisting of a single entry and a graph is used to represent an external pro- single exit by a node. In the transformed graph cedure, then some ambiguities in the data open circles have been used to identify nodes representing paths in the original graph. flow analysis arise because the control struc- ture of the represented external procedure Formally a call graph, which we represent is, so to speak, hidden. On the other hand, by Go(N, E, no) is identical to a flow if we permit this control structure to be graph. However the nodes and edges have completely exposed by placing its flow graph a different interpretation: using FORTRAN at the point of appearance of the referencing terminology, the nodes in a call graph repre- statement, then we invite a combinatorial explosion. Later we will discuss mechanisms sent program units (a main program and for propagating critical data flow informa- subprograms); an edge (n~, n~) represents tion across procedure boundaries in such a the fact that execution of the program unit n, will directly invoke execution of the pro- way as to avoid a combinatorial explosion, gram unit n~. This is illustrated in Figure 5. but at the price of losing some information. An important construction used here is the In data flow analysis the call graph is used call graph. to guide the analysis from one program unit to another in an appropriate order. In data flow analysis, transformations are C--~IAIN PROGRAM sometimes applied to a flow graph to reduce CALL SUBA~ ..) the number of nodes and edges, with nodes Y=X+FUNA( ) in the resulting graph representing larger segments of the program. One of these END SUBROUTINE SUBA( - ) transformations is illustrated in Figure 6. Here all nodes along paths from a node with Z=FUNB( )+l 0 a single exit to a node with a single entry END and containing only paths with this prop- FUNCTION FUNA( ) FUNB~~~'~°FUNA erty are collapsed into a single node. The YIFUNB( )-I 0 nodes in the transformed graph are called

END basic blocks [4, 6, 31]. The important and FUNCTION FUNB( ) obvious fact about a basic block is that it

END represents a set of statements which must be executed sequentially; in particular if FIGURE 5. Illustration of the call graph for a FORTRAN program. The nodes have been any statement of the set is executed, then labeled to identify the program unit repre- every statement of the set is executed in the sented. prescribed sequence. Maximality is implicit

Computing Surveys, Vol. 8, No. 3, September 1976 312 • Data Flow Analysis In Software Reliability in the definition of a basic block, i.e., no ing a node and pointing to a linked sublist additional nodes can be collapsed into the of successors of that node. The storage cost node representing a basic block, and the for this structure is IN I + 2 1E I words single entry, single exit condition is pre- if we assume that one word is used to store served. It follows easily that in a flow an integer. In practice it is necessary to graph in which every node is a basic block, associate information of variable length with either E = ~ (the empty set) or for every each node so we need to allow for a second (n,, n~)E E either the out-degree of n, pointer with each of the nodes, bringing the is greater than 1, or the in-degree of n~ storage cost to 2(IN I + I E I), so that if is greater than 1, or both of these condi- I E I ~ k]N ], the cost is less than or equal tions are satisfied. to 2(1 + k) IN I. Since there are no branches or cycles in a basic block, the analysis of data flow in it BASIC DEFINITIONS--PATH EXPRESSIONS is particularly simple. In some situations TO REPRESENT DATA FLOW the reduction of a flow graph in which nodes are statements to one in which the nodes When a statement in a program is executed, are basic blocks results in a significant re- the data, represented by the variables, can duction in the number of nodes. In such be affected in several ways, which we dis- cases there is a practical advantage in per- tinguish by the terms reference, define, and forming the data flow analysis on the basic undefine. When execution of a statement blocks first, then reducing the flow graph to requires that the value of a variable, say one in which the nodes are the basic blocks a, be obtained from memory we say that a and continuing the data flow analysis on the is referenced in the statement. When execu- reduced graph. However, we have found tion of a statement assigns a value to a that the average reduction in the node variable, say a, we say that a is defined in count for FORTRAN programs is about 0.56. the statement. In the FORTRAN statement Thus for FORTRAN it is not clear that a A=B+C significant advantage can be obtained by this initial preprocessing of basic blocks and B and C are referenced and A is defined, and reduction of the graph. In global optimiza- in the FORTRAN statement tion it is customary [4, 6, 31] to use basic I=I+l blocks, since an intermediate language, close to assembly language, is used and the I is referenced and defined. In the statement reduction in node count is significant. A(I) = B+ 1.0 Knuth [24] is the standard reference for data structures to represent graphs. Hop- B and I are referenced and A(I) is defined. croft and Tarjan [17] describe a structure The undefinition of variables is more com- that is particularly efficient for the search plex to describe, and we note here only a few instances. In FORTRAN the DO index algorithms described here and is illustrated in Figure 7. In this structure there is an becomes undefined when the DO is satisfied, ordered list of I N I elements, each represent- and local variables in a subprogram become undefined when a RETURN is executed. In ALGOL local variables in a block become undefined on exit from the block. We will want to associate nodes in a flow graph with sets of variables which are referenced, defined, and undefined when the node is executed. 1 In doing this the un- FIOURE 7. Data structure for the graphin Figure definition operation requires special ar- 2. This is a linked successor list representation. t Here and elsewhere we speak of nodes as if they The numbered entries could be replaced by were the objects they represent, thus avoiding pointers to a node table carrying ancillary in- cumbersome phrasing such as-- "... when the formation. statement represented by the node is executed."

Computing Surveys, Voi. 8, No 3, September 1976 L. D. Fosdick and L. J. Osterweil • 313 tention. Frequently, undefinition of a ment variable occurs not by virtue of executing a B- A(K) + 1.0 particular statement, but by virtue of exe- cuting a particular pair of statements in without looking elsewhere. That may be sequence. Consider, for example, the fol- hopeless if the program includes lowing FORTRAN segment: READ(5, 100)K DO 10 K = 1, N B = A(K) + 1.0 X = X + A(K) Y = Y + A(K)**2 For this reason we adopt the convenient, 10 CONTINUE but unsatisfactory, practice of treating all WRITE--- elements of an array as if they were a single variable. The abbreviations r, d, and u are used The DO index K becomes undefined when here to stand for reference, define, and the WRITE statement is executed after undefine, respectively. To represent a se- the CONTINUE statement, but it does not quence of such actions on a variable these become undefined when the statement abbreviations are written in left-right order X = X + A(K) is executed after the CON- corresponding to the order in which these TINUE statement. Thus it would be more actions occur; for example, in the FORTRAN appropriate to associate the undefinition statement with an edge in the flow graph rather than with a node. However, for consistency we A=A+B, prefer to associate undefinition with nodes, the sequence of actions on A is rd, while for therefore in the example above we would B the sequence is simply r. In the FORTRAN introduce a new node in the flow graph on program segment the edge between nodes for the CONTINUE and the WRITE and would associate with A=B+C this node the operation of undefinition of K. BffiA+D Similar situations in other languages can A=A+I.0 be handled in the same way. In the discus- B= A+2.0 sion which follows we assume that the un- GO TO 10 definition of variables takes place at specific the sequence of actions on A is drrdr and on nodes introduced for that purpose and that B it is rdd. We call these sequences path at such nodes no other operation, reference expressions. Habermann [12] has used this or definition, takes place. Thus, in particular same terminology in a different context. for a flow graph representation of a FORTRAN The path expressions purp', pddp ~, pdup', subroutine, we would introduce a node which where p and p' stand for arbitrary sequences would not correspond to any statement of r's, d's, and u's, are called anomalous but would represent the undefinition of all because each is symptomatic of an error as local variables on entry to the subroutine. discussed earlier. Our goal is to determine Similarly, at the subroutine exit a node whether such path expressions are present in representing undefinition of local variables a program. would be introduced. The problem of searching for certain Array elements pose a problem, too. patterns of data actions is common in the While it is obvious that the first element of field of global program optimization, a A is referenced in the FORTRAN statement subject which receives extensive treatment B = A(1) + 1.0 in a recent book by Schaeffer [31]. Recent articles by Allen and Cocke [6] and Hecht no particular conclusion can be drawn about and Ullman [16] discuss aspects of this which element is referenced in the state- problem that have particular relevance to

Computing Surveys, Vol 8, No 3, September 1976 314 • Data Flow Analysis In Software Reliability

our discussion. We focus on two problems in that the sets gen(n), kill(n) and null(n) global optimization: the live variable prob- are given. lem and the availability problem. We will For a path p and a token a we are in- show that algorithms used to solve these terested in the sequence of sets containing a problems can also be used for the efficient along the path. We traverse the path, and detection of anomalous path expressions. as each node n is visitied we write down g The live variable problem has already if a E gen(n), k if a E kill(n), and 1 if been sketched in the introduction to this a E null(n). The resulting sequence of gs, paper. The availability problem arises when ks, and ls is a path expression for a on p one seeks to determine whether the value which we denote by P(p; a). Here the alpha- of an expression, say a + /~, which may be bet used is {g, k, 1} instead of {r, d, u}. required for the execution of a selected state- Referring to Figure 8, the path expression ment actually needs to be computed, or foraonp = 0,1,2,4,2,5is may be obtained instead by fetching a pre- P(0, 1, 2, 4, 2, 5; a) = lkgkgk, viously generated and stored value for it. Since our specific interest in these problems and similarly, arises in the context of software reliability rather than global optimization, we prefer P(0, 1, 2, 5; ~) -- lklk. to characterize and define these problems in We use the notation of regular expressions a general setting which we now develop. (e.g., [18, p. 39]) to represent sets of path Consider a flow graph Gr(N, E, no). expressions. For example, the set of path With this flow graph we associate a set expressions for a on the set of all paths known as the token set, denoted by tok, leaving node 1 in Figure 8 is consisting of elements a,/~, .. • . With every node, n ~ N, we associate three disjoint sets: P(1 --% a) -- g(kg)*k + lk, gen(n), kill(n), and nuU(n), subsets of where it is to be noted that the k associated tok, with gen(n) u kill(n) u null(n) = tok. with node 1 is not included. Similarly, the This association is illustrated in Figure 8. set of path expressions for a on the set of Informally, one may think of the tokens as all paths entering node 5 in Figure 8 is representing variables in a program, and the sets gen(n), kill(n), and null(n) as P(-*5; a) = lkg(kg)* + lkl, representing certain actions performed on the tokens; for example, if the first action where it is to be noted that the k associated performed on a at node n is a definition with node 5 is not included. These too are then a E gen(n), if no action is performed called path expressions. We say a path ex- on a at node n then a E null(n), etc. The pression is simple if it corresponds to a single specific association of these sets with ele- path. It is evident that a simple path ex- ments of the program will depend on the presszon will not contam the symbols or +. problem under consideration, as we illustrate Path expressions are concatenated in an later. For the time being we simply assume obvious way. Thus, referring again to Figure 8, 0 P(1; a)P(1-*; a) = k(g(kg)*k + lk) &3 n 9en kzll null I~ve aviil 0 ~,B and Z P(--*5;a)P(5;a) --- (lkg(kg)* + lkl)k. 3 B 4 a,~ Two path expressions representing identi- 5 a,n cal sets of simple path expressions are FIOURE 8. Illustration of gen, kill and null sets equivalent. Thus, using the last path ex- assigned to the nodes of a simple flow graph. pression above, it is easily seen that The derived live and avail sets are shown in the last two columns. (lkg(kg)*+ lkl)k ~ lkg(kg)*k + lklk.

Computing Surveys, Vol. 8, No. 3, September 1976 L. D. Fosdick and L. J. Osterweil • 315

Furthermore, two path expressions differing sume that we can construct a flow graph only by transformations of the form for the program in which the nodes are statements or parts of statements, so that lg --~ g, 1/c --* k, gl -* g, kl --+ k, 11 --* 1, the following rules of membership for tokens and 1% 1--~1 representing variables can be trivially ap- are equivalent. For example plied at every node: 1+ l*gk + kkl + 11=--# + kk + 1. 1) a E kill(n) if a is referenced at n, or undefined at n; The final step in this general development 2) a E gen(n) if a is defined at n and is to introduce the sets live(n) and avail(n), a ~ kill(n); subsets of tok. For each a E tok and each 3) a E null(n) otherwise. n E N of GF(N, E, no). After these sets have been determined, o~ E live(n) if and only if P(n-*; a) suppose the live variable problem is solved. ~__ gP + pl, Now if a is defined at n and if a E live(n) and it follows easily that there is a path expres- sion of the form pddp' in the flow graph. if and onlyif a E avail(n) P(--~n; a) - pg, The truth of this conclusion is seen from where p and p~ stand for arbitrary path ex- the fact that a E live(n) implies P(n---*; a) pressions. In words, a E live(n) if and only - gp + p' and since g stands for a definition if on some path from n the first "action" on P(n---*; a) -dp + p', a, other than null, is g; and a E avail(n) if and only if the last action on a, other than hence null, on all paths entering n is g. These P(n; a)P(n--*; a) -- ddp + p". definitions are illustrated in Figure 8, where the live and avail sets are shown. Conversely, if at every node at which a is The live variable problem is: given G~,(N, defined a ~ live(n), then one may similarly E, no), tok, and, for every n E N, kill(n), conclude there is no path expression of the gen(n), and null(n) determine live(n) for form pddp~; i.e., there are no data flow every n E N. The availability problem is: anomalies of this type. given G~,(N, E, no), tok, and for every n E N, kill(n), gen(n), and null(n) deter- ALGORITHMS TO SOLVE THE LIVE VARIABLE mine avail(n) for every n E N. While one PROBLEM AND THE AVAILABILITYPROBLEM might solve these two problems directly in terms of the definitions, that is by deriving In the last section the live variable problem the path expressions and determining if and the availability problem were defined they have the correct form, such an approach and a simple example was given to show would be hopelessly slow except in the most how a solution to the live variable problem trivial cases. Instead, these problems are can be used to determine the presence or attacked by using search algorithms directly absence of data flow anomalies. In this sec- on G~ which avoid explicit determination of tion we describe particular algorithms for path expressions, but which do provide solving the live variable problem and the enough information about the form of the availability problem. Several such algo- path expression to solve the live variable rithms have appeared in the literature [6, problem and the availability problem. These 16, 23, 31, 35]. The pair of algorithms we algorithms are discussed in the next section. have chosen for discussion do not have the Before closing this section we show how lowest asymptotic bound on execution time. these tools are helpful with a simple ex- However, they are simpler and more widely ample. In this example the problem is to applicable than others and their speed is detect the presence of path expressions competitive. (now in terms of references, definitions, The algorithms involve a search over a and undefiuitions) of the form pddp'. As- flow graph in which the nodes are visited

Computing Surveys, Vol. 8, No. 3, September 1976 316 • Data Flow Analysis In Software Reliability

in a specific order derived from a depth first search. This search procedure is defined by the following algorithm, where it is assumed that a flow graph G~,(N, E, no) is given, and a push down stack is available for storage. Algorithm Depth First Search: 1. Push the entry node on a stack and mark it (this is the first node visited, FIGURE 10. Illustration of postorder and r-post- nodes are marked to prevent visiting order numbering of the nodes of a graph. The r-postorder numbers are in parentheses. them more than once). 2. While the stack is not empty do the following: Figure 10 the nodes are numbered in post- 2.1 While there is an unmarked edge order. This numbering could be generated in from the node at the top of the the following way. Introduce a counter in stack, do the following: the depth first search algorithm and initi- 2.1.1 Select an unmarked edge from alize it to 0 in step 1. In step 2.2, before the node at the top of the stack popping the stack, number the node at the and mark it (edges are marked top of the stack with the counter value and to prevent selecting them more then increment the counter. If each post- than once) ; order node number, say k, is complemented 2.1.2 If the node at the head of the with respect to I N I, i.e., k' ¢- IN { - k, selected edge is unmarked, then the new numbering represents an then mark it and push it on the ordering known as r-postorder [16]. This stack (this is the next node numbering is shown in parentheses in Figure visited) ; 10. 2.2 Pop the stack; The depth first spanning tree [33] of a 3. Stop. flow graph is an important construction for In Figure 9 the nodes of the flow graph the analysis of data flow. This construction are numbered in the order in which they are can be obtained from the depth first search first visited during the depth first search. algorithm in the following way. Add a set We follow the convention that the left- E-which is initialized to empty in step 1. most edge (as the graph is drawn) not yet In step 2.1.2 put the selected edge in E' marked is the next edge selected in step if the head of the selected edge is unmarked. 2.1.1; thus the numbering of the successor After execution of this modified de~.th first nodes of a node increases from left to right search algorithm, the tree T(N, E ) is the in the figure. The ordering of the nodes depth first spanning tree of Gp(N, E, no), implied by this numbering is called pre- the flow graph on which the search was order [24]. The order in which the nodes executed. The depth first spanning tree are popped from the stack during the depth of the flow graph in Figure 9 is shown in Figure 11. The edges in the set E - E' first search is called postorder [16, 24]. In fall into three distinct groups: 0 1) forward edges with respect to T: e E E - E', is in this group if this edge goes from an ancestor to a descendant of T; 2) back edges with respect to T: e E E -- E', is in this group if this edge goes from a descendant to an an- FIGURE 9. Numbering of the nodes of a graph in cestor of T, or if this edge goes from a the order in which they are first visited during a depth first search. This numbering is called node to itself; preorder. 3) cross edges with respect to T: e E

Computing Surveys, Vol. 8, No. 3, September 1976 L. D. Fosdick and L. J. Osterweil • 317

0 The notion of dominance which is intro- duced here is defined as follows. Given a pair of nodes n, and n~ in Gr, n, dominates n, if and only if every path from no to n~ contains n,. It can be easily seen from this theorem that the flow graph in Figure 9 is not reducible. Notice that the ed~ge (5, 4) is a back edge (cf. Fig. 12) with respect to FIGURE 11. Depth first spanning tree of the flow graph shown in Figure 9 Nodes are numbered the spanning tree in Figure 11. On the in preorder. other hand, node 4 does not dominate node 5; notice the path 0, 7, 8, 5. If this back E - E', is in this group if this edge edge is deleted, then the remaining graph is goes between two nodes not related reducible. The frequently mentioned para- by the ancestor-descendant relation- digm of a nonreducible flow graph is shown ship. in Figure 13. Some experiments [6, 25] have led to the These edges are shown in Figure 12 for the general belief that flow graphs derived from flow graph in Figure 9 and for the tree shown actual programs often are reducible. For in Figure 11 derived from it. Tarjan [34] flow graphs with this property, particularly has shown that it is possible to perform a fast algorithms have been developed [1, 23, depth first search, number the nodes in 35] for the live variable problem and the preorder, determine the number of descend- availability problem. Recently two al- ants for each node in the depth first span- gorithms for solving these problems on any ning tree, and determine the backedges, flow graph were presented by Hecht and forward edges, and cross edges, all in Ullman [16]. While these algorithms are not 0(Ihr I + [E l) time. always as fast as the others, they are com- This way of characterizing the edges in a petitive and they have the distinct advan- flow graph is particularly valuable for an analysis of data flow patterns. It is to be noted in particular that if the back edges are deleted in Figure 12, then the resultant graph is acyclic. This is true in general. The cycles in a graph cause the major complica- tion in the analysis of data flow. All of the data flow analysis algorithms would have 0(l E [) execution times if cycles were ab- FIGURE 12. Forward edges, back edges, and cross sent, but with cycles present they have edges marked by dashed lines and lettered f, b, execution times which generally grow faster e, respectively. This grouping is with respect than linearly in IEI as IEI --* ~. By to the tree shown in Figure 11. focusing attention on back edges one can more easily see how cycles add to the complexity of a data flow analysis al- gorithm. Some data flow analysis algorithms re- quire the flow graph to be reducible. This property is characterized in the theorem below, which follows from results of Heeht and Ullman [15]: THEOREM. GF is reducible if and only if n, dominates n~ in GF for each back edge (n~, n,), where j ¢ i, with respect to a depth first spanning tree of GF. FmURE 13. Paradigm of a nonredueible graph.

Computing Surveys, Vol. 8, No. 3, September 1976 318 • Data Flow Analysis In Software Reliability tages of simplicity and generality (they are We refer to the paper by Hecht and not restricted to reducible flow graphs). Ullman [16] for a proof of correctness of These algorithms are described below. this algorithm. Its operation is illustrated in The following algorithm [16] determines Figure 14 where the live sets are indicated the live sets of a flow graph. This algorithm before each execution of the step labeled assumes that the nodes have been numbered by (*). It is easily verified that the total 0, 1, ... , n in postorder and refers to the number of times step (*) is executed in nodes by the postorder number. this example is twelve: first the for loop is Here S(j) denotes the set of successors of executed six times, making one pass over node j, and ~/denotes the empty set. the six nodes, then since there was a change Algorithm LIVE: to the live sets a second pass is made, during for j ~-- 0 to n do live(j) ~ y~; which no change occurs to the live sets, change ~-- true; and this completes execution. while change do The correctness of the algorithm does not begin depend on the order in which the nodes change ,,--- false; are visited, but the execution time does. In for j *-- 0 to n do the simple example just considered it is begin easily verified that if the nodes were visited previous (-- live(j); in the order 5, 4, 2, 3, 0, 1 then eighteen (*) live(j) ¢- O ((live(k) executions of step (*) would be required; N (tok - kill(k))) U gen(k)); note that in this case a is not put in the live k E S(j) set of node 2 during the first pass. The if previous # live(j) then nodes are visited in postorder to ensure a change +- true; relatively rapid termination of the algo- end rithm. In particular, if the flow graph is end acyclic, then after the while loop is exe- stop cuted once M1 live sets are correct; one

,5

node gen k111 null 4 9 (l,~

>3 2 ~ B 3 L~ 4 ,,[~ oA 5 (~,B

llve sets before k th execution of step * in LIVE nod•ek= 1 2 3 4 5 6 7

2 ~ ¢ ¢ ,~ ~ ~ ~ further 3 ¢ $ ~, !, $ ¢ ¢ changes 4 @ $ ¢ $ @ ~,8 a,(~ 5 ¢ i $ } ¢ $ $

FIGURE 14. Illustration of the steps in the creation of the live sets by algorithm LIVE for a simple flow graph. Nodes are numbered ]n postorder. The correct live sets are obtained after five execu- tions of step *, however seven more executions are required before no change to the sets is recog- nized which then terminates execution.

Computing Surveys, Vol 8, No 3, September 1976 L. D. Fosdick and L. J. Osterweil • 319

more execution of the while loop is required termine the avail sets of a flow graph. This to establish that there are no further algorithm assumes that the nodes have been changes to the live sets. Thus for an acyclic numbered 0, 1, ..., n in r-postorder and flow graph the step (*) is executed 2 IN I refers to the nodes by the r-postorder num- times. If there is one back edge, then the ber. Here P(j) denotes the" se$ ~qf prede- effect of a gen can be propagated to a lower cessors of node j, and ~/denotes the empty numbered (in postorder) node, and it is not set. too difficult to see that upon completion Algorithm AVAIL: of the second (at most) execution of the avail (0) <-- $2~; while loop all live sets will be correct. Thus forj *-- 1 to n do avail(j) (--- tok; for a flow graph with one back edge the change <----true; step (*) is executed 31NI times at most. while change do Hecht and Ullman [16] have shown that if begin r is the number of times the step (*) is change <----false executed, then for j ~-- 1 to n do ~< (2 + d)lNt, begin where d is the largest number of back- previous <-- avail(j); edges in any acyclic path in the graph. For a avail(j) <--- N ((avail(k) reducible flow graph it has been shown [15] N (tok -- kill(k))) U gen(k)); that the back edges are unique, but if the k C P(j) flow graph is not reducible then the back if previous ~ avail(j) then edges will depend on the depth first spanning change <-- true tree. The d appearing above refers to the end back edges with respect to the spanning tree end generated to establish the postorder num- stop bering of the nodes. We refer again to the paper by Hecht and We now present an algorithm [16] to de- Ullman [16] for a proof of correctness of this

node 9en k111 null I

l a,t~ 2 a,~ 3 a B 4 ~,~ '4 S a,(~

avall sets before k th executlon of step + ~n AVAIL nod•ek-- I 2 3 4 ~ 6 7

1 ~,B B B B B ~ B no 2 a,B a,B @ @ ¢ ¢ ¢ further 3 a,B a,B ~,B ¢ ¢ ¢ changes 4 ~,B e,B a,B a,B ~ ~

FIGURE 15. Illustration of the steps in the creation of the avail sets by algorithm AVAIL for a simple flow graph. Nodes arc numbered m r-postorder. The correct avail sets are obtained after five ex- ecutions of step *, however seven more executions are required before no change to the sets is recognized which then terminates execution.

Computing Surveys, Vol. 8, No. 3, September 1976 320 • Data Flow Analysis In Software Reliability algorithm. Its operation is illustrated in the program is a natural basis for the seg- Figure 15 where the avail sets are indicated mentation of the data flow analysis. Here before each execution of the step labeled by we describe how this is done in such a (*). Here, as with the example for LIVE way as to permit detection of data flow it is easy to verify that step (*) is executed anomalies on paths which cross procedure twelve times. With the exception of the boundaries. We will see that the system for entry node, which is treated separately, it doing this naturally includes the detection of does not matter in what order the remaining data flow anomalies on paths which do not nodes are visited in the while loop so far cross procedure boundaries. In this section as correctness is concerned, but it does we describe the identification and repre- matter for the execution time. Again, the sentation of the data flow, and in the next back edges are a critical factor. With r and section we describe the detection of anoma- d as defined before, Hecht and Ullman [16] lous data flow. show that We make several assumptions at the out- set. The first concerns aliasing, the use of r ~ (2+d)(tN I -- 1). different names to represent the same Empirical evidence obtained by Knuth datum. In crossing a procedure boundary [25] leads Hecht and Ullman [16] to the the name of a datum typically changes from conclusion that in practice one can expect the so-called actual name used in the in- d ~< 6 and on the average d ~ 2.75 for voking procedure to the so-called dummy FORTRAN programs. However, it is to be name used in the invoked procedure. It is noted that there are pathological situations, assumed here that the aliases for a datum as shown in Figure 16, for which the execu- are known and that a single token identifier tion time is much larger than these numbers is used to represent them. Thus, in particu- indicate. lar, in our notation for representing actions on a token a along some path p we use SEGMENTATION OF DATA FLOW P(p; c~) even when p crosses a procedure boundary and the datum represented by Normally a program consists of a main is known by different names in the two program and a number of subprograms or procedures. The second assumption we make external procedures. This segmentation of is that the procedures under consideration have a single entry and a single exit. We nl could permit multiple entries and multiple exits, but it would complicate the discussion without adding anything really important to it. While we will discuss the segments as if they were procedures, it will be obvious that the discussion applies equally well to any single-entry, single-exit segment of a pro- gram. Our most restrictive assumption is that the call graph for the program is acyclic. This excludes recursion. We will discuss this restriction later. Let us consider a flow graph GF(N, E, no) in which some node invokes an external procedure, as illustrated in Figure 17. In nt order to analyze the data flow in Gp(N, E, no) it is necessary to know certain facts about FIGURE 16. Pathological situation in which the the data flow in the invoked procedure. In execution time for the availability algorithm is particular we need to know enough about unusually long. Here d = 1 N -- 13, r = (I N ] -- 1) 3 assuming a s gen(n,) and a ~ kill(n~) and is the data flow in the invoked procedure to be not in any other gen or ]:ill sets. able to detect anomalous patterns in the

Computing Surveys, Vol. 8, No 3, September 1976 L. D. Fosdick and L. J. Osterweil • 321

G F (N, E, n o ) G~ (N'. E: no') permit recognition of situations where an anomalous path expression exists on all paths entering or leaving a node. This recog- nition is important because it permits us to I conclude something about the presence or absence of anomalous path expressions on executable paths. Figure 1 makes it clear ) that some paths in a flow graph may not be executable, and it is evident that anomalous path expressions on them are not important. Only anomalous path expressions on execut- able paths are important as indicators of a possible error. Unfortunately, the recogni- tion of executable paths is difficult2, but if we make the reasonable assumption that FIGURE 17. At node n in GF(N, E, no) an external procedure is invoked. The flow graph of the every node is on some executable path, then invoked procedure is represented by Gr' (N', E', if all paths through a node are known to be no'). anomalous we may draw the very useful conclusion that there is an anomalous ex- flow across the procedure boundaries. Re- pression on an executable path. Certain ferring to GF in Figure 17 and considering a additional forms for path expressions need single token a, it becomes evident that we to be. distinguished to achieve this. We ob- need to recognize three cases to detect serve, for example, that if anomalous patterns of the form purp' on P(---m; a) -- pu and P(n; a) -- rp', paths crossing the procedure boundary: a) P(--~n; ~) ~ pu + p', then on every path in Gr of the form no, • • • , n, .. • there is an anomalous path expression P(n; a) ~ rp -~ p'; purp'. Thus it would be desirable to be able b) P(n;a) ~ pu + p', to distinguish the form rp. Notice also that if P(n--% a) ~ rp + p'; P(--*n;a) -pu, P(n;a) --rp' + 1, c) P(--~n;a) -= pu + p,t P(n---% a) - rp, P(n; a) -~ 1 "t- p, then the same conclusion can be drawn, so P(n--*; c~) ~ rp -k p. it is also desirable to distinguish the form Thus all we need to know about the data rp -4- 1. Similar considerations show the need flow in the invoked procedure is whether for recognizing the forms pu, pu + 1, 1 and P(n; ~) has one of the following three similar considerations for anomalous ex- forms: rp W p', pu Jr p', 1 W p'. The last pressions of the form pddp', and pdup' lead form represents the situation in which there to corresponding forms involving d and u. is at least one path through the invoked Collecting these results leads to the seven procedure on which no reference, no defini- forms for path expressions shown in Figure tion, and no undefinition of a takes place. 18. Corresponding sets A~(n), B~(n),..., It is evident that a particular P(n; a) I(n) which are subsets of the token set are could have more than one of these forms, defined as follows: e.g., ru W 1 has all three forms. Similar a E A~(n) if P(n;a) - xp; consideration of the problem of detecting a EBb(n) if P(n;a) = xp ~- 1; anomalous path expressions of the form pddp' and pdup' leads to the conclusion that a E I(n) if P(n;a) = l(n). the following additional forms for P(n; a) need to be recognized: dp ~ p', pd + p, up ~ p'. = Indeed this problem is not solvable in general, for if we could solve it we could solve the halting We now wish to extend these ideas to problem [18].

Computing Surveys, Vol. 8, No. 3, September 1976 322 • Data Flow Analysis In Software Reliability

label path expression can augment Gr by attaching a new entry node and new exit node with these prop- A xp x erties without affecting the data flow pat- terns. The algorithms for determining the Bx xp+l path sets are presented informally below. CX xp+p ' They are presented in alphabetic order; however, as will become apparent, a differ- Dx px ent order is required for their execution. Ex px+l A satisfactory execution order is A,(n'), C~(n'), B~(n'), D~(r[), F~(n'), E,(~), F px+p ' x I(n'). I l Algorithm Determine A x( n') FIOUI~B 18. Seven forms for path expression in 1) for all n such that n E N - {n.xit} do single-entry, single-exit flow graphs and labels null(n) (--- I(n) U B~(n) ; used to identify them. The parameter x stands for r, d, or u. kill(n) ~-- A~(n) ; gen(n) ~ tok -- (kill(n) U null(n)) ; These sets are called path sets. Although this 2) null(n~it) (----f2~; classification scheme was developed for kill(noxit) ~ ;~; situations in which n represents a procedure gen( ne,,it) ¢--- tok; invocation as illustrated in Figure 17, it 3) execute LIVE on Gp ; will be recognized that it applies when n is a 4) A,(n') ~ tok - live(no) ; simple node representing, say, an assignment {comment--the null sets are not statement• For example if n represents explicitly needed but are included a ~-- a + # and tok ffi {a, #, ~}, then here for clarity}. a E Ar(n), a E C,(n) fl E A~(n), Algorithm Determine B~( n') E C,(n) aE Dd(n), aE Fd(n) 1) for all n such that n E N do E I(n). null(n) ~-- I(n) U B,(n) ; In such simple cases membership in the sets kill(n) (----Ax(n) ; can be determined by rather obvious rules. gen(n) *--- tok On the other hand, when the node represents - (kiU(n) U null(n) ) ; a procedure invocation, determination of 2) execute LIVE on Gr ; membership in the path sets requires an 3) B~(n') ~ (tok - live(no)) analysis of the data flow in the procedure. [7 (tok -'a,(n') ) fl C,(n'). It is this problem to which we now direct Algorithm• Determine. C,( n ! ) our attention. 1) for all n such that n E N do Suppose the path sets for node n' are to gen(n) ~ C~(n) ; be determined. We assume that n' repre- kill(n) ~ (Av(n) U A,(n)) ; sents the invocation of an external procedure {comment--x, y, z is any per- with flow graph Gr and that the path sets mutation of r, d, ul for the nodes of Gr have been determined null(n) ,,-- tok already. (Our focus of attention now shifts - (gen(n) U kill(n))" to the invoked procedure and to avoid an 2) execute LIVE on Gr ; excess of primes we have switched the role 3) C~(n') ~ live(no). of primed and unprimed quantities shown in Algorithm Determine D,( n') Fig. 17.) We also assume that no data ac- 1) for all n such that n E N do tions take place at the entry node and exit gen(n) ~-- D,(n) ; node of Gr ; thus kill(n) (---(F,(n) U F,(n)); I(no) = tok and I(n~xit) -- tok. {comment--x, y, z is any per- mutation of r, d, u} This assumption is not restrictive since we null(n) (-- tok

ComputingSurveys, Vol. 8, No. 3, September 1976 L. D. Fosdick and L. J. Osterweil • 323

- (gen(n) U kill(n)) ; =-- xp. We observe P(n'; a) ffi P(no; a) 2) execute AVAIL on Gr ; P(no-~; a) = P(no -,; a), the last equal- 3) D~(n') ~-- avail(nox,t). ity following from the fact that no data Algorithm Determine E,( n') action takes place at no. Thus a E 1) for all n such that n E N - {no} do live(no) =~ P(n'; a) == xp; i.e., a E gen(n) ~-- D~(n) ; A~(n'). Now consider the "only if" part. kill(n) (-- Fv(n) U F,(n) ; a E live(no) ~ P(no--*; a) ------gp + p' {comment--x, y, z is any per- Consequently, because of the construction mutation of r, d, u} of the gen sets P(no~; a) -- yp + p' null(n) (-- tok where y # x. (Note that a E gen(n) implies, by step 1, P(n; a) -xp, - (gen(n) U kill(n) ) ; 2) gen(no) (-- $ok; P(n; (~) - xp + 1, P(n; a) - 1).Con- kill(no) ~- ~; sequently, a E live(no) ~ P(n'; a) null(no) ¢--- 25; -- yp + p' # xp; i.e., a ~ A~(n').~ 3) execute AVAIL; Proof Determination of B,(n') is correct 4) E~(n') ~-- avail(n~xi,) Let a E tok. By step 3 of the algorithm N (tok - D,(n')) N F,(n'). o~ E B,(n') if and only if a E live(no) and Algorithm Determine F,( n') ~ A x(n ~) and o~ E C,(n') Take the "if" 1) for all n such that n E N - {no} do part first: a ~ live(no) ~ P(no-o; a) = kp, gen(n) (--- D~(n) U D,(n); orP(no--*; ~) = kp + 1, orP(no-o; a) = 1. {comment--x, y, z is any per- The first and last alternatives are ex- cluded by the conditions a ~ A,(n') and mutation of r, d, u} t kill(n) (--- F,(n) ; ot E C,(n ). This leaves only,P(no--*; a) null(n) (-- tok = kp + 1 and using aEkill(n) -- (kill(n) U gen(n) ) ; P(n;a) = xpgivesP(no---*;a) -- xp + 1. 2) gen(no) *--- tok; Finally kill(no) *--- ~; P(n'; a) = P(no; ~)P(no--*; ~) null(no) *-- 25; -~ P ( no---* ; a) 3) execute AVAIL; P(n'; a) -~ xp + 1; 4) F,( n') ~-- tok - avail(ned,t) Algorithm Determine I ( n') i.e., a E B,(n'). Now consider the "only 1) I(n') ~ N I(n). if" part. hEN a E live(no) ~ P(no---~; a) = gp + p'. Since LIVE and AVAIL terminate, it is obvious that these algorithms terminate. From step 1 it is seen that Proofs of correctness for some of these al- gorithms are presented below. a E gen(n) ~P(n;a) - yp+p', y #x. Proof Determination of A~(n') is correct Hence a E live(no) ~ P(no-*; a) -- Let a E tok. By step 4 of the algorithm yp + p', y # x, and from this it is a E A~(n') if and only if a ~ live(no). easily concluded that a ~ B,(n'). It is Take the "if" part first: immediately evident that a E A~(n') a E Bx(n') and that a ~ C~(n') ~ a E ~ live(no) ~ P(no-*; a) # gp + p' B,(n t ).[] P(n0--~; a) =--- kp or P(no-*; a) --= kp + 1, Proof Determination of D~(n') is correct Let a E tok. By step 3 of the algorithm or P(no--% a) ------1. The last two alterna- tives are ruled out because by step 2 of the a E D~(n') if and only if a E avail(nex,O. algorithm gen(n~,,t) = tok. Consequently, Take the "if" part first. because of the construction of the k,ll sets in step 1, a ~ live(no) ~ P(n0-*; a) a E avail(ne,~it) ~ P(-~nex,t; a) - pg.

Computing Surveys, VoL 8, No. 3, September 1976 324 • Data Flow Analysis In Software Reliability

Hence DETECTING ANOMALOUS PATH EXPRESSIONS P(n'; a) -- P(-'-~nexlt; ~)P(nex,t; a) -- pg. It will be recalled that we have defined an anomalous path expression to have one of the Now using the fact that a E gen(n) ! f ? P(n; a) ffi px we conclude that forms: purp, pddp, or pdup. Let us assume now that the path sets have been determined a E avail(nexit) ~ P(n'; a) - px; i.e., for every node of a flow graph G~,(N, E, no). a E Dx(n'). It should be evident that if (n, n') E E and a E Fu(n) and a E Cr(n'), then there Now take the "only if" part. is a path expression of the form purp': a ~ avail(ne.it) ~ P(--~ne, it; a) -- pk T p' a E Fu(n), a E Cr(n') =* P(nn'; a) - or P(-~ne~,t; a) --- 1. purp' -[- pip. Note, however, that the un- definition and reference do not necessarily Since a E kill(n) implies P(n; a) ! occur on nodes n and n' respectively. In- py ~ p, y ~ x, it easily follows that deed, these data actions may not even occur a ~ avail(n~z,t) ~ P(n'; a) ~ px; i.e., on nodes of this flow graph: they might ? a ~ Dx(n ).~ occur on nodes of other flow graphs repre- senting invoked procedures. We only know The last item to be discussed in this sec- that on some path which includes the edge tion is the initiation and progressive de- (n, n') there is an anomalous path ex- termination of the path sets for a program. pression. Also this anomalous path expres- Consider the call graph shown in Figure 5. sion may not be on an executable path, but Since the subprogram FUNB invokes no if a E D~(n) and a E Ar(n'), then we may other subprogram, the algorithms just reasonabl~" conclude that the path expres- presented are unnecessary in the determina- sion purp occurs on an executable path. tion of the path sets for the nodes of the In this case our assumptions imply that on flow graph representing FUNB. In this flow every path which includes the edge (n, n') graph each node will represent a simple there must be an anomalous path expression: statement or part of a statement having no E Du(n), a E Ar(n') ~ P(nn'; a) ---- underlying structure, so the path set de- purp'. We assume at least one of these paths termination can be made by inspection. is executable. In this section these ideas are Once this is done the path sets for the nodes expanded to include the detection of of the flow graph representing SUBA can be anomalous path expressions on paths which determined, since the path sets are known go through a selected flow graph. for the only subprogram it invokes. The Assume that the path sets have been same remarks apply to the flow graph repre- constructed for a flow graph Gp, and we senting FUNA. Finally, after these path wish to determine whether sets are determined it is possible to deter- mine the path sets for the nodes of the flow P(n; a)P(n---*; a) =- pxyp' "~ pH graph representing MAIN. Thus by working backwards through an acyclic call graph it or is possible to apply the algorithms just P(n; a)P(n---~; a) -- pxyp' described. We call this backward order the leafs-up subprogram processing order. We for each n E N and each a E tok. For anom- have restricted our attention to acyclic call aly detection we are interested in those graphs because this procedure breaks down cases whenx = u,y - rorx = d,y = d, if a cycle is present in the call graph. One or x = d, y = u, but there is no need to fix way to solve this problem if cycles are the values of x and y now. A similar, but not present might be to carry out an iterative equivalent, pair of problems is to determine procedure, as suggested by Rosen [30], in whether which successive corrections are made to P(-*n; a)P(n; a) ~- pxyp' -~- p" some initial assignment of path sets but we have not pursued this idea. or

Computing Surveys, Vol 8, No 3, September 1976 L. D. Fosdick and L. J. Osterweil • 325

P(---~n; a)P(n; a) - pxyp' last data actions on paths entering or leav- ing a flow graph. Therefore, if we are to for each n E N and each a C tok. The dis- detect the presence of all anomalous path cussion of the last section should make it expressions in an entire program by the apparent that the first pair of problems can method just described, we must apply it be attacked with the algorithm LIVE and systematically to the flow graphs for each the second pair of problems can be attacked of the subprograms in the entire program. with the algorithm AVAIL. Indeed, the In practice this would be done in the order algorithms presented in the last section have, dictated by the call graph, as already dis- in effect, solved these problems. cussed in connection with constructing the Consider the algorithm to determine path sets. Indeed, these two processes would A~(n'). After execution of step 3, suppose be done together while working through the we construct the sets subprograms. To illustrate, consider the call A~(n--~) = tok - live(n) graph shown in Figure 5. The steps per- formed would be as follows: for all n E N. Note that in step 4 we did this for the entry node only. It is evident that 1) For FUNB determine the sets a E Ax(n--~) implies P(n---~; a) = xp, Ax(n') ... I(n'), Ax(n--~), and conversely. Hence if a E Dr(n) and Cx(n-*), D~(-,n), F~(--m); a E A~(n---~) we know that 2) For FUNB construct the sets F~(n) N Cy(n--~), ... , D~(---m) P(n; a)P(n---% a) ~-- pyxp' NAy(n) and report anomalies; and so if y = u and x = r, an anomalous 3) Repeat steps 1 and 2 for SUBA; path expression of the form purp' is known 4) Repeat steps 1 and 2 for FUNA; to be present. 5) Repeat steps 1 and 2 for MAIN. Now, using the idea and notation sug- The time required to do the detection of gested in the last paragraph assume that anomalous path expressions is essentially we augment the last step in the algorithm controlled by the time required to execute for A~(n'), C~(n'), D~(n'), and F~(n') LIVE and AVAIL. Step 1 of the example described in the last section to construct the described above requires nine executions of sets A~(n--~), C~(n---~), D~(----~n), F~(---*n). LIVE (A~, B~, C~ for x -- r, d, u), and Using them we construct the set intersec- nine executions of AVAIL (D~, E~, F~ for tions: F~(n) Iq C~(n--~), D~(n) rl A~(n.-o), x = r, d, u), plus a small additional amount F~(---*n) f'l C~(n), and D~(---~n) Iq Ay(n). of time proportional to the number of nodes Then it is seen that: in the flow graph. We are assuming that the a E Fx(n) fl C~(n---*) set operations can be done in unit time so ¢:* P(n; a)P(n---*; a) - pxyp' q- p"; there is no dependence on the number of a E D~(n) Iq Ay(n---*) tokens. In practice this assumption has only ¢=* P(n; a)P(n--*; a) - pxyp'; limited validity. Step 2 of the example ,~ E F~(--,n) N C~(n) described above requires a time proportional ¢=* P(--m; a)P(n; a) -- pxyp' q- p"; to the number of nodes (in particular a E D~(--m) fl A~(n) 4( I N ] - 2) where the -2 term arises be- ¢=* P(---m; a)P(n; a) - pxyp'. cause we can ignore the entry and exit nodes). Therefore, if a call graph has I N. I The proofs of these assertions, which we nodes and I~1 is the average number of omit, are essentially the same as those given nodes in each flow graph represented by a in the previous section, Segmentation of node of the call graph, the time • to detect Data Flow, for the determination of the sets all anomalous path expressions may be ex- A~(n'), .... pressed as It will be recognized that the segmenta- r -- IN, 1(9r,.,w + 9rAvAIL + k I h7 I), tion scheme described in the previous sec- tion permits exposure only of the first and where rL~v~ and rxvxm are execution times

Computing Surveys, Vol. 8, No. 3. September 1976 326 • Data Flow Analysis In Software Reliability

for LIVE and AVAIL. If we use the results Similarly, it is possible to determine the given in the section Algorithms to Solve the arguments which are assigned values by a Live Variable Problem and the Availability procedure, i.e., the output arguments. How- Problem for the execution times for LIVE ever, unlike the case for initialization where and AVAIL, we see that in practical situ- the set Cr(n t) identifies the arguments re- ations we can expect to detect the presence quiring initialization, none of the path sets is of all anomalous path expressions in a sufficient for this purpose. Notice in par- program in a time which is proportional to ticular that Fd(n ) is not satisfactory be- the total number of flow graph nodes. While cause P(n'; a) - pdr obviously implies the constants of proportionality might be that a is an output for the procedure repre- large and there would be a substantial over- sented by n' yet a ~ Fd(n'). However, it is head to create the required data structures, not difficult to construct an algorithm for the important point is that a combinatorially this purpose. Indeed, we only need to explosive dependence on IN I has been modify one step in the algorithm for avoided. Fx(n'); in particular, replace gen(n) (-- The principal reason why a combinatorial Dr(n) (J D~(n) by gen(n) (----Du(n). explosion has been avoided is that we have Then after step 4, a E Fd(n') implies a not looked explicitly at all paths. The loss is an output for the procedure represented of information resulting from this does not by n'. It will be recognized that this ex- prevent us from detecting the presence of cludes tokens for which P(n'; ~) =- pdr*u. anomalous path expressions, but it greatly This is reasonable, since the definition is restricts our knowledge about specific paths destroyed by the subsequent undefinition, on which the anomalous path expression and no value is actually returned to the occurs. Thus if a E Fu(n)N Cr(n--,), we invoking procedure. Thus we have a mecha- know that on some path starting at n we nism for providing automatic documentation will find an expression of the form purp', but about procedure outputs, or for verifying we do not know which path and we do not assertions about which procedure arguments know which nodes on the path contain the are output arguments. actions u and r on a. This problem can be attacked directly by performing a search CONCLUSION over paths starting at node n. This search can be made quite efficient if we deal with As noted in an earlier section of this paper, one token at a time. The idea is to use a we have implemented a FORTRAN program depth first search but to restrict it so that analysis system which embodies many of the we avoid visiting any node n' such that ideas presented here. This system, called a ~ C~(n'--*). While this strategy does not DAVE, [27, 28] separates program variables preclude backtracking, it tends to reduce it into classes that are somewhat similar to and generally restricts the number of nodes those shown in Figure 18. DAVE also detects visited in the search. It seems certain that all data flow anomalies of type purp' and more efficient schemes for localizing the most of the data flow anomalies of types anomalous path expression can be con- pddp' and pdup'. DAVE carries out this structed. analysis by performing a flow graph search The information gathered for the detection for each variable in a given unit, and analyz- of anomalous path expressions is valuable ing subprograms in a leafs-up order, which for other purposes. For example, it deter- assures that no subprogram invocation will mines which arguments need initialization be considered until the invoked subprogram before execution of a procedure--thus it has been completely analyzed. An improved could be used to supply this information as version of DAvE would continue to analyze a form of automatic documentation. Al- the subprograms of a program in leafs-up ternatively, this information can be used to order, but would use the highly efficient, verify assertions by the programmer con- parallel algorithms described here to either cerning arguments needing initialization. detect or disprove the presence of data flow

Computing Surveys. VoL 8, No 3, September 1976 L. D. Fosdick and L. J. Osterweil • 327 anomalies. The variable-by-variable depth suppose that sophisticated representation first search currently used in DAv~ exclu- schemes could be used to reduce the very sively, would be used only to generate a large time and space requirements of the specific anomaly bearing path, once the symbolic execution system, it also seems more efficient algorithms had shown that an clear to us that even such reduced require- anomaly was present. Such a system would ments would necessarily greatly exceed have considerably improved efficiency char- those of our proposed system. We have acteristics and, perhaps more important, finally concluded that symbolic execution could be readily incorporated into many systems currently seem more attractive" as existing which already do live stand alone diagnostic systems where their variable and availability analysis in order to greater level of detail can be used to carry perform global optimization. out more extensive program analysis, but at The apparent ease with which our anomaly greater cost. We believe, moreover, that our detection scheme could be efficiently in- proposed data flow analysis scheme can and tegrated into existing optimizing compilers should be integrated into compilers in order is a highly attractive feature and a strong to provide highly useful error diagnosis at argument for taking this approach. Other small additional cost. The diagnostic output methods for carrying out anomaly detection of a system such as ours would then be useful can be constructed, but most that we have input to a symbolic execution system. studied lack efficiency and compatibility Much has been learned from our exper- with existing compilation systems. One such iences with the current version of DAVE. method, which is quite interesting for its Believing that similar systems should be strong intuitive appeal, involves symbolic used in state-of-the-art compilers, we now execution of the program. Symbolic execu- summarize these experiences in order to tion, a powerful technique which has place in better perspective the problems and recently found applications in , benefits to be expected. program verification, and validation [8, 19, Certain programming practices and con- 22], involves determining the value of each structs which are present in FORTRAN and program variable at every node of a flow common to a number of other languages graph as a symbolic formula whose only cause difficulties for data flow analysis unknowns are the program's input values. systems such as DAVE. The handling of These formulas of course depend upon the arrays, as mentioned earlier, is one such path taken to a given node. A notation simi- example. Problems arise when different ele- lar to regular expression notation could be ments of the same array are used in in- used to represent the set of symbolic ex- herently different ways and hence have pressions for a variable at a node, cor- different patterns of reference, definition, responding to the set of paths to the node. and undefinition. Static data flow analysis If these expressions were to be stored at systems such as DAVE are incapable of their respective nodes, a flow graph search- evaluating subscript expressions and hence ing procedure could be constructed which cannot determine which array element is would be capable of detecting all the anoma- being referenced by a given subscript ex- lies described here by careful examination of pression. Thus, as stated earlier, in DAVE the way the expressions evolved along paths and in many other program analysis systems traversed by a single flow graph search. arrays are treated as though they were Moreover, because the symbolic execution simple variables. This avoids the problem of carried along far more information than being unable to evaluate subscript expres- does our proposed system, even more sions, but often causes a weakening or powerful diagnostic results are possible. blurring of analytic results. As an example, The relative weaknesses of such a method consider the program shown in Figure 19. are its lack of efficiency and the difficulty of Suppose n t is the node of GMAIN(N, E, no), incorporating it into existing compiling the flow graph of the main program, which systems. Although it seems reasonable to invokes SQUARE. Denote by R(., 1) and

Computing Surveys, Vol, 8, No, 3, September 1976 328 • Data Flow Analysis In Software Reliability

DIMENSION R(lO0,2) ments, this order may become difficult to determine. This difficulty can arise because READ(5,10)(R(I,]),I:I,]O0) the name used in a subprogram invocation 10 FORMAT(FIO.2) may not be the name of a subprogram, but rather can be a variable which has received CALL SQUARE(R) the subprogram name, perhaps through a WRITE(6,20)(R(I,2),I=I,IO0) long chain of subprogram invocations. All 20 FORMAT(IX,FIO.2) such chains must be explored in order to expose all subprogram invocations and then STOP determine the leafs-up order. Recent work END by Kallal and Osterweil [20] indicates that the AVAIL algorithm can be used to effi- SUBROUTINE SQUARE(R) ciently expose all such invocations. DIMENSION R(IO0,2) Recursive subprograms pose another ob- stacle to determining leafs-up order. Al- DO I0 I=I,I00 though recursion is not allowed in FORTRAN, I0 R(I,I)=R(I,2)**2 it is a capability of many other languages. Moreover, it is possible to write two RETURN FORTRAN subprograms such that each may END invoke the other, but such that no program execution will force a recursive calling se- FIGURE 19. A program in which failure to dis- tinguish between the differing patterns of quence. Such a program would be legal in reference, definition and undefinition of differ- FORTRAN, but would not appear to have ent array elements prevents the detection of sufficient leaf subprograms (i.e., those that data flow anomalies. invoke no others) to allow construction of the complete leafs-up order. This problem R(., 2) arbitrary elements of column 1 is not adequately handled by DAVE, how- and column 2 respectively of array R. Now ever no FORTRAN programs with this con- clearly R(., 1) E Ad(n') and R(., 2) E struction have been encountered. In any case A,(n'). In addition, it is clear that R(., 1) E current work indicates that recursive pro- Dd(--*n~) and R(-, 2) E Du(-*n'). Hence grams can be analyzed using the methods P(-m'; R(., 1))P(n'; R(., 1)) -- pddp', described here. and P(-~n'; R(., 2))P(n'; R(., 2)) --- Finally it should be observed that sub- ! purp, and we see there are two data flow program invocations involving the passing anomalies present. DAVE, however, treats of a single variable as an argument more R as a simple variable and determines that than once may be incorrectly analyzed. R E Ad(n'), R E Dd(n'), R E Dd(---~n') This occurs because DAVE assumes that all andR E A,(n--*)., Thus P(---*nI ;R)P(n';R) subprogram parameters represent different - pdrp' and P(n'; R)P(n'---~; R) -- pdrp, variables as it analyzes subprograms in and no data flow anomalies will be detected. leafs-up order. This loss of anomaly detection power is Despite these limitations, the DAvE sys- worrisome, and it is seemingly avoided only tem has proven to be a useful diagnostic when programmers call functionally distinct tool. We have used DAVE to analyze a subarrays by separate names. number of operational programs and it There are also certain difficulties involved has often found errors or stylistic short- in determining the leafs-up subprogram comings. Among the most common of these processing order referred to earlier. This have been: variables having path ex- order is important, because it ensures that pressions equivalent to purp' (referencing each subprogram will be analyzed exactly uninitialized variables), and pdup' (failing once, yet that data flow anomalies across to use a computed value) occurring simul- subprogram boundaries will be detected. taneously, usually due to a misspelling; If subprogram names are passed as argu- subprogram parameters having path ex-

Computing Surveys, Vol. 8, No 3, September 1976 L. D. Fosdick and L. J. Osterweil • 329

pressions equivalent to 1, caused by naming ther work should be done to widen the unused parameters in parameter lists; and class of errors detectable by means such as COMMON variables ha,ving path expres- those described in this paper. sions equivalent to purp or pdup' usually due to omitting COMMON declarations ACKNOWLEDGMENTS from higher level program units. The cost of using DAv~. has proven to be We want to close with a grateful recognition of relatively high, partly due to the fact that the stimulating and valuable discussions we have had on this subject with our colleagues and it is a prototype built for flexibility, and not students--especially Jim Boyle, Lori Clarke, Hal speed, and partly due to the failure to use Gabow, Shachindra Maheshwari, Carol Miesse, the more efficient algorithms described here. and Paul Zeiger--and the helpful comments of the We have observed the execution speed of referees. Finally, we gratefully acknowledge the the system to average between 0.3 and 0.5 financial assistance provided by the National seconds per source statement on the CDC Science Foundation in this work. 6400 computer for programs whose size ranged from several dozen to several thou- REFERENCES sand statements. The total cost per state- [1] A~o, A. V.; ANn ULLMAN, J. D. "Node ment has averaged between 7 and 9 cents listings for reducible flow graphs," in Proc. per statement for these test programs using of the 7th Annual ACM Symposium on Theory of Computing, 1975, ACM, New York, 1975, the University of Colorado Computing pp. 177-185. Center charge algorithm. It is, of course, [2] ALLEN, F.E. "Program optimization," in anticipated that these costs would decline Annual Review in Automatic Programming, Pergamon Press, New York, 1969, pp. 239- sharply if a production version of DAvE 307. were to be implemented. [3] ALLEN,F.E. "A basis for program optimi- z ation," in Proc. IFIP Congress 1971, North- Based on these experiences and observa- Holland Publ. Co., Amsterdam, The Nether- tions, we believe that systems like DAvE lands, 1972, pp. 385-390. can serve the important purpose of auto- [4] ALLEN, F. E.; AND COCKE, J. Graph- theoretzc constructs for program control flow matically performing a thorough initial scan analyszs, IBM Research Report RC3923, T. J. for the presence of certain types of errors. Watson Research Center, Yorktown Heights, It seems that the most useful characteristics New York, 1972. [5] ALLEN, F. E. "Interprocedural data flow of such systems are that 1) they require no analysis," in Proc. IFIP Congress 1974, human intervention or guidance and 2) they North Holland Publ. Co., Amsterdam, The Netherlands, 1974, pp. 398-402. are capable of scanning all paths for possible [6] ALLEN, F. E.; AND COCKE, J. "A program data flow anomalies. A human tester need Comm not be concerned with designing test cases [7] BALZEa, R. M. "EXDAMS: Extendable for this system, yet can be assured by the debugging and monitoring system," in system that no anomalies are present. In Proc. AFIPS 1969 Spring Jr. Computer Conf., Vol. 34, AFIPS Press, Montvale, case an anomaly is present, the system will N.J., 1969, pp. 567-580. so advise the tester and further testing or [8] CLARKE, L. A system to generate test data debugging would be necessary. Clearly such and symbolically execute programs, Dept. of Technical Report $Cu- a system is capable of detecting only a CS-060-75, Univ. of Colorado, Boulder, limited class of errors. Hence further testing 1975. would always be necessary. Through the [9] DENNIS, J.B. "First version of a data flow procedure language," in Lecture notes ~n use of a system such as DAVE, however, the computer science 19, G. Goos and J. Hart- thrust of this testing can be more sharply manis (Eds.), Springer-Verlag, New York, 1974, pp. 241-271. focussed. It seems that these systems could [10] FAmLE~C,R.E. "An experimental program be most profitably employed in the early testing facility," in Proc. First National phases of a testing regimen (e.g., as part of a Conf. on , 1975, IEEE $75CH0992-8C, IEEE, New York, 1975, compiler) and used to guide and direct later pp. 47-55. testing efforts involving more powerful [11] GOLDSTINE, H. H.; AND VON NEUMANN, J. Planning and coding problems for an elec- systems that employ such techniques as tronic computing instrument," in John yon symbolic execution. Towards this end, fur- Neumann, collected works, A. H. Taub (Ed.),

Computing Surveys, V~I. 8, No. 3, September 1976 330 • Data Flow Analysis In Software Reliability

Pergamon Press, London, England, 1963, [24] KNU~H,D. E. The art of computer program- pp. 80-235. ming, Vol. I fundamental algorithms, (2d [12] HABERMANN,A.N. Path expressions, Dept. Ed.), Addison Wesley Publ. Co., Reading, of Computer Science Technical Report, Mass., 1973. Carnegie-Mellon Univ., Pittsburgh, Pa., [25] KNUTH, D. E. An empirical study of 1975. FORTRAN programs, Software--Practice [13] HARARY,F. Graph theory, Addison-Wesley and Experience 1, 2 (1971), 105-134. Publ. Co., Reading, Mass., 1969. [26] MILLER, E. F., JR. RXVP, FORTRAN [14] HECHT, M. S.; AND ULLMAN, J. D. "Flow automated verificatwn system, Program Vali- graph reducibility," SIAM J. Computing dation Project, General Research Corp., 1, (1972), 188-202. Santa Barbara, Calif., 1974, pp. 4. [15] HECHT, M. S.; ANn ULLMAN, J.D. "Char- [27] OSTERWEIL, L. J.; AND FOSDICK, L D. acterizations of reducible flow graphs," "DAVE--a FORTRAN program analysis J. ACM 21, 3 (July 1974), 367-375. system," in Proc. Computer Science and [16] HECHT, M. S.; AND ULLMAN, J. D. "A Statistics: 8th Annual Symposium on the simple algorithm for global data flow analysis Interface, 1975, pp. 329-335. problems," SIAM J. Computing 4 (Dec. [28] OSTERWEIL, L. J.; AND FOSDICK, L. D.. 1975), 519-532. "DAVE--a validation, error detection and [17] HOPCROFT, J.; AND TARJAN, R. E. "Effi- documentation system for FORTRAN pro- cient algorithms for graph manipulation," grams," Software--Practice and Experience Comm. ACM 16 (June 1973), 372-378. (to appear 1976). [18] HOPCROFT,J. E.; AND ULLMAN, J.D. For- [29] RODRIGUEZ,J. D. A graph model for parallel mal languages and their relation to automata, computatwn, Report MAC-TR-64, Project Addison Wesley Publ. Co., Reading, Mass., MAC, MIT, Cambridge, Mass., 1969. 1969. [30] ROSEN, B. Data flow analysis for recurszve [19] HOWDEN,W.E. "Automatic case analysis PL/I programs, IBM Research Report of programs," in Proc. Computer Science and RC5211, T. J. Watson Research Center, Statistics: 8th Annual Symposium on the Yorktown Heights, New York, 1975. Interface, 1975, pp. 347-352. [31] SCHAEFFER, M. A mathematzcal theory of [29] KALLAL, V.; AND OSTERWEIL, L. J. Con- global program optimization, Prentice-Hall structing flowgraphs for assembly language Inc., Englewood Cliffs, N. J., 1973. programs, Dept. of Computer Science Tech- [32] STUCKI, L. G. "Automatic generation of nical Report Univ. of Colorado, Boulder, self-metric software," in Proc. IEEE Sym- (to appear 1976). [21] KARP, R.M. "A note on the application of posium on Computer Software Reliabdity, graph theory to digital computer program- 1973, IEEE $73CH0741-9CSR, IEEE, New ming," Information and Control 3 (1960), York, 1973, pp. 94-100. 179-190. [33] TARJAN, R. E. "Depth-first search and [22] KINo, J. C. "A new approach to program linear graph algorithms," SIAM J. Comput- testing," in Proc. Internatl. Conf. on Re- ing (Sept. 1972), 146-160. liable Software, 1975, IEEE ~75CH0940- [34] TARJAN, R. E. "Testing flow graph re- 7CSR, IEEE, New York, 1975, pp. 228-233. ducibility," J. Computer and System Sciences [23] KENNEDY, K.W. "Node listings applied to data flow analysis," in Proc. of 2nd ACM 9, 3 (Dec. 1974), 355-365. Symposium on Pmncipals of Programming [35] ULLMAN, J. D. "Fast algorithms for the Languages, 1975, ACM, New York, 1975, ehmination of common subexpressions," pp. 10-21. Acta Informatiea 2 (1973), 191-213.

Computing Surveys, Vol. 8, No. 3, September 1976