
From: AAAI Technical Report WS-93-02. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.

Automating Path Analysis for Building Causal Models from Data: First Results and Open Problems

Paul R. Cohen, Lisa Ballesteros, Adam Carlson, Robert St. Amant
Experimental Knowledge Systems Laboratory
Department of Computer Science, University of Massachusetts, Amherst
[email protected]@cs.umass.edu [email protected] [email protected]

Abstract

Path analysis is a generalization of multiple linear regression that builds models with causal interpretations. It is an exploratory or discovery procedure for finding causal structure in correlational data. Recently, we have applied path analysis to the problem of building models of AI programs, which are generally complex and poorly understood. For example, we built by hand a path-analytic causal model of the behavior of the Phoenix planner. Path analysis has a huge search space, however. If one measures N parameters of a system, then one can build O(2^(N^2)) causal models relating these parameters. For this reason, we have developed an algorithm that heuristically searches the space of causal models. This paper describes path analysis and the algorithm, and presents preliminary empirical results, including what we believe is the first example of a causal model of an AI system induced from performance data by another AI system.

1. INTRODUCTION

This paper describes a statistical discovery procedure for finding causal structure in correlational data, called path analysis [Asher, 83; Li, 75], and an algorithm that builds path-analytic models automatically, given data. This work has the same goals as research in function finding and other discovery techniques, that is, to find rules, laws, and mechanisms that underlie nonexperimental data¹ [Falkenhainer & Michalski 86; Langley et al., 87; Schaffer, 90; Zytkow et al., 90]. Whereas function finding algorithms produce functional abstractions of (presumably) causal mechanisms, our algorithm produces explicitly causal models. Our work is most similar to that of Glymour et al. [87], who built the TETRAD system. Pearl [91; 93] and Spirtes [93] have recently developed causal induction algorithms with a more general mathematical basis than path analysis; they rely on evidence of nonindependence, a weaker criterion than path analysis, which relies on evidence of correlation. Those algorithms will be discussed in a later section.

We developed the path analysis algorithm to help us discover causal explanations of how a complex AI planning system works. The system, called Phoenix [Cohen et al., 89], simulates forest fires and the activities of agents such as bulldozers and helicopters. One agent, called the fireboss, plans how the others, which are semi-autonomous, should fight the fire; but things inevitably go wrong: winds shift, plans fail, bulldozers run out of gas, and the fireboss soon has a crisis on its hands. At first, this chaos was appealing and we congratulated ourselves for building such a realistic environment. However, we soon realized that we could explain very little about the behavior of the fireboss. We turned to regression analysis to answer some questions, such as, "Which has more impact on the time to contain a fire: the wind speed or the number of times the fireboss must replan?" But although regression assumed these factors interacted, it provided no explanation of their causal relationship. For example, we knew that the wind speed could affect the incidence of replanning and not vice versa, but this causal, explanatory knowledge was not to be found in regression models. Nor would automated regression procedures (e.g., stepwise multiple regression) find causal models of our planner. Path analysis, however, is a generalization of regression analysis that produces explicitly causal models. We built one such model of Phoenix by hand, and by automating path analysis as we describe below, we have been able to discover other causal explanations of how the Phoenix fireboss works.

Readers who are familiar with regression analysis might skip to the end of the next section, where we introduce path analysis, or to Section 3, where we illustrate a path analysis of Phoenix. Section 4 describes our algorithm. Section 5 discusses two experiments, an informal one in which we applied the algorithm to Phoenix data, and a factorial experiment in which the behavior of the algorithm was probed by applying it to artificial data.

¹ The term "nonexperimental" is perhaps confusing, because data are usually collected in an experiment. Nonexperimental means that the experiment is over and the opportunity to manipulate variables to see effects has passed. Causal hypotheses must therefore be generated and tested with the data alone.
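The sizes of the model spaces mentioned above are easy to check directly. The sketch below (our own illustration, not from the paper) counts N × N adjacency matrices, with and without the N diagonal (self-loop) cells:

```python
# An N-variable path model can be encoded as an N x N adjacency matrix,
# giving 2**(N*N) possible models; forbidding self-loops removes the N
# diagonal cells and leaves 2**(N*N - N), the figure used in Section 5.
def model_space(n, allow_self_loops=True):
    cells = n * n if allow_self_loops else n * n - n
    return 2 ** cells

for n in (3, 4, 5):
    print(n, model_space(n), model_space(n, allow_self_loops=False))
```

With five variables and no self-loops this gives 2^20 = 1,048,576 models, matching the search-space size reported for Experiment 1 below.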

AAAI-93 Knowledge Discovery in Databases Workshop 1993, Page 153

2. BACKGROUND: REGRESSION

Path analysis is a generalization of multiple linear regression, so we will begin with regression. Simple linear regression finds a least-squares line relating a single predictor variable x to a performance variable y. A least-squares line is one that minimizes the sum of squared deviations of predicted values from actual values. That is, simple linear regression finds a line ŷ = bx + a that minimizes Σ_i (y_i − ŷ_i)². Multiple linear regression finds least-squares rules (i.e., planes and hyperplanes) for more than one predictor variable, rules of the form ŷ = b1 x1 + ... + bk xk + a. The regression equation is better represented in standardized form as Ŷ = β1 X1 + β2 X2 + β3 X3 (standardized variables are denoted with uppercase letters). The interpretation of this model is that a change in X1 of one standard deviation, s1, produces β1 standard deviations change in Ŷ. Thus, beta coefficients are comparable: if β1 = .4 and β2 = .8, then a standard deviation change in X2 has twice the influence on Y as a standard deviation change in X1.

To find the beta coefficients, a set of equations is derived from the regression equation and solved. Written in terms of correlations, they are:

  r_YX1 = β1 + β2 r_X2X1 + β3 r_X3X1
  r_YX2 = β1 r_X1X2 + β2 + β3 r_X3X2        (1)
  r_YX3 = β1 r_X1X3 + β2 r_X2X3 + β3

Clearly, with these three equations we can solve for the three unknown beta coefficients. We have not shown that these coefficients guarantee that Ŷ = β1 X1 + β2 X2 + β3 X3 is a least-squares rule, but the interested reader can find this demonstration in [Li, 75] or any good statistics text.

The three equations (1) are called the normal equations, and they have an interesting interpretation, illustrated in Figure 1. Consider the first normal equation, r_YX1 = β1 + β2 r_X2X1 + β3 r_X3X1. The β1 term is represented in Figure 1 by the direct path between X1 and Y; the second term is represented by the indirect path from X1 through X2 to Y; and the third term is represented by the indirect path through X3. Thus, the correlation r_YX1 is given by the sum of the weights of three paths in Figure 1, where the weight of a path is the product of the coefficients (either correlations or betas) along the constituent links of the path. The second and third normal equations have similar interpretations in Figure 1. By convention, curved arcs without arrows represent correlations and directed arcs represent causes.

Figure 1: The path model that corresponds to the multiple linear regression of Y on X1, X2 and X3.

The causal interpretation of beta coefficients is plausible because betas are standardized partial regression coefficients; they represent the effect of a predictor variable on Y when all the other predictor variables are fixed. You can interpret β1 as what happens to Y when only X1 is systematically varied. In this sense, beta coefficients provide a statistical version of the control you get in an experiment in which X2 and X3 are fixed and X1 is varied; in such an experiment, the effect on Y is attributable to X1. (Alternatively, the effects might be due to an unmeasured or latent variable that is correlated with X1; we will not consider this case here.)

So far we have described how a prediction model Ŷ = β1 X1 + β2 X2 + β3 X3 gives rise to a set of normal equations, and to beta coefficients that make the prediction model a least-squares fit to our data. If this were all we could do, path analysis would be identical to linear regression and not worth the effort. The power of path analysis is that we can specify virtually any prediction model we like, and then solve for beta coefficients that ensure the model is a least-squares fit to our data.

3. PATH ANALYSIS OF PHOENIX DATA

Let us illustrate a path model other than the regression model with an example from the Phoenix system [Cohen & Hart, 93; Cohen et al., 89; Hart & Cohen, 92]. We ran Phoenix on 215 simulated forest fires and collected many measurements after each trial, including:

  WindSpeed   The wind speed during the trial
  RTK         The ratio of fireboss "thinking speed" to the rate at which fires burn
  NumPlans    The number of plans tried before the fire is contained
  FirstPlan   The name of the first plan tried
  Fireline    The length of fireline dug by bulldozers
  FinishTime  The amount of simulated time required to contain the fire

We specified a prediction model and the path model shown in Figure 2, and solved for the path coefficients on each link.

The coefficient on a path from X to Y is the correlation of X and Y if X is uncorrelated with any other node that points to Y; otherwise it is the beta coefficient from the regression of Y on all the correlated nodes that point directly to it (including X). With this rule it is easy to solve for path coefficients by hand, using the correlation matrix and running multiple regressions as necessary.

The estimated correlation between two nodes is the sum of the weights of all legal paths between them. A legal path goes through no node twice, and, for each node on the path, once a node has been entered by an arrowhead, it cannot be left by an arrowhead. The weight of a path is the product of the coefficients along it. For example, the estimated correlation between FirstPlan and FinishTime is the sum of three paths:

  r̂_FirstPlan,FinishTime = (−.432 × .287) + (.219 × .506) + (−.432 × .843 × .506) = −.197

As it happens, this estimated correlation is very close to the actual empirical correlation, calculated from our data.

Figure 2: The result of a path analysis of Phoenix data.

The disparities or errors between estimated and actual correlations of FinishTime with all the other factors are shown in Table 1. With the exception of the disparity between the correlations of WindSpeed and FinishTime, the estimated and actual correlations accord well, suggesting that Figure 2 is a good model of our data.

                                  WS      RTK     #Pln    First   Fline
  r̂(Factor, FinishTime)          .118   −.488    .704   −.197    .765
  r(Factor, FinishTime)         −.053   −.484    .718   −.193    .755

Table 1: Errors in estimated correlations. The first row is the estimated correlation of FinishTime and the factor listed in each column; the second row is the actual, empirical correlation.

4. AUTOMATIC GENERATION OF PATH MODELS

The question that motivated the current research is whether models like the one in Figure 2 can be generated automatically by a heuristic search algorithm. This section describes such an algorithm.

The search space for the algorithm is the set of all possible path models. We can represent any path model as an adjacency matrix of size N × N, where N is the number of variables in the model. Thus the search space is of size 2^(N^2).

The algorithm uses a form of best-first search. A single path model constitutes a state, while the sole operator is the addition of an arc to one variable in the model from another (possibly new) variable. We begin with a graph consisting of just the dependent variables. During the search we maintain a list of all the models and all possible modifications to those models. At each step in the search we select the best modification in the list, apply it to its target model, evaluate the new model, and add it to the list. We continue until either we have reached an acceptable model or we can make no more significant improvements. The most complex issue to be addressed in applying this algorithm is constructing the evaluation function.

We must first distinguish two senses in which a model can be "good." A common statistical measure of goodness is R², the percentage of variance in the dependent variable, Y, accounted for by the independent variables X1, X2, .... If R² is low, then other variables besides X1, X2, ... influence Y, and our understanding of Y is therefore incomplete. However, for any set of independent variables, no model accounts for more of the variance in Y than the regression model, in which all independent variables are correlated and all point directly to the dependent variable (e.g., Figure 1). No wonder this model accounts for so much variance: every variable directly influences Y! It is not a parsimonious model. Nor is it likely to be a plausible causal model of any system we analyze. For example, a regression analysis of the Phoenix data treats WindSpeed and Fireline as causes at the same "level," that is, both pointing directly to FinishTime, but we know that WindSpeed is "causally upstream" of Fireline. In fact, we know that wind speed influences the rate at which fires burn, and so probably influences the amount of fireline that is cut. So R², the statistical measure of goodness, is not necessarily the researcher's pragmatic measure of goodness. The researcher wants a model that is a plausible causal story, and that includes as few correlational and causal relations among variables as possible, so causal influences are localized, not dissipated through a network of correlations. Such a model will necessarily have a lower R² than a highly-connected model, but will often be preferred. In fact, when we constructed the Phoenix model by hand, an important measure of goodness was the errors between estimated and empirical correlations, and we never even calculated R².
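Those estimated correlations are just sums of path weights, so the FirstPlan–FinishTime figure quoted in Section 3 is easy to recompute:

```python
# Recompute the estimated FirstPlan-FinishTime correlation from Section 3.
# Each legal path contributes the product of the coefficients on its links;
# the coefficient tuples below are read off the Figure 2 example.
paths = [
    (-0.432, 0.287),
    (0.219, 0.506),
    (-0.432, 0.843, 0.506),
]

def path_weight(coeffs):
    w = 1.0
    for c in coeffs:
        w *= c
    return w

r_hat = sum(path_weight(p) for p in paths)
print(round(r_hat, 3))  # close to the -.197 estimate reported in Table 1
```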

Another distinction is between modification evaluation and model evaluation. Assuming that good models are clustered together in model space, a few heuristics can move the search into that general region of the space. These are the modification heuristics. A model evaluation function should dominate the search once the modification evaluation heuristics have brought the search into the right neighborhood. This is achieved by including the model evaluation function as a term in the modification evaluation function.

We rely on a variety of heuristics to guide search toward good models and to evaluate a path model. These comprise three general classes. The first contains what we might call syntactic criteria:

- no cycles are allowed;
- there must be a legal path (as defined in Section 3) from every variable to a dependent variable;
- a dependent variable may have no outgoing arcs.

The second class contains domain-independent heuristics. These apply to any path analysis problem; however, the weighting of a heuristic may depend on the particular problem. Some heuristics we have tried are:

- R², the statistical measure of goodness;
- the predicted correlation error (minimize the total squared error between the actual correlation matrix and the predicted correlation matrix for the model);
- parsimony (i.e., the ratio of variables to arcs, or the number of arcs that don't introduce a new variable);
- the correlation between the variables being connected by a new link;
- the total number of variables and arcs.

The third class represents domain/problem-dependent forms of knowledge. These include knowledge that:

- particular variables are independent;
- particular variables are likely/unlikely causes of others;
- a particular range of values for a domain-independent heuristic is appropriate for the problem.

Our evaluation of a modification is a function of these heuristics. In the current implementation we use a weighted sum of the actual correlation between the variables to be linked, the predicted correlation error, and the evaluation of the resulting model.

The modification evaluation step hides a good deal of computation required for the heuristic evaluation functions, including path coefficients in the model and the correlation estimates between variables. In the basic algorithm we calculate these parameters from scratch for each newly generated model. Because these calculations dominate search cost, we are currently working on improving the algorithm by incrementally calculating changes to each model's evaluation. This should greatly increase efficiency.

5. EXPERIMENTS

We have run several experiments to demonstrate the performance of our algorithm and probe factors that affect performance. In the first experiment we tested whether the algorithm could generate a model from the Phoenix data. The second experiment was more exhaustive; for this we used artificial data.

5.1. EXPERIMENT 1

We provided the algorithm with data from 215 trials of the Phoenix planner, specifically, WindSpeed, RTK, NumPlans, Fireline and FinishTime (we dropped FirstPlan from the data set). We designated FinishTime the dependent variable. The search space of models was large, 2^(N^2 − N) = 1,048,576, but the model-scoring and modification-scoring functions were sufficiently powerful to limit the solution space to 6856 models. When the algorithm terminated, after four hours' work on a Texas Instruments Explorer II+ Lisp Machine, its best two models were the ones shown in Figure 3. These models have much to commend them, but they are also flawed in some respects. A good aspect of the models is that they get the causal order of WindSpeed, NumPlans, and Fireline right: wind speed is causally upstream of the other factors, which are measures of the behavior of the Phoenix planner. However, both models say WindSpeed causes RTK when, in fact, these are independent variables set by the experimenter. (Later, probing this curiosity, we realized that due to a sampling bias in our experiment, WindSpeed and RTK covary, so connecting them is not absurd.)

Figure 3: The best and second best Phoenix models found by the algorithm.

A disappointing aspect of the models is that neither recognizes the important influence of RTK on NumPlans and the influence of NumPlans on Fireline.

In defense of the algorithm, however, we note that it used model-scoring and modification-scoring functions that valued R², while we, when we built Figure 2 by hand, were concerned primarily about the errors between estimated and empirical correlations, and not concerned at all about R². Thus we should not be too surprised that the algorithm did not reproduce our model in Figure 2.

Although the algorithm ran slowly, and is in some ways disappointing, it did produce what we believe is the first causal model of a complex software system ever generated automatically.

5.2. EXPERIMENT 2

The first experiment raised more questions than it answered: Does the performance of the algorithm depend on the sample variance for each variable? Does performance depend on the number of data for each variable? How do we know whether the algorithm is finding the "right" model? To address these and other questions we ran an experiment in which we constructed path models that represented "truth" and tested how frequently our algorithm could discover the true models. Specifically, we followed these steps:

1. Randomly fill some cells in an adjacency matrix to produce a path model.
2. Randomly assign weights to the links in the path model, subject to the constraint that the R² of the resulting model should exceed .9.
3. Generate data consistent with the weights in step 2.
4. Submit the data to our algorithm and record the models it proposed.
5. Determine how well our algorithm discovered the true model.

Steps 3 and 5 require some explanation. Imagine someone asks you to generate two samples of numbers drawn from a normal (Gaussian) distribution with mean zero such that the correlation of the samples is a particular, specified value, say, .8. Now make the problem more difficult: generate N columns of numbers so that all their pairwise correlations have particular, specified values. Solving this problem (step 3, above) ensures that we generate data with the same correlational structure as the path model specified in step 2. The details of the process are sketched in the Appendix.

We evaluated the performance of our algorithm (step 5, above) by two criteria:

Shared path score: For each true model, make a list of all the paths from each independent variable to the dependent variable; what fraction of these paths exist in the model discovered by our algorithm?

Shared link score: For each true model, make a list of all the links between variables; what fraction of these links exist in the model discovered by our algorithm?

Our experiment was designed to test the effects of sample size on the performance of the algorithm. The Phoenix data included 215 values of each variable, but for this experiment we looked at four levels of the factor: 10, 30, 50, and 100 data per sample. We thought that the performance of the algorithm might deteriorate when sample sizes became small, because the effects of outlier values in the samples would be exacerbated. We generated five true path models with four variables, shown in Figure 4. For each model we generated 12 variants, that is, 12 sets of data for variables A, B, C and D that conformed to the correlation structure established in step 2, above. This produced 60 models. We ran our algorithm on these variants in each of the 4 conditions just described. Thus, the algorithm ran 240 times.

Figure 4: Five path models used to test the algorithm.

The algorithm found the true model in 109 of the 240 trials. Summary statistics for these trials, and for trials in which the algorithm did not find the true model, are presented in Table 2. When the algorithm did find the true model, it explored 174.6 models, on average. This is about four percent of the 4096 models that were possible. Trials that did not find the true model explored only 86.96 models, which suggests that one reason they failed to find the true model is that they gave up too soon. This in turn suggests that our criteria for terminating a search can be improved.

In all trials, the average score of the best model found by the algorithm was better than the average score of the true model. Our immediate inclination was to say the scoring criteria led the algorithm astray, away from the true model. In fact, the problem arises from using R² as our model evaluation function. The value of R² monotonically increases as the number of links in the model increases. Consequently, in most of our trials, the model from which we generated the data is not the model with the highest score, since the addition of new links to it produced a model with a higher R². We must investigate other model evaluation metrics before we can evaluate whether or not our algorithm will find the best model.

We would like to place more stock in two measures of structural similarity between the true model and the best found model. The shared path score is the proportion of paths in the best found model that also exist in the true model. For example, the top-left model in Figure 4 has 5 paths from A, B and C to D, so if the best found model were, say, the top-right model, which has only three paths from A, B and C to D, then the shared path score would be 0.6.
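The shared path score can be computed mechanically by enumerating directed paths into the dependent variable. The sketch below uses two toy models of our own invention (not the Figure 4 models):

```python
# Sketch of the shared path score: enumerate every directed path from each
# independent variable to the dependent variable in the true model, then
# measure what fraction also appear in the found model.
def paths_to(arcs, dep):
    """All directed paths (as node tuples) ending at dep."""
    children = {}
    for a, b in arcs:
        children.setdefault(a, []).append(b)
    nodes = {n for arc in arcs for n in arc}
    found = []
    def walk(node, trail):
        if node == dep:
            found.append(tuple(trail))
            return
        for nxt in children.get(node, []):
            if nxt not in trail:          # a legal path visits no node twice
                walk(nxt, trail + [nxt])
    for start in nodes - {dep}:
        walk(start, [start])
    return set(found)

def shared_path_score(true_arcs, found_arcs, dep):
    true_paths = paths_to(true_arcs, dep)
    return len(true_paths & paths_to(found_arcs, dep)) / len(true_paths)

true_m = {("A", "B"), ("B", "D"), ("A", "D"), ("C", "D")}   # 4 paths into D
found_m = {("A", "B"), ("B", "D"), ("C", "D")}              # shares 3 of them
print(shared_path_score(true_m, found_m, "D"))
```

The shared link score is the same computation applied to individual arcs rather than whole paths.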

Another criterion is the shared link score, which is the fraction of the links connecting variables in the true model that are also in the best found model.

Unfortunately, neither score strongly differentiates trials in which the true model was found from those in which it wasn't. On average, 56–61% of the paths, and 50–54% of the links, in the true model are also in the best found model. These numbers are not high. On the other hand, if the disparity between the true model and the best found model were just two links, then the average shared link score would be 52%. So although the shared path and shared link scores are low, they suggest that the best found model differed from the true one by little less than two links on average. Clearly, the algorithm can be improved, but it performs better than the first impression given by the shared path score and shared link score.

                          Trials that found     Trials that did not
                          the true model        find the true model
                          Mean      Std.        Mean      Std.
  Number of models        174.600   93.300      86.962    72.770
  Best model score        .974      .027        .954      .018
  True model score        .847      .258        .708      .417
  Shared path score       .612      .202        .565      .133
  Shared link score       .542      .193        .501      .144

Table 2: Summary statistics from 240 trials.

We ran a one-way analysis of variance to find the effects of the number of data points in our samples. The dependent variables were the five measures in Table 2. Happily, sample size appears to have no impact on the number of models expanded by the algorithm, the score of the best found model, or the shared path or shared link scores.

6. RELATED ALGORITHMS

6.1 INDUCTIVE CAUSATION ALGORITHM

The underlying probability structure of a set V of random variables in a system can be described by a joint probability distribution P. Pearl and Verma [1991] have developed a method for inducing causal structure from P̂, a sampled distribution of V, which they refer to as the Inductive Causation algorithm (IC algorithm). Causal models are represented by directed acyclic graphs (DAGs) whose nodes represent random variables and whose edges represent causal relationships between those variables. DAG construction is based on the notion of conditional independence, under the assumptions that P̂ is stable and that model minimality holds. P is considered to be stable if it has a unique causal model. Minimality means it is reasonable to rule out any model for which there exists a simpler model that is equally consistent with the data. Two variables x and y are considered independent if a change in either of them does not cause a change in the other. Conditional independence can be stated in mathematical terms as P(x|y,z) = P(x|z). Intuitively, this means that given some knowledge about the value of x given z, discovering knowledge about y does not affect what we know about x. In this case, x and y are conditionally independent given z. This condition is also known as the Markov condition applied to graphs: for all graph vertices x, x is independent of its remote ancestors given its parents.

To find D, a causal model, the algorithm begins with P̂ by looking for a set S_xy ⊆ V for all pairs of variables x and y such that x and y are conditionally independent on z, for all z ∈ S_xy. If no such set can be found and x and y are dependent in every context, then an undirected link is placed between x and y to represent that there is a potential causal influence between x and y.

In the second step, for all pairs of unconnected variables having undirected links to a common variable z: if z ∈ S_xy then do nothing; otherwise, add arrowheads pointing at z. The reasoning behind this step is that if x and y are not independent conditional on z, but are correlated with z, then this correlation must be due to their common effect on z or to a common cause.

Finish building D by recursively adding arrowheads, applying the following rules:

1. If x and y are joined by an undirected link and there exists some directed path from x to y, then add an arrowhead at y. Since x is known to affect y through x's effect on some other variable, the direction of potential causal influence between x and y must be toward y (this prevents cycles).

2. If x and y are not connected but there exists a variable z such that there is a direct effect of x on z and there is an undirected link from z to y (i.e., x → z — y), then add an arrowhead at y. Nodes x and y are not connected because they were found to be conditionally independent on some S_xy. If z ∉ S_xy, then arrowheads would have been placed towards z from both x and y in step two, for the reason stated above. Therefore, x and y must be conditionally independent on z, and since we know x has a causal influence on z, the direction of potential causal influence must be from z to y in order for the conditional independency to hold.

Finally, if there is a unidirected or bidirected link from x to y, then mark all unidirected links from y to z in which z is not connected to x.
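The collider-orientation rule in the "second step" above can be sketched directly from its description. The data structures and names below are our own, not Pearl and Verma's:

```python
# Sketch of the IC algorithm's second step: for each unconnected pair (x, y)
# with a common neighbour z, point arrowheads at z (x -> z <- y) unless z
# was in the separating set S_xy that made x and y conditionally independent.
def orient_colliders(skeleton, sepset):
    """skeleton: set of frozenset({a, b}) undirected edges.
    sepset: dict mapping a pair (x, y), with x < y, to its separating set.
    Returns the directed arcs (tail, head) introduced by the collider rule."""
    nodes = sorted({n for edge in skeleton for n in edge})
    arcs = set()
    for i, x in enumerate(nodes):
        for y in nodes[i + 1:]:
            if frozenset((x, y)) in skeleton:
                continue                      # rule applies to unconnected pairs
            for z in nodes:
                if z in (x, y):
                    continue
                if (frozenset((x, z)) in skeleton
                        and frozenset((y, z)) in skeleton
                        and z not in sepset.get((x, y), set())):
                    arcs.add((x, z))          # x -> z <- y
                    arcs.add((y, z))
    return arcs

# Classic collider: X and Y are marginally independent (empty separating
# set) but both adjacent to Z, so the rule orients X -> Z <- Y.
skeleton = {frozenset(("X", "Z")), frozenset(("Y", "Z"))}
arcs = orient_colliders(skeleton, {("X", "Y"): set()})
print(sorted(arcs))
```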

Page 158 Knowledge Discovery in Databases Workshop 1993 AAAI-93

The causal model resulting from analysis of a data set by the IC algorithm will contain four types of links representing different relationships between the variables they connect: unmarked unidirectional (potential causal influence), marked unidirectional (direct causal influence), bidirectional (evidence of a common cause), and undirected (undetermined relationship). The model is used as a template for developing a causal theory which is consistent with the data. That theory is built upon the following theorems:

If every link of a directed path from x to z is marked, then x has a causal influence on z.

A marked unidirectional link from x to y suggests a direct causal influence of x on y.

The existence of a bidirectional link between x and y suggests a common cause z affecting both x and y, and no direct causal influence between x and y.

Pearl and Verma have shown that particular patterns of dependencies dictate a causal relationship between variables and that the IC algorithm can be used to recover the structure of those causal relationships. However, there are situations for which the method returns poor or inconclusive results. With respect to the latter, if there are no independencies between the variables, all of the variables will be interconnected with undirected links, since it may be impossible to determine the nature of the potential causal influence. It may also be impossible to determine the nature of a potential causal influence between two variables when there are no control variables acting on either of the variables in question. If the direction of potential causal influence between variables x and y is uncertain but there exists a variable z that is known to affect x and not y, then we can eliminate the possibility of x causing y; z is known as a control variable.

6.2 TETRAD II

Spirtes et al. [1993] take a different approach to discovering causal structure, based on constraints on correlations. Their approach is implemented in the Tetrad II program. During search, constraints implied by a model are compared to constraints found in the data, and the program returns a model that best reflects the constraints in the data. To describe this method it will be necessary to give a few definitions:

An open path is a directed link from variable x (the source) to variable y (the sink).

A trek is a pair of open paths between two variables x and y, having the same source and intersecting only at that source. One of the paths can be the empty path, which consists of one vertex (the source of the trek).

Partial equations represent correlational constraints on three variables. Let x, y, and z be three measured variables in a causal model; then rxz = rxy · ryz is their partial equation. This is equivalent to saying that the partial correlation between x and z (with y held constant) is equal to zero (known as a vanishing partial). For a linear model, vanishing partials are equivalent to conditional independencies.

Tetrad equations represent correlational constraints on four variables. If x, y, z, and v are four measured variables in a causal model, one of the following is their tetrad equation:

rxy · rzv = rxz · ryv, or, rxy · rzv − rxz · ryv = 0
rxy · rzv = rxv · ryz, or, rxy · rzv − rxv · ryz = 0
rxz · ryv = rxv · ryz, or, rxz · ryv − rxv · ryz = 0

where rij is the sample correlation between variables i and j. These constraints are known as vanishing tetrad differences. A DAG, G, linearly implies rxy · rzv − rxz · ryv = 0 iff either rxy or rzv = 0 and rxz or ryv = 0, or there is a set Q of random variables in G such that rxy.Q = rzv.Q = rxz.Q = ryv.Q = 0.

The associated probability, P(t), of a vanishing tetrad difference is the probability of getting a tetrad difference equal to or larger than that found for the sample, under the assumption that the difference vanishes in the population. P(t) is determined by lookup in a chart for the standard normal distribution.

T-maxscore is the sum of all P(t) implied by the model and judged to hold in the population.

Tetrad score is the difference between the sum of all P(t) implied by the model and judged to hold in the population, and the weighted sum of (1 − P(t)) for P(t) implied by the model that are not judged to hold in the population.

The input to Tetrad II is a sample size, a correlation or covariance matrix, and a DAG representing an initial causal structure, having no edges between measured variables and having all latent variables connected. The search generates all possible models derived by the addition of one edge to the initial model, and each model is ranked by its tetrad score, which rewards a model for implying constraints that are judged to hold in the population and penalizes it for implying constraints that are not judged to hold in the population. For example, consider the DAGs of Figure 5. 5a implies the following vanishing tetrad differences:

rxy · rzw − rxz · ryw = 0
rxy · rzw − rxw · ryz = 0
rxz · ryw − rxw · ryz = 0

If the population distribution is represented by 5b, then the only vanishing tetrad difference implied by 5a that exists in the population is the second one.
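These correlational constraints are easy to check numerically. The sketch below is our own illustration, not code from Tetrad II; the coefficients and variable names are arbitrary. It simulates a causal chain to show that the partial equation rxz = rxy · ryz holds, and a single latent common cause of four measured variables, which (like a single-source model such as Figure 5a) linearly implies all three vanishing tetrad differences.

```python
import numpy as np

def tetrad_differences(R, i, j, k, l):
    """The three tetrad differences among variables i, j, k, l of the
    correlation matrix R: r_ij*r_kl - r_ik*r_jl, r_ij*r_kl - r_il*r_jk,
    and r_ik*r_jl - r_il*r_jk."""
    return (R[i, j] * R[k, l] - R[i, k] * R[j, l],
            R[i, j] * R[k, l] - R[i, l] * R[j, k],
            R[i, k] * R[j, l] - R[i, l] * R[j, k])

rng = np.random.default_rng(0)
n = 200_000

# Partial equation: in the chain x -> y -> z, r_xz = r_xy * r_yz,
# i.e., the partial correlation of x and z with y held constant vanishes.
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)
z = 0.7 * y + rng.normal(size=n)
r_xy = np.corrcoef(x, y)[0, 1]
r_yz = np.corrcoef(y, z)[0, 1]
r_xz = np.corrcoef(x, z)[0, 1]
print(abs(r_xz - r_xy * r_yz))   # close to zero (sampling error only)

# Tetrad differences: one latent common cause of X, Y, Z, W linearly
# implies that all three tetrad differences vanish.
latent = rng.normal(size=n)
X = 0.9 * latent + rng.normal(size=n)
Y = 0.8 * latent + rng.normal(size=n)
Z = 0.7 * latent + rng.normal(size=n)
W = 0.6 * latent + rng.normal(size=n)
R = np.corrcoef([X, Y, Z, W])
for d in tetrad_differences(R, 0, 1, 2, 3):
    print(abs(d))                # all close to zero
```

In Tetrad II the sampling variability of these differences is handled through the associated probability P(t); here we simply rely on the large sample to make the implied differences visibly small.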
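The score-guided search over one-edge extensions can be sketched as a best-first loop. Everything in this sketch is our own simplification: models are frozen sets of edges, and `tetrad_score` and `t_maxscore` are caller-supplied stand-ins for the P(t)-based scores defined above. The pruning rule, discussed later in the paper, relies on the fact that adding edges never increases the set of implied tetrad equations, so a model whose t-maxscore is below the best tetrad score found can be discarded along with all its descendants.

```python
import heapq
from itertools import count

def tetrad_search(initial_edges, candidate_edges, tetrad_score, t_maxscore):
    """Best-first search over models (sets of edges) in the style of
    Tetrad II: expand models in order of t-maxscore, and discard any
    model whose t-maxscore falls below the best tetrad score found,
    since none of its descendants can score higher."""
    best = frozenset(initial_edges)
    best_score = tetrad_score(best)
    tie = count()                       # tiebreaker so the heap never compares sets
    frontier = [(-t_maxscore(best), next(tie), best)]
    while frontier:
        neg_tmax, _, model = heapq.heappop(frontier)
        if -neg_tmax < best_score:      # prune this whole subtree of models
            continue
        for edge in candidate_edges:
            if edge in model:
                continue
            child = model | {edge}      # one-edge extension
            score = tetrad_score(child)
            if score > best_score:
                best, best_score = child, score
            if t_maxscore(child) >= best_score:
                heapq.heappush(frontier, (-t_maxscore(child), next(tie), child))
    return best, best_score
```

A depth bound, which Tetrad II imposes when search is too slow to be practical, could be added by tracking how many edges a model has gained and refusing to push children past the limit.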

[Figure 5: DAGs implying different sets of vanishing tetrad differences. (Diagrams over the measured variables X, Y, Z, and W; not reproduced here.)]

This process is repeated recursively on each model generated until no improvements can be made to any model. Search is guided by the t-maxscore, since models are expanded according to their t-maxscore. If M′ is the model generated by adding an edge to model M, and M′'s t-maxscore is less than the tetrad score of M, it is eliminated from the search. A proof of the following theorem is provided by Spirtes (1993):

If M′ is a subgraph of M, then the set of tetrad equations among variables of M′ that are linearly implied by M is a subset of those linearly implied by M′.

This theorem intuitively says that the addition of links to a graph will never cause more tetrad equations to be implied by the resulting graph. It then follows that if t-maxscore(M′) < tetrad-score(M), neither M′ nor any modification of M′ will have a tetrad score as high as M. The algorithm essentially uses a form of best-first search, and in cases where the search is too slow to be practical, the depth of search is limited to reduce the amount of time spent on search. The program returns a list of suggested additions of edges to the initial model which will yield the model found by the search to have the highest tetrad score.

7. CONCLUSION

We have described an algorithm for path analysis and first results from experiments with the algorithm. In a crude sense the algorithm "works"; that is, it generates causal models from data. However, preliminary results suggest that other methods of modification and model evaluation, as well as algorithm evaluation, need to be explored. We believe our use of the difference between estimated and empirical correlations, and of R2, as terms in our modification evaluation function serves to bring the search into the neighborhood of good model space, but it is obvious that R2 applied as a model evaluation function is less than ideal, because it drives the algorithm to generate regression models (i.e., "flat" models in which all the predictor variables point directly at the dependent variable and are intercorrelated). Followup experiments to those reported earlier confirm this: as R2 is given more weight in the modification evaluation and model evaluation, we see more regression models. Clearly our measures of structural similarity, the shared link score and shared path score, also can be improved. These problems must be addressed before we can establish that the algorithm works well.

At this early stage, we have not made any empirical comparisons between our algorithm and those of Pearl and Verma, and Spirtes and his colleagues. We believe that, unlike Pearl's algorithm, a lack of independencies will not negatively affect the results obtained by our algorithm; and unlike Tetrad II, ours does not require an initial model. However, the complexity of our algorithm is significantly greater than that of the others. Moreover, it performs more computation for each model because it estimates edge weights and predicted correlations. We are not convinced that this is a negative aspect of our algorithm, since we believe estimated correlations, and specifically the deviations between estimated and actual correlations, are important criteria in controlling search in our algorithm.

Our algorithm would probably be helped a lot if the modification and model evaluation functions were "monotonic" in a sense similar to Tetrad II, which eliminated models under particular conditions that guaranteed that no descendants of the eliminated models could be preferred to previous models. We need modification and model evaluation functions that will allow us to prune subtrees of the space of models. For example, it is easy to compute whether a modification will increase or decrease the difference between estimated and actual correlations; we might prune modifications that don't reduce disparities.

Causal modeling plays an important role in human understanding. As researchers, we feel it is not enough to be able to predict the behavior of computer systems; we must strive to discover the rules that dictate their behavior. We believe behavior results from the interaction of a system's architecture, environment, and task; all these components must be present in a causal model. These models are essential to establishing general theories concerning the types of behaviors that might occur in certain types of tasks, architectures, and environments. We have yet to answer many questions about factors that affect the performance of our algorithm. Still, we are encouraged by our results, and by the promise the algorithm holds for automatically finding causal models of systems. However ponderous it is, however difficult it is to evaluate, the fact remains that the algorithm generated causal models of a complex AI planning system, and it promises to be a valuable tool for those who seek to understand AI systems with statistical methods.

Acknowledgments

We would like to thank Glenn Shafer for his help with the development of these ideas.

This research was supported by DARPA-AFOSR contract F30602-91-C-0076. The United States Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation hereon.

References

Asher, H.B., 1983. Causal Modeling. Sage Publications, Newbury Park, CA.

Cohen, P.R. Empirical Methods for Artificial Intelligence. Forthcoming.

Cohen, P.R. & Hart, D.M., 1993. Path analysis models of an autonomous agent in a complex environment. To appear in Proceedings of the Fourth International Workshop on AI and Statistics. Ft. Lauderdale, FL. 185-189.

Cohen, P.R., Greenberg, M.L., Hart, D.M. & Howe, A.E., 1989. Trial by fire: Understanding the design requirements for agents in complex environments. AI Magazine, 10(3): 32-48.

Falkenhainer, B.C. & Michalski, R.S., 1986. Integrating quantitative and qualitative discovery: the ABACUS system. Machine Learning, 1(1): 367-401.

Glymour, C., Scheines, R., Spirtes, P. & Kelly, K., 1987. Discovering Causal Structure. Academic Press.

Hart, D.M. & Cohen, P.R., 1992. Predicting and explaining success and task duration in the Phoenix planner. Proceedings of the First International Conference on AI Planning Systems. Morgan Kaufmann. 106-115.

Langley, P., Simon, H.A., Bradshaw, G.L. & Zytkow, J.M., 1987. Scientific Discovery: Computational Explorations of the Creative Processes. The MIT Press.

Li, C.C., 1975. Path Analysis: A Primer. Boxwood Press.

Pearl, J. & Verma, T.S., 1991. A theory of inferred causation. Principles of Knowledge Representation and Reasoning: Proceedings of the Second International Conference, J. Allen, R. Fikes, & E. Sandewall (Eds.). Morgan Kaufmann. 441-452.

Pearl, J. & Wermuth, N., 1993. When can association graphs admit a causal interpretation? Preliminary Papers of the Fourth International Workshop on AI and Statistics. Ft. Lauderdale, FL. 141-150.

Schaffer, C., 1990. A proven domain-independent scientific function-finding algorithm. In Proceedings of the Eighth National Conference on Artificial Intelligence. The MIT Press. 828-833.

Spirtes, P., Glymour, C. & Scheines, R., 1993. Causation, Prediction and Search. Springer-Verlag.

Zytkow, J.M., Zhu, J. & Hussam, A., 1990. Automated discovery in a chemistry laboratory. In Proceedings of the Eighth National Conference on Artificial Intelligence. The MIT Press. 889-894.
