COMPUTING PRACTICES

Automated Debugging: Are We Close?
Despite increased automation in software engineering, debugging hasn’t changed much. A new algorithm promises to relieve programmers of the hit-or-miss approach to isolating a failure’s external root cause.
Andreas Zeller, Saarland University

For the past 50 years, software engineers have enjoyed tremendous productivity increases as more and more tasks have become automated. Unfortunately, debugging—the process of identifying and correcting a failure's root cause—seems to be the exception, remaining as labor-intensive and painful as it was five decades ago. An engineer or programmer still has to notice something, wonder why it happens, set up hypotheses, and then attempt to confirm or refute them.

In theory, there is no reason to continue this legacy. Debugging can be just as disciplined, systematic, and quantifiable as any other area of software engineering—which means that we should eventually be able to automate at least part of it. So far, research has concentrated on program analysis because of its roots in compiler construction. But analysis requires complete knowledge about the program being examined and does not scale well to large programs.

Testing is another way to gather knowledge about a program because it helps weed out the circumstances that aren't relevant to a particular failure. If testing reveals that only three of 25 user actions are relevant, for example, you can focus your search for the failure's root cause on the program parts associated with these three actions. If you can automate the search process, so much the better.

This is the premise of Delta Debugging, an algorithm that uses the results of automated testing to systematically narrow the set of failure-inducing circumstances.1 Programmers supply a test function for each bug and hardcode it into any imperative language. The test function checks a set of changes to determine if the failure is present or if the outcome is unresolved and feeds that information to the Delta Debugging code.

Programmers can either hardcode the algorithm, available from the Delta Debugging Web site (http://www.st.cs.uni-sb.de/dd/), or download and adapt the core of Wynot, a prototype debugger written in Python that runs on a Unix or Linux system. Both the algorithm and Wynot (short for "worked yesterday, not today") fit with any imperative language, and Wynot is platform independent.

Work on a "debugging server" is in progress. The idea is to have a programmer submit the program to be examined along with invocation details to a Web site, choose from a menu how to identify a failure, and click on "Submit." Wynot will then automatically isolate the failure-inducing circumstances and e-mail the results to the programmer. Plans are for the server to be operational by spring 2002. Eventually, users might be able to run their own debugging server—for example, within a company intranet or as a module in an automated test system.

HOW DELTA DEBUGGING WORKS

As Figure 1 shows, external failures typically come from program input, user interaction, and program changes. Within each of these categories are myriad circumstances, any one of which could be the root cause of the failure.

Delta Debugging requires a test to prove that each circumstance is really failure inducing. An automated testing environment typically provides such a test, but the number of test runs can be quite large.
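Concretely, the contract is small: the test function takes one configuration of circumstances and reports one of three outcomes. The following Python sketch (Python being the language Wynot itself is written in) is purely illustrative; the names and the toy failure condition are assumptions, not Wynot's actual interface:

    # Illustrative outcome values, matching the three verdicts above.
    PASS, FAIL, UNRESOLVED = "pass", "fail", "?"

    # Toy stand-in: suppose the failure occurs only when circumstances
    # 3 and 7 are both present.  A real test function would apply the
    # circumstances, run the program, and inspect the result.
    def test(circumstances):
        return FAIL if {3, 7} <= set(circumstances) else PASS

Given such a function, the Delta Debugging code can call it repeatedly on subsets of the circumstances and narrow down which ones are needed to make the failure appear.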
The trade-off is that, unlike conventional debugging, Delta Debugging always produces a set of relevant failure-inducing circumstances, which offer significant insights into the nature and cause of the failure. Also, if you know something about the structure of the failure-inducing circumstance, fewer tests are required.

How time-consuming it is to incorporate Delta Debugging in a testing environment depends on the circumstances. Program input is typically easy to access, modify, and evaluate. Adapting Delta Debugging to run a program on a different input should take only a matter of hours. Examining user-interaction history requires external tools that record and play back user interaction. Program changes are moderately easy to access and simple to modify, but altering configurations can be difficult, depending on the tools you use.

My colleagues and I used the Wynot prototype to debug Mozilla, Netscape's open-source Web browser project (http://www.mozilla.org/), and the GNU debugger. Specifically, we used Wynot to

• simplify HTML input that causes the Mozilla browser to fail—after 57 test runs, only one of 896 HTML lines remained;
• simplify Mozilla failure-inducing user interaction—after 82 test runs, we saw that only 3 of 95 user actions were relevant for the failure; and
• identify failure-inducing code changes—after 97 tests, we narrowed a set of 178,000 changed lines to one changed line in the GNU debugger that caused a failure.

Figure 1. How circumstances affect a program's behavior. A failure can stem from a range of circumstances, including program input, such as an HTML page that makes a browser fail, user interaction that makes a program crash, or changes a programmer makes after the program fails a regression test. In the figure, ✔ marks a passing test, ✘ a failing test, and ? an unresolved outcome.

SIMPLIFYING PROGRAM INPUT

Mozilla is a work in progress with a wide audience (Netscape 6 uses a Mozilla variant), and Mozilla engineers receive several dozen bug reports a day. Their first step in processing a bug report is to simplify it—eliminate all details that are irrelevant to the failure. A simplified bug report makes debugging easier; it can also replace several other reports that differ only in irrelevant details.

In July 1999, Bugzilla, the Mozilla bug database, listed more than 370 open, unsimplified bug reports, and the queue was growing. Seeing that Mozilla engineers "faced imminent doom," Eric Krock, the Netscape product manager, sent out the Mozilla BugAThon, a call for volunteers to help simplify bug reports.2 Each volunteer was to turn a bug report into a set of minimal test cases, in which every input would be significant in reproducing the failure. For every five simplifications, a volunteer would get an invitation to the Mozilla launch party; 20 simplifications would earn the volunteer a T-shirt signed by grateful engineers.

The following was Bugzilla entry #24735:

Ok the following operations cause mozilla to crash consistently on my machine
→ Start mozilla
→ Go to bugzilla.mozilla.org
→ Select search for bug
→ Print to file setting the bottom and right margins to .50 (I use the file /var/tmp/netscape.ps)
→ Once it's done printing do the exact same thing again on the same file (/var/tmp/netscape.ps)
→ This causes the browser to crash with a segfault

To simplify bug reports, BugAThon volunteers were supposed to load the Bugzilla Web page (http://bugzilla.mozilla.org/) into their text editors and then follow the BugAThon instructions for simplifying Mozilla test cases:

Start removing HTML markup, CSS rules, and lines of JavaScript from the page. Start by removing the parts of the page that seem unrelated to the bug. Every few minutes, check the page to make sure it still reproduces the bug. [. . .] When you've cut away as much HTML, CSS, and JavaScript as you can, and cutting away any more causes the bug to disappear, you're done.2

This volunteer most likely carried out the process manually, but there is no need to—especially in an environment designed for automation.
If you have an automated test that tells whether or not the failure is still present, you can easily automate simplification. Setting up an automated test that checks whether printing a specific HTML page works is not very difficult. All you need is a record-and-replay facility for user interaction.

One approach to automating simplification is to write a program that removes single characters from the input. After each removal, the program would check whether or not the failure still occurs, using the replay tool to automate execution. It would repeat the process until every remaining character is significant. In the Mozilla case, Wynot cut away large chunks of input first and narrowed the cuts only when larger ones failed: after 57 test runs, only one of the bug report's 896 HTML lines remained, for a total of about 21 minutes. Figure 2 shows how the algorithm worked.

Figure 2. Simplifying a bug in Mozilla's HTML input. The Wynot prototype starts cutting large chunks of HTML input (many characters at a time) and gradually narrows the cut to just a few characters. By line 26, only [. . .]

SIMPLIFYING USER INTERACTIONS

Having simplified the HTML input, we were ready to look at other details in the Mozilla bug report. Was "setting the bottom and right margins to .50" really necessary, for example? In a separate test from the HTML input test, we used a tool that can record and play back events for the Unix/Linux X window system and subjected the log of recorded user interactions to Wynot to find the minimum set of failure-inducing user actions. After 82 test runs, only 3 of the 95 recorded user actions proved relevant for the failure.
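Both experiments, trimming the HTML page and trimming the log of user actions, can be driven by the same reduction loop over a list of elements (characters, lines, or recorded events). The Python sketch below is a simplified variant in the spirit of the published simplification algorithm.3 Wynot's actual implementation is more elaborate:

    def simplify(elements, test):
        """Greedily remove chunks of elements while the failure persists.
        `test` returns "fail" whenever its argument reproduces the failure."""
        n = 2                                       # number of chunks to try
        while len(elements) >= 2:
            chunk = max(len(elements) // n, 1)
            reduced = False
            for start in range(0, len(elements), chunk):
                candidate = elements[:start] + elements[start + chunk:]
                if candidate and test(candidate) == "fail":
                    elements = candidate            # chunk was irrelevant: drop it
                    n = max(n - 1, 2)
                    reduced = True
                    break
            if not reduced:
                if chunk == 1:                      # every single element matters
                    break
                n = min(n * 2, len(elements))       # retry with smaller chunks
        return elements

Calling simplify() with the lines of the HTML page and the page test sketched earlier mirrors the first experiment; calling it with the recorded X events and a test that replays them mirrors the second.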
FINDING FAILURE-INDUCING CODE CHANGES

The most important source of failure-inducing circumstances is, of course, the code itself. I've shown how Delta Debugging works to isolate failure-inducing program input. In a sense, program code itself is input—input to some machine executing the program. So at least in theory, you could apply Delta Debugging to program code as well.

Our test case was DDD, a graphical front end for the GNU debugger (GDB): DDD worked with GDB 4.16 but failed with GDB 4.17. This is a classical instance of the "worked yesterday, not today" problem: Some change (or difference) causes a failure. In this case, the change is huge: The difference between the source codes of GDB 4.16 and GDB 4.17 is 178,000 lines—178,000 lines have been either added, deleted, or changed between the two releases. (If you consider that the GDB source code itself has about
500,000 lines, and that about a third of the code has been changed, you might wonder why everything else still works.) Somewhere within these 178,000 lines lie the changes that caused DDD to fail—but where?

We used Delta Debugging to isolate this cause. First, we decomposed the 178,000-line diff into 8,721 textual changes in the GDB source, with any two textual changes separated by a context of at least two unchanged lines. We then let Wynot isolate the failure-inducing changes. Each test applied a subset of changes to the GDB code, rebuilt GDB, and ran an automated test that verified whether or not the failure was still present.

Although we continued to use the isolation approach to reduce the differences between the two original versions, we faced a new problem. Wynot had no knowledge about the semantics of the changes and combined them more or less arbitrarily. This caused tests to turn out unresolved because the changes would not integrate, the program would not build, and so on—in short, there was nothing to test.

First, we optimized the algorithm to handle unresolved outcomes: By applying smaller and smaller sets of changes, we decreased the risk of conflicting changes. Second, rather than grouping changes arbitrarily, we grouped changes by their scope, that is, the directory, file, or function in which they were applied. We assumed that a set of changes was less likely to be in conflict if they all applied to the same scope.

Figure 4 shows the result of these Delta Debugging runs and the effect of the optimizations. Starting with 8,721 changes, Wynot required between 288 and 97 tests—the more optimizations applied, the fewer tests required. On a 400-MHz Linux PC, each test took about 240 seconds to apply the changes and rebuild and run GDB; 97 tests thus required about 6.5 hours.

Figure 4. Delta Debugging (DD) optimizations. Minimizing the set of changes between GDB 4.16 and GDB 4.17 isolates the failure-inducing change after 288 tests. Narrowing the difference between the two versions finds the same change after 187 tests. Grouping changes according to scope reduces the tests to 97. (The figure plots the changes left against the tests executed for three runs of the GDB Delta Debugging log: with the DD min algorithm, with the DD algorithm, and with the DD algorithm plus scope information.)

In all three cases, Wynot reduced the 178,000 lines to the same one-line change that caused DDD to malfunction:

diff -r gdb-4.16/gdb/infcmd.c gdb-4.17/gdb/infcmd.c
1239c1278
< "Set arguments to give program being debugged when it is started. \n\
---
> "Set argument list to give program being debugged when it is started. \n\

This change in a string constant from "arguments" to "argument list" was responsible for GDB 4.17 not interoperating with DDD. Given the command "show arguments," GDB 4.16 gives a reply that is obviously constructed from the changed string constant:

Arguments to give program being debugged when it is started is "a b c"

GDB 4.17, however, issues a slightly different (and grammatically correct) text,

Argument list to give program being debugged when it is started is "a b c"

which DDD could not parse. To solve the problem, we simply reversed the GDB change; eventually, we upgraded DDD to make it work with the new GDB version.

We have applied Delta Debugging to find failure-inducing changes in other projects as well. However, in most of these projects, we had a version repository, which lets us group changes by their creation date. With such knowledge, the risk of inconsistencies decreases dramatically. In one case,6 Delta Debugging reduced 344 ordered changes from the DDD version repository to a single failure-inducing change—in only 12 tests.
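To make the setup concrete, here is a plausible Python sketch of a test function for change sets, together with the scope grouping described above. It is an illustration, not Wynot's actual code: run-regression-test stands in for the automated DDD test, and anything that prevents a verdict counts as unresolved rather than as a pass.

    import shutil
    import subprocess
    from collections import defaultdict

    def test_changes(changes, pristine_dir="gdb-4.16", work_dir="gdb-work"):
        """Apply a subset of textual changes (unified-diff hunks) to a
        pristine source tree, rebuild, and classify the outcome."""
        shutil.rmtree(work_dir, ignore_errors=True)
        shutil.copytree(pristine_dir, work_dir)       # start from scratch
        for hunk in changes:
            patch = subprocess.run(["patch", "-p1", "-d", work_dir],
                                   input=hunk.encode())
            if patch.returncode != 0:
                return "?"                            # conflict: nothing to test
        if subprocess.run(["make", "-C", work_dir]).returncode != 0:
            return "?"                                # build failure: unresolved
        result = subprocess.run(["./run-regression-test", work_dir])
        return "fail" if result.returncode != 0 else "pass"

    def group_by_scope(changes, scope_of):
        """Group changes by scope (directory, file, or function), assuming
        same-scope changes are less likely to conflict with one another."""
        groups = defaultdict(list)
        for change in changes:
            groups[scope_of(change)].append(change)
        return list(groups.values())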
Although Delta Debugging streamlines the process of isolating failure-inducing input and failure-inducing code changes, programmers must still follow the causality chain and decide where to break it. A failure typically does not stem from only one circumstance. Breaking the chain for one failure is easy, but breaking it to eliminate as many failures as possible is challenging.

We are currently applying Delta Debugging to examine internal program states, which isolates even more elements of the causality chain, and first results are promising.7 Future combinations of automated testing and program analysis may even assist in properly breaking the causality chain as well.

Also, program input is only one of many circumstances that affect a program's behavior. The program's environment, operating system, and runtime library may also be relevant. We are exploring Delta Debugging for circumstances like the following:

• environment settings (which part of the environment is relevant?),
• system libraries (after an upgrade, I am stuck in a DLL misconfiguration—why?), and
• thread schedules in parallel programs (which part of the schedule causes a nondeterministic behavior?).

As we discover more about the structure of these circumstances and the resulting causality chain, we come closer to passing much of the boredom and monotony of debugging onto machines. Eventually, debugging may become as automated as testing—not only detecting failures, but also revealing how they came to be. ✸

Acknowledgments
Steven Barlow, Holger Cleve, Kerstin Reese, Torsten Robschink, Gregor Snelting, Falk Schreiber, and the Computer reviewers provided helpful comments on earlier versions of this article. This work was conducted at the University of Passau, Germany, and funded by grant Sn 11/8-1 from Deutsche Forschungsgemeinschaft.

References
1. A. Zeller, "Simplifying and Isolating Failure-Inducing Input," to appear in IEEE Trans. Software Eng., http://www.st.cs.uni-sb.de/dd/ (current Oct. 2001).
2. "The Gecko BugAThon," Mozilla, http://www.mozilla.org/newlayout/bugathon.html (current Oct. 2001).
3. R. Hildebrandt and A. Zeller, "Simplifying Failure-Inducing Input," Proc. ACM SIGSOFT Int'l Symp. Software Testing and Analysis (ISSTA), ACM Press, New York, 2000, pp. 135-145.
4. F. Tip, "A Survey of Program Slicing Techniques," J. Programming Languages, Sept. 1995, pp. 121-189.
5. M. Weiser, "Program Slicing," IEEE Trans. Software Eng., July 1984, pp. 352-357.
6. A. Zeller, "Yesterday, My Program Worked. Today, It Does Not. Why?" Proc. Joint 7th European Software Eng. Conf. and ACM SIGSOFT Symp. Foundations of Software Eng., O. Nierstrasz and M. Lemoine, eds., vol. 1687, Lecture Notes in Computer Science, Springer-Verlag, Berlin, 1999, pp. 253-267.
7. A. Zeller, "Isolating Cause-Effect Chains from Computer Programs," http://www.st.cs.uni-sb.de/dd/ (current Oct. 2001).

Andreas Zeller is a software engineering professor at Saarland University in Germany. His research interests include dynamic program analysis, configuration management, and program visualization. Zeller received a PhD in computer science from the Technical University of Braunschweig, Germany. He is a member of the IEEE Computer Society and the ACM. Contact him at [email protected].