Debugging Concurrent Programs

Debugging Concurrent Programs CHARLES E. MCDOWELL and DAVID P. HELMBOLD Board of Studies in Computer and Information Sciences, University of California at Santa Cruz, Santa Cruz, California 95064 The main problems associated with debugging concurrent programs are increased complexity, the “probe effect,” nonrepeatability, and the lack of a synchronized global clock. The probe effect refers to the fact that any attempt to observe the behavior of a distributed system may change the behavior of that system. For some parallel programs, different executions with the same data will result in different results even without any attempt to observe the behavior. Even when the behavior can be observed, in many systems the lack of a synchronized global clock makes the results of the observation difficult to interpret. This paper discusses these and other problems related to debugging concurrent programs and presents a survey of current techniques used in debugging concurrent programs. Systems using three general techniques are described: traditional or breakpoint style debuggers, event monitoring systems, and static analysis systems. In addition, techniques for limiting, organizing, and displaying a large amount of data produced by the debugging systems are discussed. Categories and Subject Descriptors: A.1 [General Literature]: Introductory and Survey; D.1.3 [Programming Techniques]: Concurrent Programming; D.2.4 [Software Engineering]: Program Verification-assertion checkers; D.2.5 [Software Engineering]: Testing and Debugging-debugging aids; diagnostics; monitors; symbolic execution; tracing Additional Key Words and Phrases: Distributed computing, event history, nondeterminism, parallel processing, probe-effect, program replay, program visualization, static analysis INTRODUCTION programs even harder than debugging sequential programs. In the remainder of this The interest in parallel programming has section we will justify this claim and outline grown dramatically in recent years. New the basic approaches currently used for de- languages, such as Ada’ and Modula II, bugging parallel programs. In Sections l-4 have built-in features for concurrency. we discuss each of these approaches in de- Older languages, such as C and FORTRAN, tail. We conclude with Section 5 and an have been extended in a variety of ways in appendix with tables that summarize the order to support parallel programming features of 35 systems designed for debug- [Gehani and Roome 1985; Karp 19871. ging parallel programs. The added complexity of expressing concurrency has made debugging parallel Difficulty Debugging Concurrent Programs ‘Ada is a registered trademark of the U.S. Govern- The classic approach to debugging sequen- ment (Ada Joint Program Office). tial programs involves repeatedly stopping This work was supported in part by IBM grants SL87033 and SL88096. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. 0 1989 ACM 0360-0300/89/1200-0593 $01.50 ACM Computing Surveys, Vol. 21, No. 4, December 1989 594 l C. E. McDowell and D. P. Helmbold the program during execution, examining existent [Lamport 19781. Without a syn- the state, and then either continuing or chronized global clock, it may be difficult reexecuting in order to stop at an earlier to determine the precise order of events point in the execution. This style of debug- occurring in distinct, concurrently execut- ging is called cyclical debugging. Unfortu- ing processors. nately, parallel programs do not always have reproducible behavior. Even when Basic Approaches they are run with the same inputs, their Some researchers distinguish between results can be radically different. These monitoring and traditional debugging differences are caused by races, which occur [Joyce et al. 19871. Monitoring is the pro- whenever two activities are allowed to process of gathering information about a progress in parallel. For example, one process gram’s execution. Debugging, as defined in may attempt to write a memory location the current ANSI/IEEE standard glossary while a second process is reading from that of software engineering terms, is “the pro- memory cell. The second process’s behavior cess of locating, analyzing, and correcting may differ radically, depending on whether suspected faults,” where a fault is defined its reads the new or old value. to be an accidental condition that causes a The cyclical debugging approach often program to fail to perform its required func- fails for parallel programs because the un- tion. Since monitoring is often an effective desirable behavior may not appear when procedure for locating incorrect behavior, the program is reexecuted. If the undesira- it should be considered a debugging tool. ble behavior occurs with very low probabil- For the purposes of this survey, tech- ity, the programmer may never be able to niques for debugging concurrent systems recreate the error situation. In fact, any have been organized into four groups: attempt to gain more information about the program may contribute to the difficulty of (1) Traditional debugging techniques can reproducing the erroneous behavior. This be applied with some success to parallel has been referred to as the “Heisenberg programs. These are discussed in Sec- Uncertainty” principle applied to software tion 1. [LeDoux and Parker 19851 or the “Probe (2) Event-based debuggers view the exe- Effect” [Gait 19851. For programs that con- cution of a parallel program as a se- tain races, any additional print or debug- quence (or several parallel sequences) ging statements may modify a crucial race, of events. The generation and analysis lowering the probability that the interest- of these sequences or event histories is ing behavior occurs. This interference can the subject of Section 2. be disastrous when attempting to diagnose (3) Techniques for displaying the flow of an error in a parallel program. control and distributed data associated The nondeterminism arising from races with parallel programs are presented in is particularly difficult to deal with because Section 3. the programmer often has little or no con- (4) Static analysis techniques based on trol over it. The resolution of a race may dataflow analysis of parallel programs depend on each CPU’s load, the amount of are presented in Section 4. These tech- network traffic, and nondeterminism in the niques allow some program errors to communication medium (e.g., exponential be detected without executing the backoff protocols [Tannenbaum 1981, pp. program. 292-2951). It is this nondeterministic behavior that tends to make understand- This survey covers a large number of ing, writing, and debugging parallel pro- research and commercial projects designed grams more difficult than their sequential to help produce error-free concurrent soft- counterparts. ware. It focuses primarily on systems that An additional problem found in distrib- are directed toward isolating program er- uted systems is that the concept of “global rors. A large body of work in formal pro- state” can be misleading or even non- gram verification and in program testing ACM Computing Surveys, Vol. 21, No. 4, December 1989 Debugging Concurrent Program 595 has been explicitly excluded from this sur- They have the potential of identifying a vey. Most of the systems surveyed fall into large class of program errors that are par- one of two general categories, traditional ticularly difficult to find using current dy- parallel debuggers (or what are sometimes namic techniques. These techniques have called “breakpoint” debuggers) and event- been applied mostly to parallel versions of based debuggers. Of course, some systems FORTRAN that do not support recursion. contain aspects of both classes. All of the As with the event-based debuggers, static systems (or in some cases proposed sys- analysis systems are still in the prototype tems) in these two general categories are stage. The primary problem with most listed in the tables in Appendix A. static analysis algorithms is that their In addition to traditional parallel debug- worst-case computational complexity is gers and event-based debuggers, some static often exponential. analysis systems are included. The static All three types of debugging systems have analysis systems surveyed fall somewhere made some progress in presenting the com- between debugging and testing. The static plex concurrent program state and the ac- analysis systems are distinguished from companying massive amounts of data to testing by not requiring program execution the user. Multiple windows is a useful and by generally checking for structural mechanism for interfacing with traditional faults instead of functional faults. That is, style debuggers for parallel systems. The the analysis tools have no knowledge of the abstraction capabilities of event-based de- intended function of the program and sim- buggers (see Section 2) have been used to ply identify program structures that are present interesting and potentially useful generally indicative of an error. These views of system states graphically [Hough systems do not appear in the comparison and Cuny 19871. table in Appendix A but are discussed in Section 4. 1. EXTENDING TRADITIONAL

Load more