construction Editors: Andy Hunt and Dave Thomas The Pragmatic Programmers www.pragmaticprogrammer.com

Software Archaeology

Andy Hunt and Dave Thomas

his isn’t programming, this is archaeol- This analogy is such a compelling and po- ogy!” the programmer complained, tentially useful one that Dave, Andy, Brian wading through the ancient rubble of Marick, and held a some particularly crufty piece of code. workshop on Software Archaeology at (One of our favorite jargon words: OOPSLA 2001 (the annual ACM Confer- Twww.tuxedo.org/~esr/jargon/html/entry/ ence on Object-Oriented Programming, Sys- crufty.html.) It’s a pretty good analogy, actu- tems, Languages, and Applications). The ally. In real archaeology, you’re investigating participants discussed common problems of trying to understand someone else’s code and shared helpful techniques and tips (see the “Tools and Techniques” sidebar).

Roll up your sleeves What can you do when someone dumps 250k lines of source code on your desk and simply says, “Fix this”? Take your first cue from real archaeologists and inventory the site: make sure you actually have all the source code needed to build the system. Next, you must make sure the site is secure. On a real dig, you might need to shore up the some situation, trying to understand what site with plywood and braces to ensure it does- you’re looking at and how it all fits together. n’t cave in on you. We have some equivalent To do this, you must be careful to preserve safety measures: make sure the version control the artifacts you find and respect and under- system is stable and accurate (CVS is a popu- stand the cultural forces that produced them. lar choice; see www.cvshome.org). Verify that But we don’t have to wait a thousand the procedures used to build the software are years to try to comprehend unfathomable ar- complete, reliable, and repeatable (see the Jan- tifacts; code becomes legacy code just about uary/February issue’s column for more on this as soon as it’s written, and suddenly we have topic). exactly the same issues as the archaeologists: Be aware of build dependency issues: in What are we looking at? How does it fit in many cases, unless you build from scratch, with the rest of the world? And what were you’re never really sure of the results. If they thinking? It seems we’re always in the you’re faced with build time measured in position of reading someone else’s code: ei- hours, with multiple platforms, then the in- ther as part of a code review, or trying to cus- vestment in accurate dependency manage- tomize a piece of open source software, or ment might be a necessity, not a luxury. fixing a bug in code that we’ve inherited. Draw a map as you begin exploring the

22 IEEE SOFTWARE March/April 2002 0740-7459/02/$17.00 © 2002 IEEE SOFTWARE CONSTRUCTION code. (Remember playing Colossal insight into the changes that were re- editing the code directly (AspectJ for Cave? You are in a maze of twisty lit- quired over the years. Java is available at www.aspectj.org). tle passages, all alike….) Keep de- Of course, unless you can prove oth- For instance, suppose you want to tailed notes as you discover priceless erwise, there’s no guarantee that the generate a trace log of every database artifacts and suspicious trapdoors. routine you’re examining is even being call in the system. Using something UML diagrams might be handy (on called. How much of the source con- like Aspect-J, you could specify what paper—don’t get distracted by a tains code put in for a future that never constitutes a database call (such as fancy CASE tool unless you’re al- arrived? Static analysis of the code can every method named “db*’” in a ready proficient), but so too are sim- prove whether a routine is being used in particular directory) and specify the ple notes. If there are more than one most languages. Some IDEs can help code to insert. of you on the project, consider using with this task, or you can write ad hoc Be careful, though. Introducing a Wiki or similar tool to share your tools in your favorite scripting lan- any extra code this way might pro- notes (you can find the original Wiki guage. As always, you should prove as- duce a “Heisenbug,” a bug intro- at www.c2.com/cgi/wiki?WikiWiki- sumptions you make about the code. In duced by the act of debugging. One Web and a popular implementation this case, adding specific unit tests helps solution to deal with this issue is to atwww.usemod.com/cgi-bin/wiki.pl. prove—and continue to prove—what a build in the instrumentation in the As you look for specific keywords, routine is doing (see www.junit.org for first place, when the original devel- routine names, and such, use the Java and www.xprogramming.org for opers are first building and testing search capabilities in your integrated other languages). the software. Of course, this brings development environment (IDE), the By now, you’ve probably started its own set of risks. One of the par- Tags feature in some editors, or tools to understand some of the terminol- ticipants described an ancient but such as Grep from the command line. ogy that the original developers used. still used mainframe program that For larger projects, you’ll need larger Wouldn’t it be great to stumble only works if the statements tools: you can use indexing engines across a Rosetta stone for your pro- are left in. such as Glimpse or SWISH++ (simple ject that would help you translate its Whether diagnostic tracing and Web indexing system for humans) to vocabulary? If there isn’t one, you instrumentation are added originally index a large source code base for can start a glossary yourself as part or introduced later via aspects, you fast searching. of your note-taking. One of the first might want to pay attention to what things you might uncover is that you are adding, and where. For in- The mummy’s curse there are discrepancies in the mean- stance, say you want to add code that Many ancient tombs were ru- ing of terms from different sources. records the start of a transaction. If mored to be cursed. In the software Which version does the code use? you find yourself doing that in 17 world, the incantation for many of places, this might indicate a struc- these curses starts with “we’ll fix it Duck blinds and aerial views tural problem with the code—and a later.” Later never comes for the In some cases, you want to ob- potential answer to the problem original developers, and we’re left serve the dynamics of the running you’re trying to solve. with the curse. (Of course, we never system without ravaging the source Instead of hiding in a “duck put things off, do we?) code. One excellent idea from the blind” and getting the view on the Another form of curse is found in workshop was to use aspects to sys- ground, you might want to consider misleading or incorrect names and tematically introduce tracing state- an aerial view of the site. Synoptic, comments that help us misunder- ments into the code base without plotting, and visualization tools pro- stand the code we’re reading. It’s vide quick, high-level summaries that dangerous to assume that the code or might visually indicate an anomaly in comments are being completely the code’s static structure, in the dy- truthful. Just because a routine is namic trace of its execution, or in the named readSystem is no guarantee data it handles. For instance, Ward that it isn’t writing a megabyte of Cunningham’s Signature Survey data to the disk. Instead of hiding in a method (http://c2.com/doc/Signa- Programmers rarely use this sort “duck blind” and getting tureSurvey) reduces each source file of cognitive dissonance on purpose; the view on the ground, to a single line of the punctuation. it’s usually a result of historical acci- It’s a surprisingly powerful way of dent. But that can also be a valuable you might want to seeing a file’s structure. You can also clue: how did the code get this way, consider an aerial view use visualization tools to plot data and why? Digging beneath these lay- from the volumes of tracing informa- ers of gunk, cruft, and patch upon of the site. tion languishing in text files. patch, you might still be able to see As with real archaeology, it pays to the original system’s shape and gain be meticulous. Maintain a deliberate

March/April 2002 IEEE SOFTWARE 23 SOFTWARE CONSTRUCTION

ately identified, and the build Tools and Techniques should be automatic and reliable. Leave a Rosetta stone. The project The workshop identified these analysis tools and techniques: glossary was useful for you as you learned the domain jargon; it will Scripting languages for be doubly useful for those who –ad hoc programs to build static reports (included by and so on) come after you. –filtering diagnostic output Make a simple, high-level treasure Ongoing documentation in basic HTML pages or Wikis map. Honor the “DRY” principle: Synoptic signature analysis, statistical analysis, and visualization tools Don’t duplicate information that’s Reverse-engineering tools such as Together’s ControlCenter in the code in comments or in a Operating-system-level tracing via truss and strace design document. Comments in Web search engines and tools to search for keywords in source files the code explain “why,” the code IDE file browsing to flatten out deep directory hierarchies of the source itself shows “how,” and the map Test harnesses such as Junit and CPPUnit shows where the landscape’s main API documentation generation using Javadoc, doxygen, and so on features are located, how they re- late to each other, and where to find more detailed information. Participants also identified these invasive tools: Build in instrumentation, tracing, and visualization hooks where ap- Hand-inserted trace statements plicable. This could be as simple Built-in diagnostic instrumentation, enabled in production code as well as tracing “got here” messages or Instrumentation to log history of data values at interface calls as intricate as an embedded HTTP Use of AspectJ to introduce otherwise invasive changes safely server that displays the applica- tion’s current status (our book, The Pragmatic Programmer (Ad- pace; keep careful records. Even for a Also, consider to which “school of dison-Wesley, 2000) discusses short term, quick patch that doesn’t programming’” the authors be- building testable code). require you to understand the whole longed. Regardless of implementa- Use consistent naming conven- code base, keep careful records of tion language, West-Coast Smalltalk- tions to facilitate automatic static what you’ve learned, what you’ve ers will write in a different style from code analysis and search tools. It tried, what worked, and what didn’t. European Modula programmers, for helps us humans, too. instance. In a way, this approach gets No evil spells. You know what the What were they thinking? us back to the idea of treating code as incantations sound like. Don’t let Archaeologists generally don’t literature. What was the house style? the mummy’s curse come back to make wisecracks about how stupid a By understanding what the devel- haunt programmers later. The particular culture was (even if they opers were thinking, what influenced longer a curse festers and grows, did throw dead bodies into the only them, what techniques they were the worse it is when it strikes— good drinking well). In our industry, fond of, and which ones they were and curses often backfire against we generally don’t show such re- unaware of, you will be much better their creators first. straint. But it’s important when read- positioned to fully understand the ing code to realize that apparently code they produced and take it on as bone-headed decisions that appear to your own. Acknowledgments be straight out of a “Dilbert” cartoon Many thanks to the other workshop orga- seemed perfectly reasonable to the de- Leaving a legacy nizers, Brian Marick and Ward Cunningham, and to the attendees: Ken Anderson, Vladimir velopers at the time. Understanding Given that today’s polished code Degen, Chet Hendrickson, Michael Hewner, “what they were thinking” is critical will inevitably become the subject of Kevin Johnson, Norm Kerth, Dominik to understanding how and why they some future developer’s archaeologi- Kuropka, Dragos Manolescu, John McIntosh, wrote the code the way they did. cal dig, what can we do to help Walter Risi, Andy Schneider, Glenn Vander- If you discover they misunderstood them? How can we help them com- burg, and Charles Weir. something, you’ll likely find that mis- prehend “what we were thinking” take in more than one place. But rather and work with our code? Andy Hunt and Dave Thomas are partners in The Pragmatic Programmers, LLC. They feel that software consultants who can’t than simply “flipping the bozo bit” on program shouldn’t be consulting, so they keep current by devel- the original authors, try to evaluate Ensure the site is secure when you oping complex software systems for their clients. They also offer their strengths as well as weaknesses. leave. Every file related to the pro- training in modern development techniques to programmers and their management. They are co-authors of The Pragmatic Pro- You might find lost treasure—buried ject should be under version con- grammer and Programming Ruby, both from Addison-Wesley. domain expertise that’s been forgotten. trol, releases should be appropri- Contact them via www.pragmaticprogrammer.com.

24 IEEE SOFTWARE March/April 2002