<<

Introduction to the Debugging Format Michael J. Eager, Eager Consulting February, 2007

It would be wonderful if we could write quence of simple operations, registers, difficult time connecting its manipulations programs that were guaranteed to work memory addresses, and binary values of low­level code to the original source correctly and never needed to be debugged. which the processor actually understands. which generated it. Until that halcyon day, the normal pro­ After all, the processor really doesn't care The second challenge is how to describe gramming cycle is going to involve writing whether you used object oriented program­ the program and its relationship a program, compiling it, executing it, and ming, templates, or smart pointers; it only to the original source with enough detail to then the (somewhat) dreaded scourge of understands a very simple set of operations allow a debugger to provide the program­ debugging it. And then repeat until the pro­ on a limited number of registers and mem­ mer useful information. At the same time, works as expected. ory locations containing binary values. the description has to be concise enough so It is possible to debug programs by in­ As a reads and parses the that it does not take up an extreme amount serting code that prints values of various in­ source of a program, it collects a variety of of space or require significant processor teresting variables. Indeed, in some situa­ information about the program, such as the time to interpret. This is where the DWARF tions, such as debugging kernel drivers, this line numbers where a variable or function Debugging Format comes in: it is a compact may be the preferred method. There are is declared or used. Semantic analysis ex­ representation of the relationship between low­level debuggers that allow you to step tends this information to fill in details such the executable program and the source in a through the executable program, instruc­ as the types of variables and arguments of way that is reasonably efficient for a debug­ tion by instruction, displaying registers and functions. Optimizations may move parts of ger to process. memory contents in binary. the program around, combine similar pieces, expand inline functions, or remove But it is much easier to use a source­lev­ parts which are unneeded. Finally, code The Debugging el debugger which allows you to step generation takes this internal representa­ Process through a program's source, set break­ tion of the program and generates the actu­ points, print variable values, and perhaps a hen a programmer runs a program al machine instructions. Often, there is an­ few other functions such as allowing you to under a debugger, there are some other pass over the machine code to per­ W call a function in your program while in the common operations which he or she may form what are called "peephole" optimiza­ debugger. The problem is how to coordi­ want to do. The most common of these are tions that may further rearrange or modify nate two completely different programs, setting a breakpoint to stop the debugger at the code, for example, to eliminate dupli­ the compiler and the debugger, so that the a particular point in the source, either by cate instructions. program can be debugged. specifying the line number or a function All­in­all, the compiler's task is to take name. When this breakpoint is hit, then the well­crafted and understandable source the programmer usually would like to dis­ Translating from code and convert it into efficient but essen­ play the values of local or global variables, Source to tially unintelligible machine language. The or the arguments to the function. Display­ better the compiler achieves the goal of cre­ ing the call stack lets the programmer know Executable ating tight and fast code, the more likely it how the program arrived at the breakpoint he process of compiling a program is that the result will be difficult to under­ in cases where there are multiple execution Tfrom human­readable form into the bi­ stand. paths. After reviewing this information, the nary that a processor executes is quite com­ programmer can ask the debugger to con­ plex, but it essentially involves successively During this translation process, the tinue execution of the program under test. recasting the source into simpler and sim­ compiler collects information about the There are a number of additional opera­ pler forms, discarding information at each program which will be useful later when tions that are useful in debugging. For ex­ step until, eventually, the result is the se­ the program is debugged. There are two challenges in doing this well. The first is ample, it may be helpful to be able to step through a program line by line, either en­ Michael Eager is Principal Consultant at that in the later parts of this process, it may tering or stepping over called functions. Eager Consulting (www.eagercon.com), be difficult for the compiler to relate the Setting a breakpoint at every instance of a specializing in development tools for changes it is making to the program to the template or inline function can be impor­ embedded systems. He was a member original that the programmer tant for debugging ++ programs. It can of PLSIG's DWARF standardization com­ wrote. For example, the peephole optimizer be helpful to stop just before the end of a mittee and has been Chair of the may remove an instruction because it was function so that the return value can be dis­ DWARF Standards Committee since able to switch around the order of a test in played or changed. Sometimes the pro­ 1999. Michael can be contacted at code that was generated by an inline func­ grammer may want to bypass execution of [email protected]. tion in the instantiation of a C++ template. a function, returning a known value instead © Eager Consulting, 2006, 2007 By the time it gets its metaphorical hands on the program, the optimizer may have a of what the function would have (possibly Sun extensions. Nonetheless, stabs is still A Brief History of incorrectly) computed. widely used. DWARF2 There are also data related operations COFF stands for Common that are useful. For example, displaying Format and originated with System V the type of a variable can avoid having to Release 3. Rudimentary debugging infor­ DWARF 1 ─ Unix SVR4 sdb look up the type in the source files. Dis­ mation was defined with the COFF format, and PLSIG playing the value of a variable in different but since COFF includes support for named WARF originated with the C compiler formats, or displaying a memory or register sections, a variety of different debugging Dand sdb debugger in Unix System V in a specified format is helpful. formats such as stabs have been used with Release 4 (SVR4) developed by Bell Labs in COFF. The most significant problem with There are some operations which might the mid­1980s. The Programming Lan­ COFF is that despite the Common in its be called advanced debugging functions: guages Special Interest Group (PLSIG), part name, it isn’t the same in each architecture for example, being able to debug multi­ of Unix International (UI), documented the which uses the format. There are many threaded programs or programs stored in DWARF generated by SVR4 as DWARF 1 in variations in COFF, including XCOFF (used write­only memory. One might want a de­ 1989. Although the original DWARF had on IBM RS/6000), ECOFF (used on MIPS bugger (or some other program analysis several clear shortcomings, most notably and Alpha), and the Windows PE­COFF. tool) to keep track of whether certain sec­ that it was not very compact, the PLSIG de­ Documentation of these variants is avail­ tions of code had been executed or not. cided to standardize the SVR4 format with able to varying degrees but neither the ob­ Some debuggers allow the programmer to only minimal modification. It was widely ject module format nor the debugging in­ call functions in the program being tested. adopted within the embedded sector where formation is standardized. In the not­so­distant past, debugging pro­ it continues to be used today, especially for grams that had been optimized would have PE­COFF is the object module format small processors. been considered an advanced feature. used by Microsoft Windows beginning with Windows 95. It is based on the COFF for­ DWARF 2 ─ PLSIG The task of a debugger is to provide the mat and contains both COFF debugging programmer with a view of the executing he PLSIG continued on to develop and data and Microsoft’s own proprietary Code­ program in as natural and understandable document extensions to DWARF to ad­ View or CV4 debugging data format. Docu­ T fashion as possible, while permitting a wide dress several issues, the most important of mentation on the debugging format is both range of control over its execution. This which was to reduce the amount of data sketchy and difficult to obtain. means that the debugger has to essentially that were generated. There were also addi­ reverse much of the compiler’s carefully OMF stands for Object Module Format tions to support new languages such as the crafted transformations, converting the pro­ and is the object used in CP/M, up­and­coming C++ language. DWARF 2 gram’s data and state back into the terms DOS and OS/2 systems, as well as a small was released as a draft standard in 1990. that the programmer originally used in the number of embedded systems. OMF de­ In an example of the domino theory in program’s source. fines public name and line number infor­ action, shortly after PLSIG released the mation for debuggers and can also contain The challenge of a debugging data for­ draft standard, fatal flaws were discovered Microsoft CV, IBM PM, or AIX format de­ mat, like DWARF, is to make this possible in Motorola's 88000 microprocessor. Mo­ bugging data. OMF only provides the most and even easy. torola pulled the plug on the processor, rudimentary support for debuggers. which in turn resulted in the demise of IEEE­695 is a standard object file and Open88, a consortium of companies that Debugging Formats debugging format developed jointly by Mi­ were developing computers using the here are several debugging formats: crotec Research and HP in the late 1980’s 88000. Open88 in turn was a supporter of Tstabs, COFF, PE­COFF, OMF, IEEE­695, for embedded environments. It became an Unix International, sponsor of PLSIG, which and three versions of DWARF, to name IEEE standard in 1990. It is a very flexible resulted in UI being disbanded. When UI some common ones. I’m not going to de­ specification, intended to be usable with al­ folded, all that remained of the PLSIG was scribe these in any detail. The intent here most any machine architecture. The de­ a mailing list and a variety of ftp sites that is only to mention them to place the bugging format is block structured, which had various versions of the DWARF 2 draft DWARF Debugging Format in context. corresponds to the organization of the standard. A final standard was never re­ source better than other formats. Although leased. The name stabs comes from symbol ta­ it is an IEEE standard, in many ways IEEE­ ble strings, since the debugging data were Since Unix International had disap­ 695 is more like the proprietary formats. originally saved as strings in Unix’s a.out peared and PLSIG disbanded, several orga­ Although the original standard is readily object file’s symbol table. Stabs encodes nizations independently decided to extend available from IEEE, Microtec made a num­ the information about a program in text DWARF 1 and 2. Some of these extensions ber of extensions to support C++ and opti­ strings. Initially quite simple, stabs has were specific to a single architecture, but mzed code which are poorly documented. evolved over time into a quite complex, oc­ others might be applicable to any architec­ The IEEE standard was never revised to in­ casionally cryptic and less­than­consistent ture. Unfortunately, the different organiza­ corporate any of the Microtec and other debugging format. Stabs is not standard­ tions didn’t work together on these exten­ changes. ized nor well documented1. Sun Micro­ sions. Documentation on the extensions is systems has made a number of extensions generally spotty or difficult to obtain. Or as to stabs. GCC has made other extensions, a GCC developer might suggest, tongue while attempting to reverse engineer the 2 The name DWARF is something of a pun, since it was developed along with the object file format. The 1 In 1992, the author wrote an extensive document de­ name may be an acronym for “Debugging With Attribut­ scribing the stabs generated by Sun Microsytems' com­ ed Record Formats”, although this is not mentioned in pilers. Unfortunately, it was never widely distributed. any of the DWARF standards.

Introduction to the DWARF Debugging Format 2 Michael J. Eager firmly in cheek, the extensions were well a program, you first look in the current sion of a language on a limited range of ­ documented: all you have to do is read the scope, then in successive enclosing scopes chitectures. compiler source code. DWARF was well on until you find the symbol. There may be While DWARF is most commonly asso­ its way to following COFF and becoming a multiple definitions of the same name in ciated with the ELF object file format, it is collection of divergent implementations different scopes. very naturally independent of the object file format. It rather than being an industry standard. represent a program internally as a tree. can and has been used with other object DWARF follows this model in that it is file formats. All that is necessary is that the DWARF 3 ─ Free Standards also block structured. Each descriptive enti­ different data sections that make up the Group ty in DWARF (except for the topmost entry DWARF data be identifiable in the object espite several on­line discussions which describes the source file) is con­ file or executable. DWARF does not dupli­ Dabout DWARF on the PLSIG email list tained within a parent entry and may con­ cate information that is contained in the (which survived under X/Open (later Open tain children entities. If a node contains object file, such as identifying the processor Group) sponsorship after UI’s demise), multiple entities, they are all siblings, relat­ architecture or whether the file is written in there was little impetus to revise (or even ed to each other. The DWARF description big­endian or little­endian format. finalize) the document until the end of of a program is a tree structure, similar to 1999. At that time, there was interest in the compiler’s internal tree, where each Debugging extending DWARF to have better support node can have children or siblings. The for the HP/Intel IA­64 architecture as well nodes may represent types, variables, or Information Entry as better documentation of the ABI used by functions. This is a compact format where (DIE) C++ programs. These two efforts separat­ only the information that is needed to de­ ed, and the author took over as Chair for scribe an aspect of a program is provided. the revived DWARF Committee. The format is extensible in a uniform fash­ Tags and Attributes ion, so that a debugger can recognize and he descriptive entity in DWARF is Following more than 18 months of de­ ignore an extension, even if it might not Tthe Debugging Information Entry velopment work and creation of a draft of understand its meaning. (This is much bet­ (DIE). A DIE has a tag, which specifies the DWARF 3 specification, the standard­ ter than the situation with most other de­ what the DIE describes and a list of at­ ization hit what might be called a soft bugging formats where the debugger gets tributes which fill in details and further de­ patch. The committee (and this author, in fatally confused attempting to read modi­ scribes the entity. A DIE (except for the top­ particular) wanted to insure that the fied data.) DWARF is also designed to be most) is contained in or owned by a parent DWARF standard was readily available and extensible to describe virtually any proce­ DIE and may have sibling DIEs or children to avoid the possible divergence caused by dural on any ma­ DIEs. Attributes may contain a variety of multiple sources for the standard. The chine architecture, rather than being bound values: constants (such as a function DWARF Committee became the DWARF to only describing one language or one ver­ name), variables (such as the start address Workgroup of the Free Standards Group in 2003. Active development and clarification of the DWARF 3 Standard resumed early in 2005 with the goal to resolve any open is­ hello.c: sues in the standard. A public review draft 1: int main() was released to solicit public comments in 2: { October and the final version of the 3: printf("Hello World!\n"); DWARF 3 Standard was released in Jan­ 4: return 0; uary, 2006. After the Free Standards 5: } Group merged with Open Source Develop­ ment Labs (OSDL) to form the Foun­ DIE – Compilation Unit dation, the DWARF Committee returned to independent status and created its own Dir = /home/dwarf/examples web site at dwarfstd.org. Name = hello.c LowPC = 0x0 HighPC = 0x2b DWARF Overview Producer = GCC ost modern programming languages block structured: each entity (a class definition or a function, for example) is contained within another entity. Each DIE – Subprogram file in a C program may contain multiple data definitions, multiple variable defini­ Name = main DIE – Base Type tions, and multiple functions. Within each File = hello.c C function there may be several data defini­ Line = 2 Name = int tions followed by executable statements. A Type = int ByteSize = 4 statement may a compound statement that LowPC = 0x0 Encoding = signed in turn can contain data definitions and ex­ HighPC = 0x2b integer ecutable statements. This creates lexical External = yes scopes, where names are known only with­ in the scope in which they are defined. To find the definition of a particular symbol in Figure 1. Graphical representation of DWARF data

Introduction to the DWARF Debugging Format 3 Michael J. Eager for a function), or references to another tion for these types, C only specifies some ue is 16 bits wide and an offset from the DIE (such as for the type of a function’s re­ general characteristics, allowing the com­ high­order bit of zero. (This is a real­life ex­ turn value). plier to pick the actual specifications that ample taken from an implementation of hello.c best fit the target processor. Some lan­ Pascal that passed 16­bit integers in the top Figure 1 shows C's classic guages, like Pascal, allow new base types to half of a word on the stack.) program with a simplified graphical repre­ be defined, for example, an integer type The DWARF base types allow a number which can hold integer values be­ of different encodings to be described, in­ tween 0 and 100. Pascal doesn't cluding address, character, fixed point, DW_TAG_base_type specify how this should be imple­ floating point, and packed decimal, in addi­ DW_AT_name = int mented. One compiler might im­ DW_AT_byte_size = 4 tion to binary integers. There is still a little plement this as a single byte, an­ DW_AT_encoding = signed ambiguity remaining: for example, the ac­ other might use a 16­bit integer, tual encoding for a floating point number is a third might implement all inte­ Figure 2a. int base type on 32­bit processor. not specified; this is determined by the en­ ger types as 32­bit values no mat­ coding that the hardware actually supports. ter how they are defined. In a processor which supports both 32­bit With DWARF Version 1 and and 64­bit floating point values following DW_TAG_base_type DW_AT_name = int other debugging formats, the the IEEE­754 standard, the encodings rep­ DW_AT_byte_size = 2 compiler and debugger are sup­ resented by “float” are different depending DW_AT_encoding = signed posed to share a common under­ on the size of the value. standing about whether an int is Figure 2b. int base type on 16­bit processor 16, 32, or even 64 bits. This be­ comes awkward when the same sentation of its DWARF description. The hardware can support topmost DIE represents the compilation different size integers or when unit. It has two “children”, the first is the different compilers make differ­ DW_TAG_base_type DW_AT_name = word DIE describing main and the second de­ implementation decisions for the same target processor. These DW_AT_byte_size = 4 scribing the base type int which is the type DW_AT_bit_size = 16 of the value returned by main. The sub­ assumptions, often undocument­ DW_AT_bit_offset = 0 program DIE is a child of the compilation ed, make it difficult to have com­ DW_AT_encoding = signed unit DIE, while the base type DIE is refer­ patibility between different com­ enced by the Type attribute in the subpro­ pilers or debuggers, or even be­ Figure 3. 16­bit word type stored in the top 16­ gram DIE. We also talk about a DIE “own­ tween different versions of the bits of a 32­bit word. ing” or “containing” the children DIEs. same tools. DWARF base types provide Types of DIEs the lowest level mapping be­ <1>: DW_TAG_base_type DW_AT_name = int IEs can be split into two general types. tween the simple data types and how they are implemented on DW_AT_byte_size = 4 DThose that describe data including DW_AT_encoding = signed data types and those that describe functions the target machine's hardware. and other executable code. This makes the definition of int <2>: DW_TAG_variable explicit for both Java and C and DW_AT_name = x allows different definitions to be DW_AT_type = <1> Describing Data and used, possibly even within the Types same program. Figure 2a shows Figure 4. DWARF description of “int x”. the DIE which describes int on a ost programming languages have so­ typical 32­bit processor. The at­ Mphisticated descriptions of data. tributes specify the name (int), an encoding Type Composition There are a number of built­in data types, (signed binary integer), and the size in named variable is described by a DIE pointers, various data structures, and usual­ bytes (4). Figure 2b shows a similar defini­ ly ways of creating new data types. Since Awhich has a variety of attributes, one tion of int on a 16­bit processor. (In Figure of which is a reference to a type definition. DWARF is intended to be used with a vari­ 2, we use the tag and attribute names de­ ety of languages, it abstracts out the Figure 4 describes an integer variable fined in the DWARF standard, rather than named x. (For the moment we will ignore and provides a representation that can be the more informal names used above. The used for all supported language. The prima­ the other information that is usually con­ names of tags are all prefixed with tained in a DIE describing a variable.) ry types, built directly on the hardware, are DW_TAG and the names of attributes are the base types. Other data types are con­ prefixed with DW_AT.) The base type for int describes it as a structed as collections or compositions of signed binary integer occupying four bytes. these base types. The base types allow the compiler to The DW_TAG_variable DIE for x gives its describe almost any mapping between a name and a type attribute, which refers to Base Types programming language scalar type and the base type DIE. For clarity, the DIEs are how it is actually implemented on the pro­ labeled sequentially in the this and follow­ very programming language defines cessor. Figure 3 describes a 16­bit integer several basic scalar data types. For ex­ ing examples; in the actual DWARF data, a E value that is stored in the upper 16 bits of a reference to a DIE is the offset from the ample, both C and Java define int and dou­ four byte word. In this base type, there is a ble. While Java provides a complete defini­ start of the compilation unit where the DIE bit size attribute that specifies that the val­ can be found. References can be to previ­

Introduction to the DWARF Debugging Format 4 Michael J. Eager ously defined DIEs, as in Figure 4, or to Array vate, or protected. These are described with DIEs which are defined later. Once we the accessibility attribute. have created a base type DIE for int, any rray types are described by a DIE C and C++ allow bit fields as class variable in the same compilation can refer­ Awhich defines whether the data is members that are not simple variables. ence the same DIE3. stored in column major order (as in Fortan) or in row major order (as in C or C++). These are described with bit offset from the DWARF uses the base types to construct The index for the array is represented by a start of the class instance to the left­most other data type definitions by composition. subrange type that gives the lower and up­ bit of the bit field and a bit size that says A new type is created as a modification of per bounds of each dimension. This allows how many bits the member occupies. another type. For example, Figure 5 shows DWARF to describe both C style arrays, a pointer to an int on our typical 32­bit ma­ which always have zero as the lowest in­ Variables chine. This DIE defines a pointer type, spec­ dex, as well as arrays in Pascal or Ada, ifies that its size is four bytes, and in turn which can have any value for the low and ariables are generally pretty simple. references the int base type. Other DIEs de­ high bounds. VThey have a name which represents a scribe the const or volatile attributes, C++ chunk of memory (or register) that can reference type, or C restrict types. These contain some kind of a value. The kind of type DIEs can be chained together to de­ Structures, Classes, values that the variable can contain, as well Unions, and as restrictions on how it can be modified (e.g., whether it is const) are described by <1>: DW_TAG_variable Interfaces the type of the variable. DW_AT_name = px Most languages allow the DW_AT_type = <2> What distinguishes variables is where programmer to group data to­ the variable is stored and its scope. The <2>: DW_TAG_pointer_type gether into structures (called scope of a variable defines where the vari­ DW_AT_byte_size = 4 struct in C and C++, class in able known within the program and is, to DW_AT_type = <3> C++, and record in Pascal). Each some degree, determined by where the of the components of the struc­ <3>: DW_TAG_base_type variable is declared. In C, variables de­ DW_AT_name = word ture generally has a unique name clared within a function or block have func­ DW_AT_byte_size = 4 and may have a different type, tion or block scope. Those declared outside DW_AT_encoding = signed and each occupies its own space. a function have either global or file scope. C and C++ have the union and This allows different variables with the Figure 5. DWARF description of “int *px”. Pascal has the variant record that same name to be defined in different files are similar to a structure except without conflicting. It also allows different that the component occupy the functions or compilations to reference the same memory locations. The same variable. DWARF documents where Java interface has a subset of the the variable is declared in the source file <1>: DW_TAG_variable properties of a C++ class, since DW_AT_name = argv with a (file, line, column) triplet. DW_AT_type = <2> it may only have abstract meth­ ods and constant data members. DWARF splits variables into three cate­ <2>: DW_TAG_pointer_type gories: constants, function parameters, and DW_AT_byte_size = 4 Although each language has variables. A constant is used to describe DW_AT_type = <3> its own terminology (C++ calls languages that have true named constants the components of a class mem­ as part of the language, such as Ada param­ <3>: DW_TAG_pointer_type bers while Pascal calls them eters. (C does not have constants as part of DW_AT_byte_size = 4 fields) the underlying organiza­ DW_AT_type = <4> the language. Declaring a variable const just tion can be described in DWARF. says that you cannot modify the variable <4>: DW_TAG_const_type True to its heritage, DWARF uses without using an explicit cast.) A formal pa­ DW_AT_type = <5> the C/C++ terminology and has rameter represents values passed to a func­ DIEs which describe struct, tion. We'll come back to that a bit later. <5>: DW_TAG_base_type union, class, and interface. We'll DW_AT_name = char Some languages, like C or C++ (but DW_AT_byte_size = 1 describe the class DIE here, but DW_AT_encoding = unsigned the others have essentially the not Pascal), allow a variable to be declared same organization. without defining it. This implies that there Figure 6. DWARF description of should be a real definition of the variable The DIE for a class is the par­ “const char **argv”. somewhere else, hopefully somewhere that ent of the DIEs which describe the compiler or debugger can find. A DIE each of the class's members. describing a variable declaration provides a scribe more complex data types, such as Each class has a name and possi­ description of the variable without actually “const char **argv” which is de­ bly other attributes. If the size of an in­ telling the debugger where it is. scribed in Figure 6. stance is known at compile time, then it will have a byte size attribute. Each of these Most variables have a location attribute descriptions looks very much like the de­ that describes where the variable is stored. scription of a simple variable, although In the simplest of cases, a variable is stored 4 there may be some additional attributes. in memory and has a fixed address . But 3 Some compilers define a common set of type defini­ For example, C++ allows the programmer tions at the start of every compilation unit. Others only to specify whether a member is public, pri­ 4 Well, maybe not a fixed address, but one that is a fixed generate the definitions for the types which are actually offset from where the executable is loaded. The loader referenced in the program. Either is valid. relocates references to addresses within an executable

Introduction to the DWARF Debugging Format 5 Michael J. Eager many variables, such as those declared Describing A function may define variables that within a C function, are dynamically allo­ may be local or global. The DIEs for these cated and locating them requires some Executable Code variables follow the parameter DIEs. Many (usually simple) computation. For example, languages allow nesting of lexical blocks. a local variable may be allocated on the Functions and Subprograms These are represented by lexical block DIEs stack, and locating it may be as simple as which in turn, may own variable DIEs or adding a fixed offset to a frame pointer. In WARF treats functions that return val­ nested lexical block DIEs. other cases, the variable may be stored in a Dues and that do not as Here is a somewhat longer example. register. Other variables may require some­ variations of the same . Drifting slight­ Figure 8a shows the source for what more complicated computations to lo­ ly away from its roots in C terminology, cate the data. A variable that is a member DWARF describes both with a Subprogram strndup.c, a function in gcc that dupli­ of a C++ class may require more complex DIE. This DIE has a name, a source location cates a string. Figure 8b lists the DWARF computations to determine the location of triplet, and an attribute which indicates generated for this file. As in previous exam­ the base class within a derived class. whether the subprogram is external, that is, ples, the source line information and the lo­ visible outside the current compilation. cation attributes are not shown. Location A Subprogram DIE has attributes that In Figure 8b, DIE shows the defi­ give the low and high memory addresses nition of size_t which is a typdef of un- Expressions that the subprogram occupies, if it is con­ signed int. This allows a debugger to WARF provides a very general scheme tiguous, or a list of memory ranges if the display the type of formal argument n as a Dto describe how to locate the data rep­ function does not occupy a contiguous set size_t, while displaying its value as an resented by a variable. A DWARF location of memory addresses. The low PC address unsigned integer. DIE describes the expression contains a sequence of opera­ is assumed to be the entry point for the function strndup. This has a pointer to its tions which tell a debugger how to locate routine unless another one is explicitly sibling, DIE <10>; all of the following the data. Figure 7 shows DIEs for three specified. DIEs are children of the Subprogram DIE. variables named a, b, and c. Variable a has The value that a function returns is giv­ The function returns a pointer to char, de­ a fixed location in memory, variable b is in en by the type attribute. Subroutines that scribed in DIE <10>. DIE also de­ register 0, and variable c is at offset –12 do not return values (like C void functions) scribes the as external and pro­ within the current function's stack frame. do not have this attribute. DWARF doesn't totyped and gives the low and high PC val­ Although a was declared first, the DIE to describe the calling conventions for a func­ ues for the routine. The formal parameters describe it is generated later, after all func­ tion; that is defined in the Application Bina­ and local variables of the routine are de­ tions. The actual location of a will be filled ry Interface (ABI) for the particular archi­ scribed in DIEs <6> to <9>. in by the . tecture. There may be attributes that help a The DWARF location expression can debugger to locate the subpro­ contain a sequence of operators and values gram's data or to find the current that are evaluated by a simple stack ma­ subprogram's caller. The return address attribute is a location ex­ fig7.c: chine. This can be an arbitrarily complex 1: int a; computation, with a wide range of arith­ pression that specifies where the 2: void foo() metic operations, tests and branches within address of the caller is stored. 3: { the expression, calls to evaluate other loca­ The frame base attribute is a lo­ 4: register int b; tion expressions, and accesses to the pro­ cation expression that computes 5: int c; 6: } cessor's memory or registers. There are the address of the stack frame even operations used to describe data for the function. These are useful <1>: DW_TAG_subprogram which is split up and stored in different lo­ since some of the most common DW_AT_name = foo cations, such as a structure where some optimizations that a compiler <2>: DW_TAG_variable data are stored in memory and some are might do are to eliminate in­ DW_AT_name = b structions that explicitly save the DW_AT_type = <4> stored in registers. DW_AT_location = return address or frame pointer. Although this great flexibility is seldom (DW_OP_reg0) The subprogram DIE owns <3>: DW_TAG_variable used in practice, the location expression DW_AT_name = c should allow the location of a variable's DIEs that describe the subpro­ DW_AT_type = <4> data to be described no matter how com­ gram. The parameters that may DW_AT_location = plex the language definition or how clever be passed to a function are rep­ (DW_OP_fbreg: -12) the compiler's optimizations. resented by variable DIEs which <4>: DW_TAG_base_type have the variable parameter at­ DW_AT_name = int DW_AT_byte_size = 4 tribute. If the parameter is op­ DW_AT_encoding = signed tional or has a default value, <5>: DW_TAG_variable these are represented by at­ DW_AT_name = a tributes. The DIEs for the param­ DW_AT_type = <4> eters are in the same order as DW_AT_external = 1 DW_AT_location = the argument list for the func­ (DW_OP_addr: 0) tion, but there may be additional Figure 7. DWARF description of variables DIEs interspersed, for example, a, b, and c. so that at run­time the location attribute contains the to define types used by the pa­ actual memory address. In an object file, the location at­ tribute is the offset, along with an appropriate reloca­ rameters. tion table entry.

Introduction to the DWARF Debugging Format 6 Michael J. Eager about the compila­ are in the same order in which they appear strndup.c: 1: #include "ansidecl.h" tion, including the di­ in the source file. 2: #include rectory and name of 3: the source file, the Data encoding 4: extern size_t strlen (const char*); programming lan­ 5: extern PTR malloc (size_t); guage used, a string onceptually, the DWARF data that de­ 6: extern PTR memcpy (PTR, const PTR, size_t); which identifies the Cscribes a program is a tree. Each DIE 7: may have a sibling and several DIEs that it 8: char * producer of the 9: strndup (const char *s, size_t n) DWARF data, and off­ contains. Each of the DIEs has a type 10: { sets into the DWARF (called its TAG) and a number of attributes. 11: char *result; data sections to help Each attributes is represented by a attribute 12: size_t len = strlen (s); type and a value. Unfortunately, this is not 13: locate the line num­ ber and macro infor­ a very dense encoding. Without compres­ 14: if (n < len) sion, the DWARF data is unwieldy. 15: len = n; mation. 16: DWARF Versions 2 and 3 offer several 17: result = (char *) malloc (len + 1); If the compilation 18: if (!result) unit is contiguous ways to reduce the size of the data which 19: return 0; (i.e., it is loaded into needs to be saved with the object file. The 20: memory in one piece) first is to "flatten" the tree by saving it in 21: result[len] = '\0'; prefix order. Each type of DIE is defined to 22: return (char *) memcpy (result, s, len); then there are values for the low and high either have children or not. If the DIE can­ 23: } not have children, the next DIE is its sib­ Figure 8a. Source for strndup.c. memory addresses for the unit. This makes it ling. If the DIE can have children, then the easier for a debugger next DIE is its first child. The remaining children are represented as the siblings of this first child. This way, links to the sibling <1>: DW_TAG_base_type <7>: DW_TAG_formal_parameter or child DIEs can be eliminated. If the DW_AT_name = int DW_AT_name = n compiler writer thinks that it might be use­ DW_AT_byte_size = 4 DW_AT_type = <2> ful to be able to jump from one DIE to its DW_AT_encoding = signed DW_AT_location = sibling without stepping through each of its <2>: DW_TAG_typedef (DW_OP_fbreg: 4) children DIEs (for example, to jump to the DW_AT_name = size_t <8>: DW_TAG_variable DW_AT_type = <3> DW_AT_name = result next function in a compilation) then a sib­ <3>: DW_TAG_base_type DW_AT_type = <10> ling attribute can be added to the DIE. DW_AT_name = unsigned int DW_AT_location = DW_AT_byte_size = 4 (DW_OP_fbreg: -28) A second scheme to compress the data DW_AT_encoding = unsigned <9>: DW_TAG_variable is to use abbreviations. Although DWARF <4>: DW_TAG_base_type DW_AT_name = len allows great flexibility in which DIEs and DW_AT_name = long int DW_AT_type = <2> attributes it may generate, most compilers DW_AT_byte_size = 4 DW_AT_location = only generate a limited set of DIEs, all of DW_AT_encoding = signed (DW_OP_fbreg: -24) <5>: DW_TAG_subprogram <10>: DW_TAG_pointer_type which have the same set of attributes. In­ DW_AT_sibling = <10> DW_AT_byte_size = 4 stead of storing the value of the TAG of the DW_AT_external = 1 DW_AT_type = <11> DIE and the attribute­value pairs, only an DW_AT_name = strndup <11>: DW_TAG_base_type index into a table of abbreviations is stored, DW_AT_prototyped = 1 DW_AT_name = char followed by the attribute codes. Each ab­ DW_AT_type = <10> DW_AT_byte_size = 1 DW_AT_low_pc = 0 DW_AT_encoding = breviation gives the tag value, a flag indi­ DW_AT_high_pc = 0x7b signed char cating whether the DIE has children, and a <6>: DW_TAG_formal_parameter <12>: DW_TAG_pointer_type list of attributes with the type of value it ex­ DW_AT_name = s DW_AT_byte_size = 4 pects. Figure 9 shows the abbreviation for DW_AT_type = <12> DW_AT_type = <13> the formal parameter DIE used in Figure DW_AT_location = <13>: DW_TAG_const_type (DW_OP_fbreg: 0) DW_AT_type = <11> 8b. DIE <6> in Figure 8 is actually encod­ 5 Figure 8b. DWARF description for strndup.c. ed as shown . This is a significant reduction in the amount of data that needs to be saved at some expense in added complexi­ ty. Compilation Unit to identify which compilation unit created the code at a particular memory address. If Less commonly used are features of ost interesting programs consists of the compilation unit is not contiguous, then DWARF Version 3 which allow references more than a single file. Each source M a list of the memory addresses that the from one compilation unit to the DWARF file that makes up a program is compiled code occupies is provided by the compiler data stored in another compilation unit or independently and then linked together and linker. in a shared . Many compilers gener­ with system libraries to make up the pro­ ate the same abbreviation table or base gram. DWARF calls each separately com­ The Compilation Unit DIE is the parent types for every compilation, independent of piled source file a compilation unit. of all of the DIEs that describe the compila­ whether the compilation actually uses all of tion unit. Generally, the first DIEs will de­ The DWARF data for each compilation the abbreviations or types. These can be scribe data types, followed by global data, unit starts with a Compilation Unit DIE. then the functions that make up the source This DIE contains general information 5 file. The DIEs for variables and functions The encoded entry also includes the file and line values which are not shown in Fig. 8b.

Introduction to the DWARF Debugging Format 7 Michael J. Eager for this row in the line number program. Abbrev 5: DW_TAG_formal_parameter [no children] DW_AT_name DW_FORM_string Figure 10 lists the line number program for DW_AT_decl_file DW_FORM_data1 strndup.c. Notice that only the machine DW_AT_decl_line DW_FORM_data1 addresses that represent the beginning in­ DW_AT_type DW_FORM_ref4 struction of a statement are stored. The DW_AT_location DW_FORM_block1 compiler did not identify the basic blocks in this code, the end of the prolog or the start ┌──────────────────────────────── abbreviation 5 │ ┌──────────────────────────── ”s” of the epilog to the function. This table is │ │ ┌──────────────────────── file 1 encoded in just 31 bytes in the line number │ │ │ ┌───────────────────── line 41 program. │ │ │ │ ┌──────────────── type DIE offset │ │ │ │ │ ┌───────── location (fbreg + 0) Macro Information │ │ │ │ │ │ ┌───── terminating NUL 05 7300 01 29 0000010c 9100 00 ost debuggers have a very difficult Mtime displaying and debugging code Figure 9. Abbreviation entry and encoded form. which has macros. The user sees the origi­ nal source file, with the macros, while the code corresponds to whatever the macros saved in a shared library and referenced by presses this data by encoding it as sequence generated. each compilation unit, rather than being of instructions called a line number pro­ duplicated in each. gram6. These instructions are interpreted by DWARF includes the description of the a simple finite state machine to recreate the macros defined in the program. This is complete line number table. quite rudimentary information, but can be Other DWARF Data used by a debugger to display the values The finite state machine is initialized for a macro or possibly translate the macro Line Number Table with a set of default values. Each row in the into the corresponding source language. line number he DWARF line table contains the map­ table is gener­ Tping between the source lines (for the ated by exe­ executable parts of a program) and the cuting one or memory that contains the code that corre­ more of the sponds to the source. In the simplest form, opcodes of the Address File Line Col Stmt Block End Prolog Epilog ISA this can be looked at as a matrix with one line number column containing the memory addresses program. The 0x0 0 42 0 yes no no no no 0 and another column containing the source opcodes are triplet (file, line, and column). If you want 0x9 0 44 0 yes no no no no 0 generally 0x1a 0 44 0 yes no no no no 0 to set a breakpoint at a particular line, the quite simple: 0x24 0 46 0 yes no no no no 0 table gives you the memory address to for example, 0x2c 0 47 0 yes no no no no 0 store the breakpoint instruction. Converse­ add a value to 0x32 0 49 0 yes no no no no 0 ly, if your program has a fault (say, using a either the ma­ 0x41 0 50 0 yes no no no no 0 bad pointer) at some location in memory, chine address 0x47 0 51 0 yes no no no no 0 you can look for the source line that is clos­ or to the line 0x50 0 53 0 yes no no no no 0 est to the memory address. number, set 0x59 0 54 0 yes no no no no 0 0x6a 0 54 0 yes no no no no 0 DWARF has extended this with added the column number, or 0x73 0 55 0 yes no no no no 0 columns to convey additional information 0x7b 0 56 0 yes no yes no no 0 about a program. As a compiler optimizes set a flag File 0: strndup.c the program, it may move instructions which indi­ File 1: stddef.h around or remove them. The code for a giv­ cates that the en source statement may not be stored as a memory ad­ sequence of machine instructions, but may dress repre­ Figure 10. Line Number Table for strndup.c. be scattered and interleaved with the in­ sents the start structions for other nearby source state­ of an source ments. It may be useful to identify the end statement, the Call Frame Information of the code which represents the prolog of end of the function prolog, or the start of a function or the beginning of the epilog, so the function epilog. A set of special opcodes very processor has a certain way of that the debugger can stop after all of the combine the most common operations (in­ Ecalling functions and passing argu­ arguments to a function have been loaded crementing the memory address and either ments, usually defined in the ABI. In the or before the function returns. Some pro­ incrementing or decrementing the source simplest case, this is the same for each cessors can execute more than one instruc­ line number) into a single opcode. function and the debugger knows exactly tion set, so there is another column that in­ how to find the argument values and the Finally, if a row of the line number ta­ return address for the function. dicates which set is stored at the specified ble has the same source triplet as the previ­ machine location. ous row, then no instructions are generated For some processors, there may be dif­ ferent calling sequences depending on how As you might imagine, if this table were 6 Calling this a line number program is something of a the function is written, for example, if there stored with one row for each machine in­ misnomer. The program describes much more than just struction, it would be huge. DWARF com­ line numbers, such as instruction set, beginning of basic are more than a certain number of argu­ blocks, end of function prolog, etc. ments. There may be different calling se­

Introduction to the DWARF Debugging Format 8 Michael J. Eager quences depending on operating systems. ELF sections Summary Compilers will try to optimize the calling hile DWARF is defined in a way that sequence to make code both smaller and o there you have it ─ DWARF in a nut­ allows it to be used with any object faster. One common optimization is when W Sshell. Well, not quite a nutshell. The ba­ file format, it's most often used with ELF. there is a simple function which doesn't call sic concepts for the DWARF debug informa­ Each of the different kinds of DWARF data any others (a leaf function) to use its caller tion are straight­forward. A program is de­ are stored in their own section. The names stack frame instead of creating its own. An­ scribed as a tree with nodes representing of these sections all start with ".debug_". other optimization may be to eliminate a the various functions, data and types in the For improved efficiency, most references to register which points to the current call source in a compact language­ and ma­ DWARF data use an offset from the start of frame. Some registers may be preserved chine­independent fashion. The line table the data for the current compilation. This across the call while others are not. While it provides the mapping between the exe­ avoids the need to relocate the debugging may be possible for the debugger to puzzle cutable instructions and the source that data, which speeds up program loading and out all the possible permutations in calling generated them. The CFI describes how to debugging. sequence or optimizations, it is both te­ unwind the stack. dious and error­prone. A small change in The ELF sections and their contents are There is quite a bit of subtlety in the optimizations and the debugger may no DWARF as well, given that it needs to ex­ longer be able to walk the stack to the call­ .debug_abbrev Abbreviations used in the .debug_info section press the many different nuances for a wide ing function. range of programming languages and dif­ .debug_aranges A mapping between ferent machine architectures. Future direc­ The DWARF Call Frame Information memory address and (CFI) provides the debugger with enough compilation tions for DWARF are to improve the de­ scription of optimized code so that debug­ information about how a function is called .debug_frame Call Frame Information so that it can locate each of the arguments gers can better navigate the code which ad­ to the function, locate the current call vanced compiler optimizations generate. .debug_info The core DWARF data frame, and locate the call frame for the containing DIEs The complete DWARF Version 3 Stan­ calling function. This information is used by dard is available for download without cost .debug_line Line Number Program the debugger to "unwind the stack," locat­ at the DWARF website (dwarf.freestandard­ ing the previous function, the location s.org). There is also a mailing list for ques­ .debug_loc where the function was called, and the val­ Macro descriptions tions and discussion about DWARF. In­ ues passed. structions on registering for the mailing list .debug_macinfo Like the line number table, the CFI is A lookup table for global are also on the website. objects and functions encoded as a sequence of instructions that are interpreted to generate a table. There is .debug_pubnames A lookup table for global Acknowledgements one row in this table for each address that objects and functions I want to thank Chris Quenelle of Sun contains code. The first column contains .debug_pubtypes A lookup table for global the machine address while the subsequent types Microsystems and Ron Brender, formerly of HP, for their comments and advice about columns contain the values of the machine .debug_ranges Address ranges this paper. Thanks also to Susan Heimlich registers when the instruction at that ad­ referenced by DIEs dress is executed. Like the line number ta­ for her many editorial comments. .debug_str String table used by ble, if this table were actually created it .debug_info would be huge. Luckily, very little changes between two machine instructions, so the CFI encoding is quite compact.

Introduction to the DWARF Debugging Format 9 Michael J. Eager Generating DWARF with GCC It’s very simple to generate DWARF with gcc. Simply specify the –g option to generate debugging information. The ELF sections can be displayed using objump with the –h option. $ gcc –g –c strndup.c $ objdump –h strndup.o strndup.o: file format elf32-i386

Sections: Idx Name Size VMA LMA File off Algn 0 .text 0000007b 00000000 00000000 00000034 2**2 CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE 1 .data 00000000 00000000 00000000 000000b0 2**2 CONTENTS, ALLOC, LOAD, DATA 2 .bss 00000000 00000000 00000000 000000b0 2**2 ALLOC 3 .debug_abbrev 00000073 00000000 00000000 000000b0 2**0 CONTENTS, READONLY, DEBUGGING 4 .debug_info 00000118 00000000 00000000 00000123 2**0 CONTENTS, RELOC, READONLY, DEBUGGING 5 .debug_line 00000080 00000000 00000000 0000023b 2**0 CONTENTS, RELOC, READONLY, DEBUGGING 6 .debug_frame 00000034 00000000 00000000 000002bc 2**2 CONTENTS, RELOC, READONLY, DEBUGGING 7 .debug_loc 0000002c 00000000 00000000 000002f0 2**0 CONTENTS, READONLY, DEBUGGING 8 .debug_pubnames 0000001e 00000000 00000000 0000031c 2**0 CONTENTS, RELOC, READONLY, DEBUGGING 9 .debug_aranges 00000020 00000000 00000000 0000033a 2**0 CONTENTS, RELOC, READONLY, DEBUGGING 10 .comment 0000002a 00000000 00000000 0000035a 2**0 CONTENTS, READONLY 11 .note.GNU-stack 00000000 00000000 00000000 00000384 2**0 CONTENTS, READONLY

Printing DWARF using Readelf Readelf can display and decode the DWARF data in an object or executable file. The options are -w display all DWARF sections -w[liaprmfFso] display specific sections l line table i debug info a abbreviation table p public names r ranges m macro table f debug frame (encoded) F debug frame (decoded) s string table o location lists

The DWARF listing for all but the smallest programs is quite voluminous, so it would be a good idea to direct readelf’s output to a file and then browse the file with less or an editor such as vi.

Introduction to the DWARF Debugging Format 10 Michael J. Eager