Introduction to the DWARF Debugging Format Michael J. Eager, Eager Consulting April, 2012

It would be wonderful if we could write memory addresses, and binary values of low­level code to the original source programs that were guaranteed to work which the processor actually understands. which generated it. correctly and never needed to be debugged. After all, the processor really doesn't care The second challenge is how to describe Until that halcyon day, the normal pro­ whether you used object oriented program­ the program and its relationship gramming cycle is going to involve writing ming, templates, or smart pointers; it only to the original source with enough detail to a program, compiling it, executing it, and understands a very simple set of operations allow a debugger to provide the program­ then the (somewhat) dreaded scourge of on a limited number of registers and mem­ mer useful information. At the same time, debugging it. And then repeat until the pro­ ory locations containing binary values. the description has to be concise enough so gram works as expected. As a compiler reads and parses the that it does not take up an extreme amount It is possible to debug programs by in­ source of a program, it collects a variety of of space or require significant processor serting code that prints values of selected information about the program, such as the time to interpret. This is where the DWARF interesting variables. Indeed, in some situa­ line numbers where a variable or function Debugging Format comes in: it is a compact tions, such as debugging kernel drivers, this is declared or used. Semantic analysis ex­ representation of the relationship between may be the preferred method. There are tends this information to fill in details such the executable program and the source in a low­level debuggers that allow you to step as the types of variables and arguments of way that is reasonably efficient for a debug­ through the executable program, instruc­ functions. Optimizations may move parts of ger to process. tion by instruction, displaying registers and the program around, combine similar memory contents in binary. pieces, expand inline functions, or remove parts which are unneeded. Finally, code The Debugging But it is much easier to use a source­lev­ generation takes this internal representa­ Process el debugger which allows you to step tion of the program and generates the actu­ through a program's source, set break­ hen a programmer runs a program al machine instructions. Often, there is an­ points, print variable values, and perhaps a under a debugger, there are some other pass over the machine code to per­ W few other functions such as allowing you to common operations which he or she may form what are called "peephole" optimiza­ call a function in your program while in the want to do. The most common of these are tions that may further rearrange or modify debugger. The problem is how to coordi­ setting a breakpoint to stop the debugger at the code, for example, to eliminate dupli­ nate two completely different programs, a particular point in the source, either by cate instructions. the compiler and the debugger, so that the specifying the line number or a function program can be debugged. All­in­all, the compiler's task is to take name. When this breakpoint is hit, the pro­ the well­crafted and understandable source grammer usually would like to display the code and convert it into efficient but essen­ values of local or global variables, or the ­ Translating from tially unintelligible machine language. The guments to the function. Displaying the Source to Executable better the compiler achieves the goal of cre­ call stack lets the programmer know how the program arrived at the breakpoint in he process of compiling a program ating tight and fast code, the more likely it cases where there are multiple execution from human­readable form into the bi­ is that the result will be difficult to under­ T paths. After reviewing this information, the nary form that a processor executes is quite stand. programmer can ask the debugger to con­ complex, but it essentially involves succes­ During this translation process, the tinue execution of the program under test. sively recasting the source into simpler and compiler collects information about the simpler forms, discarding information at program which will be useful later when There are a number of additional opera­ each step until, eventually, the result is the the program is debugged. There are two tions that are useful in debugging. For ex­ sequence of simple operations, registers, challenges to doing this well. The first is ample, it may be helpful to be able to step that in the later parts of this process, it may through a program line by line, either en­ tering or stepping over called functions. Michael Eager is Principal Consultant at be difficult for the compiler to relate the Setting a breakpoint at every instance of a Eager Consulting (www.eagercon.com), changes it is making to the program to the template or inline function can be impor­ specializing in development tools for original source code that the programmer tant for debugging ++ programs. It can embedded systems. He was a member wrote. For example, the peephole optimizer be helpful to stop just before the end of a of PLSIG's DWARF standardization com­ may remove an instruction because it was function so that the return value can be dis­ mittee and has been Chair of the able to switch around the order of a test in played or changed. Sometimes the pro­ DWARF Standards Committee since code that was generated by an inline func­ grammer may want to bypass execution of 1999. Michael can be contacted at tion in the instantiation of a C++ template. a function, returning a known value instead [email protected]. By the time it gets its metaphorical hands of what the function would have (possibly © Eager Consulting, 2006, 2007, 2012 on the program, the optimizer may have a difficult time connecting its manipulations incorrectly) computed. There are also data related operations while attempting to reverse engineer the A Brief History of that are useful. For example, displaying Sun extensions. Nonetheless, stabs is still the type of a variable can avoid having to widely used. DWARF look up the type in the source files. Dis­ COFF stands for Common playing the value of a variable in different Format and originated with System V DWARF 1 ─ Unix SVR4 sdb formats, or displaying a memory or register Release 3. Rudimentary debugging infor­ in a specified format is helpful. and PLSIG mation was defined with the COFF format, WARF3 was developed by Brian Rus­ There are some operations which might but since COFF includes support for named Dsell, Ph.D., at Bell Labs in 1988 for use be called advanced debugging functions: sections, a variety of different debugging with the C compiler and sdb debugger in for example, being able to debug multi­ formats such as stabs have been used with Unix System V Release 4 (SVR4). The Pro­ threaded programs or programs stored in COFF. The most significant problem with gramming Languages Special Interest read­only memory. One might want a de­ COFF is that despite the Common in its Group (PLSIG), part of Unix International bugger (or some other program analysis name, it isn’t the same in each architecture (UI), documented the DWARF generated by tool) to keep track of whether certain sec­ which uses the format. There are many SVR4 as DWARF Version 1 in 1992. Al­ tions of code had been executed or not. variations in COFF, including XCOFF (used though the original DWARF had several Some debuggers allow the programmer to on IBM RS/6000), ECOFF (used on MIPS clear shortcomings, most notably that it call functions in the program being tested. and Alpha), and Windows PE­COFF. Docu­ was not very compact, the PLSIG decided to In the not­so­distant past, debugging pro­ mentation of these variants is available to standardize the SVR4 format with only grams that had been optimized would have varying degrees but neither the object mod­ minimal modification. It was widely adopt­ been considered an advanced feature. ule format nor the debugging information ed within the embedded sector where it is standardized. The task of a debugger is to provide the continues to be used today, especially for programmer with a view of the executing PE­COFF is the object module format small processors. program in as natural and understandable used by Windows beginning with fashion as possible, while permitting a wide Windows 95. It is based on the COFF for­ DWARF 2 ─ PLSIG range of control over its execution. This mat and contains both COFF debugging he PLSIG continued to develop and means that the debugger has to essentially data and Microsoft’s own proprietary Code­ document extensions to DWARF to ad­ reverse much of the compiler’s carefully View or CV4 debugging data format. Docu­ T dress several issues, the most important of crafted transformations, converting the pro­ mentation on the debugging format is both which was to reduce the size of debugging gram’s data and state back into the terms sketchy and difficult to obtain. data that were generated. There were also that the programmer originally used in the OMF stands for Object Module Format additions to support new languages such as program’s source. and is the object used in CP/M, the up­and­coming C++ language. DWARF The challenge of a debugging data for­ DOS and OS/2 systems, as well as a small Version 2 was released as a draft standard mat, like DWARF, is to make this possible number of embedded systems. OMF de­ in 1993. and even easy. fines public name and line number infor­ In an example of the domino theory in mation for debuggers and can also contain action, shortly after PLSIG released the Microsoft CV, IBM PM, or AIX format de­ draft standard, fatal flaws were discovered Debugging Formats bugging data. OMF only provides the most in Motorola's 88000 microprocessor. Mo­ here are several debugging formats: rudimentary support for debuggers. torola pulled the plug on the processor, stabs, COFF, PE­COFF, OMF, IEEE­695, T which in turn resulted in the demise of and two variants1 of DWARF, to name some IEEE­695 is a standard object file and Open88, a consortium of companies that common ones. I’m not going to describe debugging format developed jointly by Mi­ were developing computers using the these in any detail. The intent here is only crotec Research and HP in the late 1980’s 88000. Open88 in turn was a supporter of to mention them to place the DWARF De­ for embedded environments. It became an Unix International, sponsor of PLSIG, which bugging Format in context. IEEE standard in 1990. It is a very flexible specification, intended to be usable with al­ resulted in UI being disbanded. When UI The name stabs comes from symbol ta­ most any machine architecture. The de­ folded, all that remained of the PLSIG was ble strings, since the debugging data were bugging format is block structured, which a mailing list and a variety of ftp sites that originally saved as strings in Unix’s a.out corresponds to the organization of the had various versions of the DWARF 2 draft object file’s symbol table. Stabs encodes source better than other formats. Although standard. A final standard was never re­ the information about a program in text it is an IEEE standard, in many ways IEEE­ leased. strings. Initially quite simple, stabs has 695 is more like the proprietary formats. Since Unix International had disap­ evolved over time into a quite complex, oc­ Although the original standard is readily peared and PLSIG disbanded, several orga­ casionally cryptic and less­than­consistent available from IEEE, Microtec Research nizations independently decided to extend debugging format. Stabs is not standard­ made extensions to support C++ and opti­ DWARF 1 and 2. Some of these extensions ized nor well documented2. Sun Microsys­ mized code which are poorly documented. were specific to a single architecture, but tems has made a number of extensions to The IEEE standard was never revised to in­ others might be applicable to any architec­ stabs. GCC has made other extensions, corporate the Microtec Research or other ture. Unfortunately, the different organiza­ changes. Despite being an IEEE standard, tions didn’t work together on these exten­ 1 it's use is limited to a few small processors. DWARF Version 1 is significantly different from sions. Documentation on the extensions is Versions 2 and later. 2 In 1992, the author wrote an extensive docu­ 3 The name DWARF is something of a pun, since ment describing the stabs generated by Sun Mi­ it was developed along with the ELF object file crosytems' compilers. Unfortunately, it was never format. The name is an acronym for “Debugging widely distributed. With Arbitrary Record Formats”.

Introduction to the DWARF Debugging Format 2 Michael J. Eager generally spotty or difficult to obtain. Or as Standard was released in June, 2010, fol­ While DWARF is most commonly asso­ a GCC developer might suggest, tongue lowing a public review. ciated with the ELF object file format, it is firmly in cheek, the extensions were well independent of the object file format. It Work on DWARF Version 5 started in documented: all you have to do is read the can and has been used with other object February, 2012. This version is expected compiler source code. DWARF was well on file formats. All that is necessary is that the to be completed in 2014. its way to following COFF and becoming a different data sections that make up the collection of divergent implementations DWARF data be identifiable in the object 4 rather than being an industry standard. DWARF Overview file or executable. DWARF does not dupli­ cate information that is contained in the ost modern programming languages object file, such as identifying the processor DWARF 3 ─ Free Standards are block structured: each entity (a M architecture or whether the file is written in class definition or a function, for example) Group big­endian or little­endian format. espite several on­line discussions is contained within another entity. Each Dabout DWARF on the PLSIG email list file in a C program may contain multiple (which survived under X/Open [later Open data definitions, multiple variable defini­ Debugging Group] sponsorship after UI’s demise), tions, and multiple functions. Within each Information Entry there was little impetus to revise (or even C function there may be several data defini­ finalize) the document until the end of tions followed by executable statements. A (DIE) 1999. At that time, there was interest in statement may be a compound statement extending DWARF to have better support that in turn can contain data definitions Tags and Attributes and executable statements. This creates lex­ for the HP/Intel IA­64 architecture as well he basic descriptive entity in DWARF is as better documentation of the ABI used by ical scopes, where names are known only within the scope in which they are defined. Tthe Debugging Information Entry C++ programs. These two efforts separat­ (DIE). A DIE has a tag, which specifies ed, and the author took over as Chair for To find the definition of a particular symbol what the DIE describes and a list of at­ the revived DWARF Committee. in a program, you first look in the current scope, then in successive enclosing scopes tributes which fill in details and further de­ Following more than 18 months of de­ until you find the symbol. There may be scribes the entity. A DIE (except for the top­ velopment work and creation of a draft of multiple definitions of the same name in most) is contained in or owned by a parent the DWARF 3 specification, the standard­ different scopes. Compilers very naturally DIE and may have sibling DIEs or children ization effort hit what might be called a soft represent a program internally as a tree. DIEs. Attributes may contain a variety of patch. The committee (and this author, in values: constants (such as a function particular) wanted to insure that the DWARF follows this model in that it is name), variables (such as the start address DWARF standard was readily available and also block structured. Each descriptive enti­ for a function), or references to another to avoid the possible divergence caused by ty in DWARF (except for the topmost entry DIE (such as for the type of a function’s re­ multiple sources for the standard. The which describes the source file) is con­ turn value). DWARF Committee became the DWARF tained within a parent entry and may con­ Figure 1 shows C's classic hello.c Workgroup of the Free Standards Group in tain children entities. If a node contains program with a simplified graphical repre­ 2003. Active development and clarification multiple entities, they are all siblings, relat­ sentation of its DWARF description. The of the DWARF 3 Standard resumed early in ed to each other. The DWARF description topmost DIE represents the compilation 2005 with the goal to resolve any open is­ of a program is a tree structure, similar to unit. It has two “children”, the first is the sues in the standard. A public review draft the compiler’s internal tree, where each DIE describing main and the second de­ was released to solicit public comments in node can have children or siblings. The scribing the base type int which is the type October and the final version of the nodes may represent types, variables, or of the value returned by main. The sub­ DWARF 3 Standard was released in Decem­ functions. This is a compact format where program DIE is a child of the compilation ber, 2005. only the information that is needed to de­ scribe an aspect of a program is provided. unit DIE, while the base type DIE is refer­ The format is extensible in a uniform fash­ enced by the Type attribute in the subpro­ DWARF 4 ─ DWARF ion, so that a debugger can recognize and gram DIE. We also talk about a DIE “own­ Debugging Format Committee ignore an extension, even if it might not ing” or “containing” the children DIEs. After the Free Standards Group merged understand its meaning. (This is much bet­ with Open Source Development Labs ter than the situation with most other de­ Types of DIEs (OSDL) in 2007 to form the Linux Founda­ bugging formats where the debugger gets fatally confused attempting to read unrec­ IEs can be split into two general types. tion, the DWARF Committee returned to in­ Those that describe data including dependent status and created its own web ognized data.) DWARF is also designed to D be extensible to describe virtually any pro­ data types and those that describe functions site at dwarfstd.org. Work began on Ver­ and other executable code. sion 4 of the DWARF in 2007. This version cedural programming language on any ma­ clarified DWARF expressions, added sup­ chine architecture, rather than being bound port for VLIW architectures, improved lan­ to only describing one language or one ver­ Describing Data and guage support, generalized support for sion of a language on a limited range of ar­ Types packed data, added a new technique for chitectures. compressing the debug data by eliminating ost programming languages have so­ duplicate type descriptions, and added sup­ Mphisticated descriptions of data. There are a number of built­in data types, port for profile­based compiler optimiza­ 4 In the remainder of this paper, we will be dis­ tions, as well as extensive editing of the pointers, various data structures, and usual­ cussing DWARF Version 2 and later versions. ly ways of creating new data types. Since documentation. The DWARF Version 4 Unless otherwise noted, all descriptions apply to DWARF Versions 2 through 4. DWARF is intended to be used with a vari­

Introduction to the DWARF Debugging Format 3 Michael J. Eager used, possibly even within the same pro­ hello.c: gram. Figure 2a shows the DIE which de­ 1: int main() 2: { scribes int on a typical 32­bit processor. 3: printf("Hello World!\n"); The attributes specify the name (int), an 4: return 0; encoding (signed binary integer), and the 5: } size in bytes (4). Figure 2b shows a similar definition of int on a 16­bit processor. (In Figure 2, we use the tag and attribute DIE – Compilation Unit names defined in the DWARF standard, rather than the more informal names used Dir = /home/dwarf/examples in Figure 1. The names of tags are all pre­ Name = hello.c LowPC = 0x0 fixed with DW_TAG and the names of at­ HighPC = 0x2b tributes are prefixed with DW_AT.) Producer = GCC The base types allow the compiler to describe almost any mapping between a programming language scalar type and how it is actually implemented on the pro­ DIE – Subprogram cessor. Figure 3 describes a 16­bit integer value that is stored in the upper 16 bits of a DIE – Base Type Name = main four byte word. In this base type, there is a File = hello.c bit size attribute that specifies that the val­ Line = 2 Name = int Type = int ByteSize = 4 ue is 16 bits wide and an offset from the 5 LowPC = 0x0 Encoding = signed high­order bit of zero . HighPC = 0x2b integer External = yes The DWARF base types allow a number of different encodings to be described, in­ cluding address, character, fixed point, floating point, and packed decimal, in addi­ Figure 1. Graphical representation of DWARF data tion to binary integers. There is still a little ambiguity remaining: for example, the ac­ tual encoding for a floating point number is ety of languages, it abstracts out the basics which can hold integer values between 0 not specified; this is determined by the en­ and provides a representation that can be and 100. Pascal doesn't specify how this coding that the hardware actually supports. used for all supported language. The prima­ should be implemented. One compiler In a processor which supports both 32­bit ry types, built directly on the hardware, are might implement this as a single byte, an­ and 64­bit floating point values following the base types. Other data types are con­ other might use a 16­bit integer, a third the IEEE­754 standard, the encodings rep­ structed as collections or compositions of might implement all integer types as 32­bit resented by “float” are different depending these base types. values no matter how they are defined. on the size of the value. With DWARF Version 1 and Base Types other debugging formats, the very programming language defines compiler and debugger are sup­ DW_TAG_base_type Eseveral basic scalar data types. For ex­ posed to share a common under­ DW_AT_name = word ample, both C and Java define int and dou­ standing about whether an int is DW_AT_byte_size = 4 ble. While Java provides a complete defini­ 16, 32, or even 64 bits. This be­ DW_AT_bit_size = 16 DW_AT_bit_offset = 0 comes awkward when the same tion for these types, C only specifies some DW_AT_encoding = signed general characteristics, allowing the com­ hardware can support different plier to pick the actual specifications that size integers or when different Figure 3. 16­bit word type stored in the top 16­ best fit the target processor. Some lan­ compilers make different imple­ bits of a 32­bit word. guages, like Pascal, allow new base types to mentation decisions for the same be defined, for example, an integer type target processor. These assump­ tions, often undocu­ Type Composition mented, make it difficult to have DW_TAG_base_type named variable is described by a DIE DW_AT_name = int compatibility between different which has a variety of attributes, one DW_AT_byte_size = 4 compilers or debuggers, or even A DW_AT_encoding = signed between different versions of the of which is a reference to a type definition. same tools. Figure 4 describes an integer variable Figure 2a. int base type on 32­bit processor. named x. (For the moment we will ignore DWARF base types provide the other information that is usually con­ the lowest level mapping be­ tained in a DIE describing a variable.) DW_TAG_base_type tween the simple data types and DW_AT_name = int how they are implemented on The base type for int describes it as a DW_AT_byte_size = 2 the target machine's hardware. signed binary integer occupying four bytes. DW_AT_encoding = signed This makes the definition of int explicit for both Java and C and 5 This is a real­life example taken from an imple­ Figure 2b. int base type on 16­bit processor allows different definitions to be mentation of Pascal that passed 16­bit integers in the top half of a word on the stack.

Introduction to the DWARF Debugging Format 4 Michael J. Eager The DW_TAG_variable DIE for x gives its stored in column major order (as in Fortan) vate, or protected. These are described with name and a type attribute, which refers to or in row major order (as in C or C++). The the accessibility attribute. the base type DIE. For clarity, the DIEs are index for the array is represented by a sub­ C and C++ allow bit fields as class labeled sequentially in the this and follow­ range type that gives the lower and upper members that are not simple variables. ing examples; in the actual DWARF data, a bounds of each dimension. This allows These are described with bit offset from the reference to a DIE is the offset from the DWARF to describe both C style arrays, start of the class instance to the left­most start of the compilation unit where the DIE which always have zero as the lowest in­ bit of the bit field and bit size that says how can be found. References can be to previ­ dex, as well as arrays in Pascal or Ada, many bits the member occupies. ously defined DIEs, as in Figure 4, or to which can have any value for the low and DIEs which are defined later. Once we high bounds. have created a base type DIE for int, any Variables variable in the same compilation can refer­ ariables are generally pretty simple. ence the same DIE6. Structures, Classes, VThey have a name which represents a DWARF uses the base types to construct Unions, and chunk of memory (or register) that can other data type definitions by composition. Interfaces contain some kind of a value. The kind of values that the variable can contain, as well A new type is created as a modification of Most languages allow the programmer as restrictions on how it can be modified another type. For example, Figure 5 shows to group data together into structures (e.g., whether it is const) are described by a pointer to an int on our typical 32­bit ma­ (called struct in C and C++, class in C++, the type of the variable. chine. This DIE defines a pointer type, spec­ and record in Pascal). Each of the compo­ ifies that its size is four bytes, and in turn nents of the structure generally has a What distinguishes a variable is where references the int base type. Other DIEs de­ unique name and may have a its value is stored and its scope. The scope different type, and each occupies of a variable defines where the variable <1>: DW_TAG_base_type its own space. C and C++ have known within the program and is, to some DW_AT_name = int the union and Pascal has the degree, determined by where the variable is DW_AT_byte_size = 4 variant record that are similar to declared. In C, variables declared within a DW_AT_encoding = signed a structure except that <2>: DW_TAG_variable the component occupy DW_AT_name = x the same memory lo­ <1>: DW_TAG_variable DW_AT_type = <1> cations. The Java in­ DW_AT_name = argv terface has a subset of DW_AT_type = <2> Figure 4. DWARF description of “int x”. the properties of a C+ <2>: DW_TAG_pointer_type + class, since it may DW_AT_byte_size = 4 only have abstract DW_AT_type = <3> methods and constant <1>: DW_TAG_variable data members. <3>: DW_TAG_pointer_type DW_AT_name = px DW_AT_byte_size = 4 DW_AT_type = <2> Although each lan­ DW_AT_type = <4> guage has its own ter­ <2>: DW_TAG_pointer_type minology (C++ calls <4>: DW_TAG_const_type DW_AT_byte_size = 4 DW_AT_type = <5> DW_AT_type = <3> the components of a class members while <5>: DW_TAG_base_type <3>: DW_TAG_base_type Pascal calls them DW_AT_name = char DW_AT_name = int fields) the underlying DW_AT_byte_size = 1 DW_AT_byte_size = 4 organization can be DW_AT_encoding = unsigned DW_AT_encoding = signed described in DWARF. True to its heritage, Figure 6. DWARF description of Figure 5. DWARF description of “int *px”. DWARF uses the C/C+ “const char **argv”. +/Java terminology scribe the const or volatile attributes, C++ and has DIEs which de­ function or block have function or block reference type, or C restrict types. These scribe struct, union, class, and interface. scope. Those declared outside a function type DIEs can be chained together to de­ We'll describe the class DIE here, but the have either global or file scope. This allows scribe more complex data types, such as others have essentially the same organiza­ different variables with the same name to “const char **argv” which is described tion. be defined in different files without con­ flicting. It also allows different functions or in Figure 6. The DIE for a class is the parent of the compilations to reference the same vari­ DIEs which describe each of the class's able. DWARF documents where the vari­ members. Each class has a name and possi­ Array able is declared in the source file with a bly other attributes. If the size of an in­ (file, line, column) triplet. rray types are described by a DIE stance is known at compile time, then it which defines whether the data is A will have a byte size attribute. Each of these DWARF splits variables into three cate­ descriptions looks very much like the de­ gories: constants, formal parameters, and 6 Some compilers define a common set of type scription of a simple variable, although definitions at the start of every compilation unit. variables. A constant is used with languages Others only generate the definitions for the types there may be some additional attributes. that have true named constants as part of which are actually referenced in the program. For example, C++ allows the programmer the language, such as Ada parameters. (C Either is valid. to specify whether a member is public, pri­

Introduction to the DWARF Debugging Format 5 Michael J. Eager does not have constants as part of the lan­ adding a fixed offset to a frame pointer. In DIE. This DIE has a name, a source location guage. Declaring a variable const just says other cases, the variable may be stored in a triplet, and an attribute which indicates that you cannot modify the variable with­ register. Other variables may require some­ whether the subprogram is external, that is, out using an explicit cast.) A formal param­ what more complicated computations to lo­ visible outside the current compilation. eter represents values passed to a function. cate the data. A variable that is a member A subprogram DIE has attributes that We'll come back to that a bit later. of a C++ class may require more complex give the low and high memory addresses computations to determine the location of Some languages, like C or C++ (but not that the subprogram occupies, if it is con­ the base class within a derived class. Pascal), allow a variable to be declared tiguous, or a list of memory ranges if the without defining it. This implies that there function does not occupy a contiguous set should be a real definition of the variable Location Expressions of memory addresses. The low PC address somewhere else, hopefully somewhere that WARF provides a very general scheme is assumed to be the for the the compiler or debugger can find. A DIE Dto describe how to locate the data rep­ routine unless another one is explicitly describing a variable declaration provides a resented by a variable. A DWARF location specified. description of the variable without actually expression contains a sequence of opera­ The value that a function returns is giv­ telling the debugger where it is. tions which tell a debugger how to locate en by the type attribute. Subroutines that Most variables have a location attribute the data. Figure 7 shows DIEs for three do not return values (like C void functions) that describes where the variable is stored. variables named a, b, and c. Variable a has do not have this attribute. DWARF doesn't In the simplest of cases, a variable is stored a fixed location in memory, variable b is in describe the calling conventions for a func­ register 0, and variable c is at tion; that is defined in the Application Bina­ offset –12 within the current ry Interface (ABI) for the particular archi­ fig7.c: function's stack frame. Although tecture. There may be attributes that help a 1: int a; a was declared first, the DIE to 2: void foo() debugger to locate the subprogram's data 3: { describe it is generated later, af­ or to find the current subprogram's caller. 4: register int b; ter all functions. The actual lo­ The return address attribute is a location ex­ 5: int c; cation of a will be filled in by pression that specifies where the address of 6: } the . the caller is stored. The frame base attribute <1>: DW_TAG_subprogram The DWARF location expres­ is a location expression that computes the DW_AT_name = foo sion can contain a sequence of address of the stack frame for the function. <2>: DW_TAG_variable operators and values that are These are useful since some of the most DW_AT_name = b evaluated by a simple stack ma­ common optimizations that a compiler DW_AT_type = <4> chine. This can be an arbitrarily might do are to eliminate instructions that DW_AT_location = (DW_OP_reg0) explicitly save the return address or frame <3>: DW_TAG_variable complex computation, with a DW_AT_name = c wide range of arithmetic opera­ pointer. DW_AT_type = <4> tions, tests and branches within The subprogram DIE owns DIEs that de­ DW_AT_location = the expression, calls to evaluate scribe the subprogram. The parameters that (DW_OP_fbreg: -12) other location expressions, and <4>: DW_TAG_base_type may be passed to a function are represent­ DW_AT_name = int accesses to the processor's mem­ ed by variable DIEs which have the variable DW_AT_byte_size = 4 ory or registers. There are even parameter attribute. If the parameter is op­ DW_AT_encoding = signed operations used to describe data tional or has a default value, these are rep­ <5>: DW_TAG_variable which is split up and stored in resented by attributes. The DIEs for the pa­ DW_AT_name = a different locations, such as a rameters are in the same order as the argu­ DW_AT_type = <4> structure where some data are DW_AT_external = 1 ment list for the function, but there may be DW_AT_location = (DW_OP_addr: 0) stored in memory and some are additional DIEs interspersed, for example, stored in registers. to define types used by the parameters. Figure 7. DWARF description of variables Although this great flexibili­ a, b, and c. A function may define variables that ty is seldom used in practice, may be local or global. The DIEs for these the location expression should variables follow the parameter DIEs. Many allow the location of a variable's in memory and has a fixed address7. But languages allow nesting of lexical blocks. data to be described no matter how com­ many variables, such as those declared These are represented by lexical block DIEs plex the language definition or how clever within a C function, are dynamically allo­ which in turn, may own variable DIEs or the compiler's optimizations. cated and locating them requires some nested lexical block DIEs. (usually simple) computation. For example, Here is a somewhat longer example. a local variable may be allocated on the Describing Figure 8a shows the source for stack, and locating it may be as simple as Executable Code strndup.c, a function in gcc that dupli­ cates a string. Figure 8b lists the DWARF 7 Well, maybe not a fixed address, but one that is Functions and Subprograms generated for this file. As in previous exam­ a fixed offset from where the executable is load­ ples, the source line information and the lo­ ed. The loader relocates references to addresses WARF treats functions that return val­ cation attributes are not shown. within an executable so that at run­time the loca­ Dues and subroutines that do not as tion attribute contains the actual memory ad­ variations of the same thing. Drifting slight­ In Figure 8b, DIE shows the defi ­ dress. In an object file, the location attribute is ly away from its roots in C terminology, nition of size_t which is a typdef of un- the offset, along with an appropriate DWARF describes both with a subprogram signed int. This allows a debugger to table entry.

Introduction to the DWARF Debugging Format 6 Michael J. Eager display the type of formal argument n as a Compilation Unit particular memory address. If the compila­ size_t, while displaying its value as an tion unit is not contiguous, then a list of the ost interesting programs consists of unsigned integer. DIE describes the memory addresses that the code occupies is more than a single file. Each source function . This has a pointer to its M provided by the compiler and linker. strndup file that makes up a program is compiled sibling, DIE <10>; all of the following independently and then linked together The Compilation Unit DIE is the parent DIEs are children of the Subprogram DIE. with system libraries to make up the pro­ of all of the DIEs that describe the compila­ The function returns a pointer to char, de­ gram. DWARF calls each separately com­ tion unit. Generally, the first DIEs will de­ scribed in DIE <10>. DIE also de ­ piled source file a compilation unit. scribe data types, followed by global data, scribes the subroutine as external and pro­ then the functions that make up the source The DWARF data for each compilation totyped and gives the low and high PC val­ file. The DIEs for variables and functions unit starts with a Compilation Unit DIE. ues for the routine. The formal parameters are in the same order in which they appear This DIE contains general information and local variables of the routine are de­ in the source file. scribed in DIEs <6> to <9>. about the compilation, including the direc­ tory and name of the source file, the pro­ Data encoding strndup.c: gramming language 1: #include "ansidecl.h" onceptually, the DWARF data that de­ 2: #include used, a string which Cscribes a program is a tree. Each DIE 3: identifies the produc­ may have a sibling and maybe several chil­ 4: extern size_t strlen (const char*); er of the DWARF dren DIEs. Each of the DIEs has a type 5: extern PTR malloc (size_t); data, and offsets into (called its TAG) and a number of attributes. 6: extern PTR memcpy (PTR, const PTR, size_t); the DWARF data sec­ Each attributes is represented by a attribute 7: tions to help locate 8: char * type and a value. Unfortunately, this is not 9: strndup (const char *s, size_t n) the line number and a very dense encoding. Without compres­ 10: { macro information. sion, the DWARF data is unwieldy. 11: char *result; 12: size_t len = strlen (s); If the compilation DWARF offers several ways to reduce 13: unit is contiguous the size of the data which needs to be saved 14: if (n < len) (i.e., it is loaded into with the object file. The first is to "flatten" 15: len = n; memory in one piece) the tree by saving it in prefix order. Each 16: then there are values type of DIE is defined to either have chil­ 17: result = (char *) malloc (len + 1); for the low and high 18: if (!result) dren or not. If the DIE cannot have chil­ 19: return 0; memory addresses for dren, the next DIE is its sibling. If the DIE 20: the unit. This makes it can have children, then the next DIE is its 21: result[len] = '\0'; easier for a debugger first child. The remaining children are rep­ 22: return (char *) memcpy (result, s, len); to identify which resented as the siblings of this first child. 23: } compilation unit cre­ This way, links to the sibling or child DIEs Figure 8a. Source for strndup.c. ated the code at a can be eliminated. If the compiler writer thinks that it might be useful to be able to jump from one DIE to its sibling without stepping through each of its children DIEs <1>: DW_TAG_base_type <7>: DW_TAG_formal_parameter DW_AT_name = int DW_AT_name = n (for example, to jump to the next function DW_AT_byte_size = 4 DW_AT_type = <2> in a compilation) then a sibling attribute DW_AT_encoding = signed DW_AT_location = can be added to the DIE. <2>: DW_TAG_typedef (DW_OP_fbreg: 4) DW_AT_name = size_t <8>: DW_TAG_variable A second scheme to compress the data DW_AT_type = <3> DW_AT_name = result is to use abbreviations. Although DWARF <3>: DW_TAG_base_type DW_AT_type = <10> allows great flexibility in which DIEs and DW_AT_name = unsigned int DW_AT_location = attributes it may generate, most compilers DW_AT_byte_size = 4 (DW_OP_fbreg: -28) only generate a limited set of DIEs, all of DW_AT_encoding = unsigned <9>: DW_TAG_variable <4>: DW_TAG_base_type DW_AT_name = len which have the same set of attributes. In­ DW_AT_name = long int DW_AT_type = <2> stead of storing the value of the TAG and DW_AT_byte_size = 4 DW_AT_location = the attribute­value pairs, only an index into DW_AT_encoding = signed (DW_OP_fbreg: -24) a table of abbreviations is stored, followed <5>: DW_TAG_subprogram <10>: DW_TAG_pointer_type by the attribute codes. Each abbreviation DW_AT_sibling = <10> DW_AT_byte_size = 4 gives the TAG value, a flag indicating DW_AT_external = 1 DW_AT_type = <11> DW_AT_name = strndup <11>: DW_TAG_base_type whether the DIE has children, and a list of DW_AT_prototyped = 1 DW_AT_name = char attributes with the type of value it expects. DW_AT_type = <10> DW_AT_byte_size = 1 Figure 9 shows the abbreviation for the for­ DW_AT_low_pc = 0 DW_AT_encoding = mal parameter DIE used in Figure 8b. DIE DW_AT_high_pc = 0x7b signed char in Figure 8 is actually encoded as <6>: DW_TAG_formal_parameter <12>: DW_TAG_pointer_type shown8. This is a significant reduction in DW_AT_name = s DW_AT_byte_size = 4 DW_AT_type = <12> DW_AT_type = <13> the amount of data that needs to be saved DW_AT_location = <13>: DW_TAG_const_type at some expense in added complexity. (DW_OP_fbreg: 0) DW_AT_type = <11> Figure 8b. DWARF description for strndup.c. 8The encoded entry also includes the file and line values which are not shown in Fig. 8b.

Introduction to the DWARF Debugging Format 7 Michael J. Eager ment, the end of the function prolog, or the Abbrev 5: DW_TAG_formal_parameter [no children] start of the function epilog. A set of special DW_AT_name DW_FORM_string DW_AT_decl_file DW_FORM_data1 opcodes combine the most common opera­ DW_AT_decl_line DW_FORM_data1 tions (incrementing the memory address DW_AT_type DW_FORM_ref4 and either incrementing or decrementing DW_AT_location DW_FORM_block1 the source line number) into a single op­ code. ┌──────────────────────────────── abbreviation 5 │ ┌──────────────────────────── ”s” Finally, if a row of the line number ta­ │ │ ┌──────────────────────── file 1 ble has the same source triplet as the previ­ │ │ │ ┌───────────────────── line 41 ous row, then no instructions are generated │ │ │ │ ┌──────────────── type DIE offset for this row in the line number program. │ │ │ │ │ ┌───────── location (fbreg + 0) Figure 10 lists the line number program for │ │ │ │ │ │ ┌───── terminating NUL strndup.c. Notice that only the machine 05 7300 01 29 0000010c 9100 00 addresses that represent the beginning in­ Figure 9. Abbreviation entry and encoded form. struction of a statement are stored. The compiler did not identify the basic blocks in this code, the end of the prolog or the start Less commonly used are features of arguments to a function have been loaded of the epilog to the function. This table is DWARF Version 3 and 4 which allow refer­ or before the function returns. Some pro­ encoded in just 31 bytes in the line number ences from one compilation unit to the cessors can execute more than one instruc­ program. DWARF data stored in another compilation tion set, so there is another column that in­ unit or in a shared . Many compilers dicates which generate the same abbreviation table and set is stored at base types for every compilation, indepen­ the specified dent of whether the compilation actually machine loca­ uses all of the abbreviations or types. These tion. Address File Line Col Stmt Block End Prolog Epilog ISA can be saved in a shared library and refer­ As you enced by each compilation unit, rather than might imag­ being duplicated in each. 0x0 0 42 0 yes no no no no 0 ine, if this ta­ 0x9 0 44 0 yes no no no no 0 ble were 0x1a 0 44 0 yes no no no no 0 Other DWARF Data stored with 0x24 0 46 0 yes no no no no 0 one row for 0x2c 0 47 0 yes no no no no 0 Line Number Table each machine 0x32 0 49 0 yes no no no no 0 instruction, it 0x41 0 50 0 yes no no no no 0 he DWARF line table contains the map­ would be 0x47 0 51 0 yes no no no no 0 Tping between memory addresses that huge. DWARF 0x50 0 53 0 yes no no no no 0 contain the executable code of a program compresses 0x59 0 54 0 yes no no no no 0 and the source lines that correspond to this data by 0x6a 0 54 0 yes no no no no 0 0x73 0 55 0 yes no no no no 0 these addresses. In the simplest form, this encoding it as 0x7b 0 56 0 yes no yes no no 0 can be looked at as a matrix with one col­ sequence of File 0: strndup.c umn containing the memory addresses and instructions File 1: another column containing the source called a line stddef.h triplet (file, line, and column) for that ad­ number pro­ dress. If you want to set a breakpoint at a gram9. These Figure 10. Line Number Table for strndup.c. particular line, the table gives you the instructions memory address to store the breakpoint in­ are interpret­ struction. Conversely, if your program has a ed by a simple finite state machine to recre­ Macro Information fault (say, using a bad pointer) at some lo­ ate the complete line number table. ost debuggers have a very difficult cation in memory, you can look for the time displaying and debugging code The finite state machine is initialized M source line that is closest to the memory which has macros. The user sees the origi­ with a set of default values. Each row in the address. nal source file, with the macros, while the line number table is generated by executing code corresponds to whatever the macros DWARF has extended this with added one or more of the opcodes of the line generated. columns to convey additional information number program. The opcodes are general­ about a program. As a compiler optimizes ly quite simple: for example, add a value to DWARF includes the description of the the program, it may move instructions either the machine address or to the line macros defined in the program. This is around or remove them. The code for a giv­ number, set the column number, or set a quite rudimentary information, but can be en source statement may not be stored as a flag which indicates that the memory ad­ used by a debugger to display the values sequence of machine instructions, but may dress represents the start of an source state­ for a macro or possibly translate the macro be scattered and interleaved with the in­ into the corresponding source language. structions for other nearby source state­ 9 ments. It may be useful to identify the end Calling this a line number program is some­ thing of a misnomer. The program describes of the code which represents the prolog of Call Frame Information much more than just line numbers, such as in­ a function or the beginning of the epilog, so struction set, beginning of basic blocks, end of very processor has a certain way of that the debugger can stop after all of the function prolog, etc. Ecalling functions and passing argu­

Introduction to the DWARF Debugging Format 8 Michael J. Eager ments, usually defined in the ABI. In the resented in only a few bits, this means that nize compilations which define the same simplest case, this is the same for each the data consists mostly of zeros10. type units and eliminate the duplicates. function and the debugger knows exactly DWARF defines a variable length inte­ how to find the argument values and the ger, called Little Endian Base 128 (LEB128 ELF sections return address for the function. or more commonly ULEB for unsigned val­ hile DWARF is defined in a way that For some processors, there may be dif­ ues and SLEB for signed values), which Wallows it to be used with any object ferent calling sequences depending on how compresses these integer values. Since the file format, it's most often used with ELF. the function is written, for example, if there low­order bits contain the data and high­or­ Each of the different kinds of DWARF data are more than a certain number of argu­ der bits consist of all zeros or ones, LEB val­ are stored in their own section. The names ments. There may be different calling se­ ues chop off the low­order seven bits of the of these sections all start with ".debug_". quences depending on operating systems. value. If the remaining bits are all zero or For improved efficiency, most references to Compilers will try to optimize the calling one (sign­extension bits), this is the encod­ DWARF data use an offset from the start of sequence to make code both smaller and ed value. Otherwise, set the high­order bit the data for the current compilation. This faster. One common optimization is having to one, output this byte, and go on to the avoids the need to relocate the debugging a simple function which doesn't call any next seven low­order bits. data, which speeds up program loading and others (a leaf function) use its caller stack debugging. frame instead of creating its own. Another Shrinking DWARF data optimization may be to eliminate a register The ELF sections and their contents are The encoding schemes used by DWARF which points to the current call frame. .debug_abbrev significantly reduce the size of the debug­ Abbreviations used in the Some registers may be preserved across the .debug_info section call while others are not. While it may be ging information compared to an unencod­ ed format like DWARF Version 1. Unfortu­ .debug_aranges A mapping between possible for the debugger to puzzle out all memory address and the possible permutations in calling se­ nately, with many programs the amount of debugging data generated by the compiler compilation quence or optimizations, it is both tedious .debug_frame Call Frame Information and error­prone. A small change in the op­ can become quite large, frequently much timizations and the debugger may no larger than the executable code and data. .debug_info longer be able to walk the stack to the call­ The core DWARF data DWARF offers ways to further reduce containing DIEs ing function. the size of the debugging data. Most strings .debug_line Line Number Program The DWARF Call Frame Information in the DWARF debugging data are actually (CFI) provides the debugger with enough references into a separate .debug_str .debug_loc Location descriptions information about how a function is called section. Duplicate strings can be eliminat­ so that it can locate each of the arguments ed when generating this section. Potential­ .debug_macinfo to the function, locate the current call ly, a linker can merge the .debug_str Macro descriptions frame, and locate the call frame for the sections from several compilations into a calling function. This information is used by single, smaller string section. .debug_pubnames A lookup table for global objects and functions the debugger to "unwind the stack," locat­ Many programs contain declarations .debug_pubtypes ing the previous function, the location which are duplicated in each compilation A lookup table for global where the function was called, and the val­ unit. For example, debugging data describ­ types ues passed. ing many (perhaps thousands) declarations .debug_ranges Address ranges referenced by DIEs Like the line number table, the CFI is of C++ template functions may be repeated encoded as a sequence of instructions that in each compilation. These repeated de­ .debug_str String table used by are interpreted to generate a table. There is scriptions can be saved in separate compila­ .debug_info one row in this table for each address that tion units in uniquely named sections. The .debug_types Type descriptions contains code. The first column contains linker can use COMDAT (common data) the machine address while the subsequent techniques to eliminate the duplicate sec­ columns contain the values of the machine tions. registers when the instruction at that ad­ Many programs reference a large num­ dress is executed. Like the line number ta­ ber of include files which contain many ble, if this table were actually created it type definitions, resulting in DWARF data Summary would be huge. Luckily, very little changes which contains thousands of DIEs for these o there you have it ─ DWARF in a nut­ between two machine instructions, so the types. A compiler can reduce the size of CFI encoding is quite compact. Sshell. Well, not quite a nutshell. The ba­ this data by only generating DWARF for the sic concepts for the DWARF debug informa­ types which are actually used in the compi­ tion are straight­forward. A program is de­ Variable length data lation. With DWARF Version 4, type defini­ scribed as a tree with nodes representing Integer values are used throughout tions can be saved into a separate the various functions, data and types in the DWARF to represent everything from off­ .debug_types section. The compilation source in a compact language­ and ma­ sets into data sections to sizes of arrays or unit contains a DIE which references this chine­independent fashion. The line table structures. In most cases, it isn't possible to separate type unit and a unique 64­bit sig­ provides the mapping between the exe­ place a bound on the size of these values. nature for these types. A linker can recog­ cutable instructions and the source that In a classic data structure each of these val­ 10 An example of this may be seen in the reloca­ generated them. The CFI describes how to ues would be represented using the default tion directory in an object file, where file offset unwind the stack. integer size. Since most values can be rep­ and relocation values are represented by inte­ gers. Most values are have leading zeros.

Introduction to the DWARF Debugging Format 9 Michael J. Eager There is quite a bit of subtlety in The complete DWARF Version 4 Stan­ Acknowledgements DWARF as well, given that it needs to ex­ dard is available for download without press the many different nuances for a wide charge at the DWARF website (dwarfst- I want to thank Chris Quenelle of Sun range of programming languages and dif­ d.org). There is also a mailing list for Microsystems and Ron Brender, formerly of ferent machine architectures. Future direc­ questions and discussion about DWARF. HP, for their comments and advice about a tions for DWARF are to improve the de­ Instructions on registering for the mailing previous version of this paper. Thanks also scription of optimized code so that debug­ list are also on the website. to Susan Heimlich for her many editorial gers can better navigate the code which ad­ comments. vanced compiler optimizations generate.

Generating DWARF with GCC It’s very simple to generate DWARF with gcc. Simply specify the –g option to generate debugging information. The ELF sections can be displayed using objump with the –h option. $ gcc –g –c strndup.c $ objdump –h strndup.o strndup.o: file format elf32-i386

Sections: Idx Name Size VMA LMA File off Algn 0 .text 0000007b 00000000 00000000 00000034 2**2 CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE 1 .data 00000000 00000000 00000000 000000b0 2**2 CONTENTS, ALLOC, LOAD, DATA 2 .bss 00000000 00000000 00000000 000000b0 2**2 ALLOC 3 .debug_abbrev 00000073 00000000 00000000 000000b0 2**0 CONTENTS, READONLY, DEBUGGING 4 .debug_info 00000118 00000000 00000000 00000123 2**0 CONTENTS, RELOC, READONLY, DEBUGGING 5 .debug_line 00000080 00000000 00000000 0000023b 2**0 CONTENTS, RELOC, READONLY, DEBUGGING 6 .debug_frame 00000034 00000000 00000000 000002bc 2**2 CONTENTS, RELOC, READONLY, DEBUGGING 7 .debug_loc 0000002c 00000000 00000000 000002f0 2**0 CONTENTS, READONLY, DEBUGGING 8 .debug_pubnames 0000001e 00000000 00000000 0000031c 2**0 CONTENTS, RELOC, READONLY, DEBUGGING 9 .debug_aranges 00000020 00000000 00000000 0000033a 2**0 CONTENTS, RELOC, READONLY, DEBUGGING 10 .comment 0000002a 00000000 00000000 0000035a 2**0 CONTENTS, READONLY 11 .note.GNU-stack 00000000 00000000 00000000 00000384 2**0 CONTENTS, READONLY

Printing DWARF using Readelf Readelf can display and decode the DWARF data in an object or executable file. The options are -w ─ display all DWARF sections -w[liaprmfFso] ─ display specific sections l ─ line table i ─ debug info a ─ abbreviation table p ─ public names r ─ ranges m ─ macro table f ─ debug frame (encoded) F ─ debug frame (decoded) s ─ string table o ─ location lists

The DWARF listing for all but the smallest programs is quite voluminous, so it would be a good idea to direct readelf’s output to a file and then browse the file with less or an editor such as vi.

Introduction to the DWARF Debugging Format 10 Michael J. Eager Introduction to the DWARF Debugging Format 11 Michael J. Eager