
PROGRAMMING LANGUAGES & TOOLS

Volume 10 Number 1, 1998

Editorial
Jane Blake, Managing Editor
Kathleen M. Stetson, Editor
Helen L. Patterson, Editor

Circulation
Kristine M. Lowe, Administrator

Production
Christa W. Jessica, Production Editor
Elizabeth McGrail, Typographer
Peter R. Woodbury, Illustrator

Advisory Board
Thomas F. Gannon, Chairman (Acting)
Scott E. Cutler
Donald Z. Harbert
William A. Laing
Richard F. Lary
Alan G. Nemeth
Robert M. Supnik

The Digital Technical Journal is a refereed journal published quarterly by Compaq Computer Corporation, 550 King Street, LKG1-2/W7, Littleton, MA 01460-1289.

Hard-copy subscriptions can be ordered by sending a check in U.S. funds (made payable to Compaq Computer Corporation) to the published-by address. General subscription rates are $40.00 (non-U.S. $60) for four issues and $75.00 (non-U.S. $115) for eight issues. University and college professors and Ph.D. students in the electrical engineering and computer science fields receive complimentary subscriptions upon request. Compaq customers may qualify for gift subscriptions and are encouraged to contact their sales representatives.

Electronic subscriptions are available at no charge by accessing URL http://www.digital.com/subscription. This service will send an electronic mail notification when a new issue is available on the Internet.

Single copies and back issues can be ordered by sending the requested issue's volume and number and a check for $16.00 (non-U.S. $18) each to the published-by address. Recent issues are also available on the Internet at http://www.digital.com/dtj.

Compaq employees may order subscriptions through Readers Choice at URL http://webrc.das.dec.com.

Inquiries, address changes, and complimentary subscription orders can be sent to the Digital Technical Journal at the published-by address or the electronic mail address, [email protected]. Inquiries can also be made by calling the Journal office at 978-506-6858.

Comments on the content of any paper and requests to contact authors are welcomed and may be sent to the managing editor at the published-by or electronic mail address.

AlphaServer, Compaq, the Compaq logo, DEC, DIGITAL, the DIGITAL logo, ULTRIX, VAX, and VMS are registered in the U.S. Patent and Trademark Office.

DIGITAL, FX!32, and OpenVMS are trademarks of Compaq Computer Corporation.

Intel and Pentium are registered trademarks of Intel Corporation.

IRIX is a registered trademark of Silicon Graphics, Inc.

Microsoft, Visual C++, Windows, and Windows NT are registered trademarks of Microsoft Corporation.

MIPS is a registered trademark of MIPS Technologies, Inc.

NULLSTONE is a trademark of Nullstone Corporation.

Rogue Wave and .h++ are registered trademarks of Rogue Wave Software, Inc.

RS/6000 is a registered trademark of International Business Machines Corporation.

Solaris is a registered trademark of Sun Microsystems, Inc.

SPARC is a registered trademark of SPARC International, Inc.

SPEC and SPECint are registered trademarks of Standard Performance Evaluation Corporation.

UNIX is a registered trademark in the United States and in other countries, licensed exclusively through X/Open Company Ltd.

Other product and company names mentioned herein may be trademarks and/or registered trademarks of their respective owners.

Copyright © 1998 Compaq Computer Corporation. Copying without fee is permitted provided that such copies are made for use in educational institutions by faculty members and are not distributed for commercial advantage. Abstracting with credit of Compaq Computer Corporation's authorship is permitted.

The information in the Journal is subject to change without notice and should not be construed as a commitment by Compaq Computer Corporation or by the companies herein represented. Compaq Computer Corporation assumes no responsibility for any errors that may appear in the Journal.

ISSN 0898-901X

Documentation Number EC-P9706-18

Book production was done by Quantic Communications, Inc.

Cover Design
This special issue of the Journal focuses on Programming Languages & Tools, specifically on software. For the cover, we have chosen the alchemist who transforms common elements into precious gold to represent the compiler developer who transforms code to extract the highest performance possible for software applications.

The cover was designed by Lucinda O'Neill of the Compaq Industrial and Graphic Design Group.

December 1998

A letter to readers of the Digital Technical Journal

This issue is the last Digital Technical Journal to be published. Since 1985, the Journal has been privileged to publish information about significant engineering accomplishments for DIGITAL, including standards-setting network and storage technologies, industry-leading VAX systems, record-breaking Alpha microprocessors and semiconductor technologies, and advanced application software and performance tools. The Journal has been rewarded by continual growth in the number of readers and by their expressions of appreciation for the quality of content and presentation.

The editors thank the engineers who somehow made the time to write, the engineering managers who supported them, the consulting engineers and professors who reviewed manuscripts and made the process a learning experience for all of us, and, of course, the readers who are the reason the Journal came into existence 13 years ago.

With kind regards,

Jane Blake, Managing Editor

Kathleen Stetson, Editor

Helen Patterson, Editor

Digital Technical Journal Volume 10 Number 1

Contents

Introduction
    C. Robert Morgan, Guest Editor    2

Foreword
    William C. Blake    4

Tracing and Characterization of Windows NT-based System Workloads
    Jason P. Casmira, David P. Hunter, and David R. Kaeli    6

Automatic Template Instantiation in DIGITAL C++
    Avrum E. Itzkowitz and Lois D. Foltan    22

Measurement and Analysis of C and C++ Performance
    Hemant G. Rotithor, Kevin W. Harris, and Mark W. Davis    32

Alias Analysis in the DEC C and DIGITAL C++ Compilers
    August G. Reinig    48

Compiler Optimization for Superscalar Systems: Global Instruction Scheduling without Copies
    Philip H. Sweany, Steven M. Carr, and Brett L. Huber    58

Maximizing Multiprocessor Performance with the SUIF Compiler
    Mary W. Hall, Jennifer M. Anderson, Saman P. Amarasinghe, Brian R. Murphy, Shih-Wei Liao, Edouard Bugnion, and Monica S. Lam    71

Debugging Optimized Code: Concepts and Implementation on DIGITAL Alpha Systems
    Ronald F. Brender, Jeffrey E. Nelson, and Mark E. Arsenault    81

Differential Testing for Software
    William M. McKeeman    100

Introduction

C. Robert Morgan
Senior Consulting Engineer and Technical Program Manager, Core Technology Group

The complexity of high-performance systems and the need for ever-increased performance to be gained from those systems creates a challenge for engineers, one that requires both experience and innovation in the development of software tools. The papers in this issue of the Journal are a few selected examples of the work performed within Compaq and by researchers worldwide to advance the state of the art. In fact, Compaq supports relevant research in programming languages and tools.

Compaq has been developing high-performance tools for more than thirty years, starting with the compiler for the DIGITAL PDP-10, introduced in 1967. Later compilers and tools for VAX computer systems, introduced in 1977, made the VAX system one of the most usable in history. The compilers and debugger for VAX/VMS are exemplary. With the introduction of the VAX successor in 1992, the 64-bit RISC Alpha systems, Compaq has continued the tradition of developing advanced tools that accelerate application performance and usability for system users. The papers, however, represent not only the work of Compaq engineers but also that of researchers and academics who are working on problems and advanced techniques of interest to Compaq.

The paper on characterization of system workloads by Casmira, Hunter, and Kaeli addresses the capture of basic data needed for the development of tools and high-performance applications. The authors' work focuses on generating accurate profile and trace data on machines running the Windows NT operating system. Profiling describes the point in the program that is most frequently executed. Tracing describes the commonly executed sequence of instructions. In addition to helping developers build more efficient applications, this information assists designers and implementers of future Windows NT systems.

Every compiler consists of two components: the front end, which analyzes the specific language, and the back end, which generates optimized instructions for the target machine. An efficient compiler is a balance of both components. As languages such as C++ evolve, the compiler front end must also evolve to keep pace. C++ has now been standardized, so evolutionary changes will lessen. However, compiler developers must continue to improve front-end techniques for implementing the language to ensure ever better application performance. An important feature of C++ compiler development is C++ templates. Templates may be implemented in multiple ways, with varying effects on application programs. The paper by Itzkowitz and Foltan describes Compaq's efficient implementation of templates. On a related subject, Rotithor, Harris, and Davis describe a systematic approach Compaq has developed for monitoring and improving C++ compiler performance to minimize cost and maximize function and reliability.

Improved optimization techniques for compiler back ends are presented in three papers. In the first of these, Reinig addresses the requirement in an optimizing compiler for an accurate description of the variables and fields that may be changed by an assignment operation, and describes an efficient technique used in the C/C++ compilers for gathering this information. Sweany, Carr, and Huber describe techniques for increasing execution speed in processors like the Alpha that issue multiple instructions simultaneously. The technique reorders the instructions in the program to increase the number of instructions that are simultaneously issued. Maximizing the performance of multiprocessor systems is the subject of the paper by Hall et al., which was previously published in IEEE Computer and updated with an addendum for this issue. The authors describe the SUIF compiler, which represents some of the best research in this area and has become the basis of one part of the ARPA compiler infrastructure project. Compaq assisted researchers by providing the DIGITAL Fortran compiler front end and an AlphaServer 8400 system.

As compilers become more effective in increasing application program performance, the ability to debug the programs becomes more difficult. The difficulty arises because the compiler gains efficiency by reordering and eliminating instructions. Consequently, the instructions for an application program are not easily identifiable as part of any particular statement. The debugger cannot always report to the application program where variables are stored or what statement is currently being executed. Application programmers have two choices: debug an unoptimized version of the program or find some other technique for determining the state of the program. The paper by Brender, Nelson, and Arsenault reports an advanced development project at Compaq to provide techniques for the debugger to discover a more accurate image of the state of the program. These techniques are currently being added to Compaq debuggers.

One of the problems that tool developers face is increasing tool reliability. Tool developers, therefore, test the code. However, developers are often biased; they know how their programs operate, and they test certain aspects of the code but not others. The paper by McKeeman describes a technique called differential testing that generates correct random tests of tools such as compilers. The random nature of the tests removes the developers' bias. The tool can be used for two purposes: to improve existing Compaq tools and to compare the reliability of competitive tools.

The High Performance Technical Computing Group and the Core Technology Group within Compaq are pleased to help develop this issue of the Journal. Studying the work performed within Compaq and by other researchers worldwide is one way that we remain at the cutting edge of technology of programming language, compiler, and programming tool research.

Foreword

William C. Blake
Director, High Performance Technical Computing and Core Technology Groups

You might think that the cover of this issue of the Digital Technical Journal is a bit odd. After all, what could be the relevance of those ancient alchemists in the drawing to the computer-age topic of programming languages and tools? Certainly, both alchemists and programmers work busily on new tools. An even more interesting metaphorical connection is the alchemist and the compiler software developer as creators of tools that transform (transmute, in the strict sense of alchemy) the base into the precious. The metaphor does, however, break down. Unlike the myth and folklore of alchemy, the science and technology of compiler software development is a real and important part of processing a new solution or algorithm into the correct and highest performance set of actual machine instructions. This issue of the Journal addresses current, state-of-the-art work at Compaq Computer Corporation on programming languages and tools.

Gone are the days when programmers plied their craft "close to the machine," that is, working in detailed machine instructions. Today, system designers and application developers, driven by the pressures of time to market and technical complexity, must express their solutions in terms "close to the programmer" because people think best in ways that are abstract, language dependent, and machine independent. Enhancing the characteristics of an abstract high-level language, however, conflicts with the need for lower level optimizations that make the code run fastest. Computers still require detailed machine instructions, and the high-level programs close to the programmer must be correctly compiled into those instructions. This semantic gap between programming languages and machine instructions is central to the evolution of compilers and to microprocessor architectures as well. The compiler developer's role is to help close the gap by preserving the correctness of the compilation and at the same time resolving the trade-offs between the optimizations needed for improvements "close to the programmer" and those needed "close to the machine."

To put the work described in this journal into context, it is helpful to think about the changes in compiler requirements over the past 15 years. It was in the early 1980s that the direction of future computer architectures changed from increasingly complex instruction sets, CISC, that supported high-level languages to computer architectures with much simpler, reduced instruction sets, RISC. Three key research efforts led the way: the Berkeley RISC processor, the IBM 801 RISC processor, and the Stanford MIPS processor. All three approaches dramatically reduced the instruction set and increased the clock rate. The RISC approach promised improvements up to a factor of five compared with CISC machines using the same manufacturing technology. Compaq's transition from the VAX to the Alpha 64-bit RISC architecture was a direct result of the new architectural trend. As a consequence of these major architectural changes, compilers and their associated tools became significantly more important. New, much more complex compilers for RISC machines eliminated the need for the large, microcoded CISC machines. The complexities of high-level language processing moved from the petrified software of CISC microprocessors to a whole new generation of optimizing compilers. This move caused some to claim that RISC really stands for "Relegate Important Stuff to Compilers."

The introduction of the third-generation Alpha microprocessor, the 21264, demonstrates that the shift to RISC and Alpha system implementations and compilers served Compaq customers well by producing reliable, accurate, and high-performance computers. In fact, Alpha systems, which have the ability to process over a billion 64-bit floating-point numbers per second, perform at levels formerly attained only by specialized supercomputers. It is not surprising that the Alpha microprocessor is the most frequently used microprocessor in the top 500 largest supercomputing sites in the world.

After reading through the papers in this issue, you may wonder what is next for compilers and tools. As physical limits curtail the shrinking of silicon feature sizes, there is not likely to be a repeat of the performance gains at the microprocessor level, so attention will turn to compiler technology and computer architecture to deliver the next thousandfold increase in sustained application performance. The two principal laws that affect dramatic application performance improvements are Moore's Law and Amdahl's Law. Moore's Law states that performance will double each 18 months due to semiconductor process scaling; and Amdahl's Law expresses the diminishing returns of various system speedup enhancements. In the next 15 years, Moore's Law may be stopped by the physical realities of scaling limits. But Amdahl's Law will be broken as well, as improvements in parallel language, tool development, and new methods of achieving parallelism will positively affect the future of compilers and hence application performance.

As you will see in papers in this issue, there is a new emphasis on increasing execution speed by exploiting the multiple instruction issue capability of Alpha microprocessors. Improvements in execution speed will accelerate dramatically as future compilers exploit performance improvement techniques using new capabilities evolved in Alpha. Compilers will deliver new ways of hiding instruction latency (reducing the performance gap between vector processors and RISC superscalar machines), improved unrolling and optimization of loops, instruction reordering and scheduling, and ways of dealing with parallel decomposition and data layout in nonuniform memory architectures. The challenges to compiler and tool developers will undoubtedly increase over time.

By not relying on hardware improvements to deliver all the increases in performance, compiler wizards are making their own contributions, always watchful of correctness first, then run-time performance, and, finally, speed and efficiency of the software development process itself.
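Amdahl's Law, cited above, lends itself to a quick back-of-the-envelope check. The short sketch below is illustrative only; the 95% parallelizable workload is an assumed figure, not one taken from this issue:

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Overall speedup when only part of the work benefits from N processors."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)

# A workload that is 95% parallelizable (assumed, for illustration):
for n in (2, 8, 64, 1_000_000):
    print(n, round(amdahl_speedup(0.95, n), 2))
# The serial 5% caps the speedup near 1/0.05 = 20x, however large n grows.
```

This is the "diminishing returns" the Foreword refers to: once the parallel portion is saturated, only reducing the serial fraction itself, through better languages, tools, and methods of achieving parallelism, moves the ceiling.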


Tracing and Characterization of Windows NT-based System Workloads

Jason P. Casmira, David P. Hunter, and David R. Kaeli

To optimize the design of pipelines, branch predictors, and cache memories, computer architects study the characteristics of benchmark programs by examining traces, i.e., samples of program execution. Since commercial desktop applications are increasingly dependent on services and application programming interfaces provided by the host operating system, the authors argue that traces from benchmark execution must capture operating system execution in addition to native application execution. Common benchmark-based workloads, however, lack operating system execution. This paper discusses the ongoing joint efforts of the Northeastern University Computer Architecture Research Laboratory and Compaq Computer Corporation's Advanced and Emerging Technologies Advanced Development Group to capture operating system-rich traces on Alpha-based machines running the Windows NT operating system. The authors describe the latest PatchWrx software toolset and demonstrate its trace-generating capabilities by characterizing numerous applications. Included is a discussion of the fundamental differences between using traces captured from common benchmark programs and using those captured on commercial desktop applications. The data presented demonstrates that operating system execution can dominate the overall execution time of desktop applications such as Microsoft Word, Microsoft Visual C/C++, and Microsoft Internet Explorer and that the characteristics of the operating system instruction stream can be quite different from those typically found in benchmarking workloads.

The computer architecture research community commonly uses trace-driven simulation in pursuing answers to a variety of design issues. Architects spend a significant amount of time studying the characteristics of benchmark programs by examining traces, i.e., samples taken from program execution. Popular benchmark programs include the SPEC¹ and the BYTEmark² benchmark test suites. Since the underlying assumption is that these programs generate workloads that represent user applications, today's computer designs have been optimized based on the characteristics of these benchmark programs.

Although the authors of popular benchmarks are well intentioned, the resulting workloads lack operating system execution and consequently do not represent some of the most prevalent desktop applications, e.g., Microsoft Word, Microsoft Visual C/C++, and Microsoft Internet Explorer. Such applications make heavy use of application programming interfaces (APIs), which in turn execute many instructions in the operating system. As a result, the overall performance of many desktop applications depends on efficient operating system interaction. Clearly, operating system overhead can greatly reduce the benefits of a new computer design feature. Past architectural studies, however, have generally ignored operating system interaction because few tools can generate operating system-rich traces.

This paper discusses the ongoing joint efforts of Northeastern University and Compaq Computer Corporation to capture operating system-rich traces on DIGITAL Alpha-based machines running the Microsoft Windows NT operating system. We argue that for traces of today's workloads to be accurate, they must capture operating system execution as well as the native application execution. This need to capture complete program trace information has been a driving force behind the development and use of software tools such as the PatchWrx dynamic execution-tracing toolset, which we describe in this paper.

The PatchWrx toolset was originally developed by Sites and Perl at Digital Equipment Corporation's Systems Research Center. They described PatchWrx, as developed for Windows NT version 3.5, in "Studies of Windows NT Performance Using Dynamic Execution Traces."³ The Northeastern University Computer Architecture Research Laboratory and Compaq's Advanced and Emerging Technologies Advanced Development Group continue to develop the toolset. We have updated the framework to operate under Windows NT version 4.0, added the ability to trace programs that have code sections larger than 4 megabytes (MB), added multiple trace buffer sizes, and developed additional postprocessing tools.

After briefly discussing related tracing tools, we describe the PatchWrx toolset and specify the new features we have added. We then analyze PatchWrx traces captured on Windows NT version 4.0, demonstrating the capabilities of the tool while illustrating the importance of capturing operating system-rich traces. In the final section, we summarize the paper, discuss the current limitations of the toolset, and suggest new directions for development and study.

Trace Generation Tools

Trace-driven simulation has been the method of choice for evaluating the merits of various architectural trade-offs.⁴,⁵ Traces captured from the system under test are recorded and replayed through a model of the proposed design. Computer architecture researchers have proposed methodologies that capture both application and operating system references. These tools include hardware-based⁶⁻¹⁰ and software-based¹¹⁻¹⁵ methods. Some of the issues involved in capturing operating system-rich traces are

1. Tracing overhead (system slowdown)
2. Accuracy (perturbation of the memory address space)
3. Completeness (capturing all desired information, e.g., the operating system reference stream)

Table 1 contains a list of 10 tracing tools that have been developed over the past 10 to 15 years. Although far from complete, this list provides a sample of the tools that have been used to generate input to a variety of trace-driven simulation studies. We have characterized each tool in terms of the three issues (criteria) previously mentioned. Table 1 lists the target platform(s) for each tracing tool.

Note that many of these tools cannot capture operating system activity. For those that can, their associated slowdown can significantly affect the accuracy of the captured trace. Of the tools that provide this capability, PatchWrx introduces the least amount of slowdown yet maintains the integrity of the address space. The next section discusses the PatchWrx toolset.

Table 1  Sample of Tracing Tools

Name         Average Slowdown   Address Perturbation   Operating System Activity   Platform
ATOM¹³       10X to 100X        No                     Yes                         DIGITAL Alpha UNIX
ATUM¹⁶       20X                No                     Yes                         DIGITAL VAX OpenVMS
EEL¹⁷        10X to 100X        Yes                    No                          SPARC Solaris
Etch¹⁸       35X                Yes                    No                          Intel Microsoft Windows NT V4.0
NT-Atom¹⁹    10X to 100X        No                     No                          DIGITAL Alpha Windows NT V4.0
PatchWrx³    4X                 No                     Yes                         DIGITAL Alpha Microsoft Windows NT V4.0
Pixie²⁰      10X to 100X        Yes                    No                          DIGITAL MIPS ULTRIX
QPT¹²        10X to 100X        Yes                    No                          SPARC Solaris, DIGITAL ULTRIX
Shade²¹      6X                 No                     No                          SPARC Solaris
SimOS¹⁴      10X to 50,000X     No                     Yes                         DIGITAL Alpha UNIX, SGI IRIX, SPARC Solaris

PatchWrx

PatchWrx is a dynamic execution-tracing toolset developed for use on the Alpha-based Microsoft Windows NT operating system. The toolset utilizes the Privileged Architecture Library (PAL) facility, also referred to as PALcode, of the Alpha microprocessor to perform tracing with minimal overhead. PatchWrx can instrument, i.e., patch, all Windows NT application and system binary images, including the kernel, operating system services, drivers, and shared libraries. The PAL facility is a set of architected functions and instructions that provides a consistent interface to a set of complex system functions. These routines provide primitives for memory management, context switching, interrupts, and exceptions.

PatchWrx and the Alpha PAL Routines

The PatchWrx software tool is made possible through the PAL facility used by DIGITAL Alpha microprocessors. PAL routines have access to physical memory and internal hardware registers and operate with interrupts disabled. PALcode is loaded from disk at system boot time. We modified and extended the shrink-wrapped Alpha PALcode on a DIGITAL Alpha 21064-based system to support the PatchWrx operations. The modified PatchWrx PAL routines serve two major purposes: (1) to reserve the trace buffer at system boot time and (2) to log trace entries at trace time.

One way that PatchWrx maintains a low operating overhead is to store the captured trace in a physical memory buffer, which is reserved at boot time. The size of the buffer can be varied depending on the amount of physical memory installed on the system. Since we use PAL routines to reserve this memory, the operating system is not aware that the memory exists because the PALcode performs all low-level system initialization before the operating system is started. PatchWrx logs all trace entries in this buffer. Writing trace entries directly to physical memory has several advantages. First, writing to memory is much faster than writing to disk or to tape. Second, using physical memory allows tracing of the lowest levels of the operating system (i.e., the page fault handler) without generating page faults. Third, using physical memory allows tracing across multiple threads running in multiple address spaces regardless of which address space is currently running.

To enable PatchWrx to operate under Windows NT versions 3.51 and 4.0, we started with the PAL routines modified by Sites and Perl³ and made additional modifications as required by the operating system versions. These modifications were concentrated in the process data structures. The PatchWrx-specific PAL routines are listed in Table 2. The first three routines are used for reading the trace entries from the buffer and for turning tracing on and off. The remaining five routines are used to log trace entries based on the type of instruction instrumented.

PatchWrx Image Instrumentation

Next we describe how we use PatchWrx to instrument Microsoft Windows NT images. Patching the operating system involves the instrumentation of all the binary images, including applications, operating system executables, libraries, and kernel. Once patching is complete, trace entries are logged by means of PAL routines as images execute.

We define a patched instruction as an instruction within an image's code section that is overwritten with an unconditional branch (BR) to a patch. The target of the BR contains the patch section. The patch section includes the trap (CALL_PAL) to the appropriate PAL routine that logs a trace entry corresponding to the type of instruction patched and the return branch to the original target.

PatchWrx does not modify the original binary images; instead, it generates new images that contain patches. This operation preserves the original images on the system in case they need to be restored.

Instrumentation involves replacing all branching instructions of type unconditional branch, conditional branch (e.g., branch if equal to zero [BEQ]), branch to subroutine (BSR), function return (RET), jump (JMP), and jump to subroutine (JSR) within an image's code section with unconditional branches to a patch section. If loads and stores are also traced, PatchWrx replaces these instructions (e.g., load sign-extended longword [LDL]) with unconditional branches to the patch section, where the original load or store instruction is copied. A return branch is also needed to return control flow to the instruction subsequent to the original load. When PatchWrx encounters this patch, the tool records the register value of the original load or store instruction in the trace log. The patch section contains all the patches for the image and is added to the rewritten image. Figure 1 shows examples of patched instructions. PatchWrx replaces only branch instructions within an image to reduce the type and number of entries logged in the trace buffer. Using these traced branches, the tool can later reconstruct the basic blocks they represent.

As shown in Figure 1, PatchWrx replaces BR and JMP instructions with BR instructions that transfer control to the patch section. The original BR or JMP instruction is repeated in the patch section for the purpose of recording the value of the target register (if necessary) into the trace buffer when the patched image is executed. This register value is necessary for reconstructing the traced instruction stream.

Table 2  PatchWrx-specific PAL Routines

PAL Routine   Function
PWRDENT       Read a trace entry from trace memory
PWPEEK        Read an arbitrary location (for debug)
PWCTRL        Initialize, turn tracing on/off
PWBSR         Record a branch to subroutine
PWJSR         Record a jump/call/return
PWLDST        Record a load/store base register value
PWBRT         Record a conditional branch taken bit
PWBRF         Record a conditional branch fall-through bit
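The division of labor behind Table 2, in which short traps append compact branch records to a reserved buffer and a later pass expands those records back into a full instruction stream, can be sketched in miniature. The sketch below is an illustrative model only: the entry format, the dictionary-as-image representation, and all addresses are invented for the example, and the real toolset logs from PALcode into physical memory rather than a Python list.

```python
# Miniature model of branch-only tracing and later reconstruction.
# Assumptions (not the real PatchWrx formats): the static image is a
# dict of address -> mnemonic, instructions are 4 bytes, and each taken
# control transfer is logged as a (kind, branch_addr, target_addr) tuple.

TRACE_CAPACITY = 4096
trace = []

def pw_log(kind, branch_addr, target_addr):
    """Stand-in for the PWBSR/PWJSR/PWBRT/PWBRF logging traps."""
    if len(trace) < TRACE_CAPACITY:          # a full buffer stops tracing
        trace.append((kind, branch_addr, target_addr))

def reconstruct(image, start_pc):
    """Replay the trace, filling in straight-line code from the image."""
    stream, pc = [], start_pc
    for _kind, branch_addr, target_addr in trace:
        while pc != branch_addr:             # run inside a basic block
            stream.append(image[pc])
            pc += 4                          # fixed 4-byte Alpha instructions
        stream.append(image[pc])             # the traced branch itself
        pc = target_addr
    return stream

# Tiny example image: a block ending in a taken BEQ at 0x08.
image = {0x00: "addq", 0x04: "ldl", 0x08: "beq",
         0x20: "stl", 0x24: "ret"}
pw_log("PWBRT", 0x08, 0x20)                  # the BEQ was taken to 0x20
print(reconstruct(image, 0x00))              # ['addq', 'ldl', 'beq']
```

Only the branch is logged; the two straight-line instructions before it are recovered from the image itself, which is why branch-only tracing keeps both the buffer footprint and the run-time slowdown small.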

ORIGINAL CODE                 PATCHED CODE

EXAMPLE 1
JMP  ZERO, (R19)              BR   PATCH.001

                              PATCH.001:   CALL_PAL PWJSR
                                           JMP  ZERO, (R19)

EXAMPLE 2
JSR  R26, (R19)               BSR  R26, PATCH.002

                              PATCH.002:   CALL_PAL PWJSR
                                           JMP  ZERO, (R19)

EXAMPLE 3
BEQ  R3, TARGET.003           BR   PATCH.003
                              BACK.003:

                              PATCH.003:   BEQ  R3, PATCH.003T
                                           CALL_PAL PWBRF
                                           BR   BACK.003
                              PATCH.003T:  CALL_PAL PWBRT
                                           BR   TARGET.003

EXAMPLE 4
LDL  R20, 4(R16)              BR   PATCH.004
                              BACK.004:

                              PATCH.004:   CALL_PAL PWLDST
                                           LDL  R20, 4(R16)
                                           BR   BACK.004
Figure 1 Instruction Patch Examples
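As a sketch of how such a replacement branch could be encoded, the helper below follows the Alpha branch field layout (opcode, register, 21-bit word displacement); the function itself is ours for illustration and is not a PatchWrx routine:

```python
BR_OPCODE = 0x30   # Alpha BR opcode; BSR is 0x34
ZERO_REG = 31      # R31 reads as zero and is used as the Ra of a plain BR

def make_branch_word(opcode, ra, from_pc, to_pc):
    """Encode a 32-bit Alpha branch at from_pc targeting to_pc:
    opcode in bits 31-26, Ra in bits 25-21, and a 21-bit signed
    word displacement relative to from_pc + 4 in bits 20-0."""
    disp = (to_pc - (from_pc + 4)) // 4
    assert -(1 << 20) <= disp < (1 << 20), "patch beyond the 4-MB branch range"
    return (opcode << 26) | (ra << 21) | (disp & 0x1FFFFF)
```

Because every Alpha instruction is one 32-bit word, the patcher can overwrite the original branch with this word in place, without moving any surrounding code.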

replaces JSR and BSR instructions with BSR patches. This replacement preserves the return address (RA) register field value, which contains the return address for the subroutine. Again, the original instruction is repeated in the patch section for register value recording during tracing to help facilitate reconstruction.

Conditional branches have a larger and more complex patch than the other branch types because the original condition is duplicated and resolved within the patch. The taken or fall-through path generates a bit value when logged within the taken or fall-through trace entry. The return branch in the patch section is a replica of the original conditional branch.

As explained earlier, for all patches, PatchWrx replaces the original branch with an unconditional branch to the patch. Since Alpha instructions are equal in size, this replacement process allows patching without increasing the code size within the image. Although the code size remains unchanged, the image size will increase in proportion to the number of patches added. This image size change becomes an issue for dynamically linked library (DLL) images.

Patching Dynamic Link Libraries

The Microsoft Windows NT operating system provides a memory management system that allows sharing between processes. For example, two processes that edit text files can share the application image that has been mapped into memory. When the first process invokes the editor, the operating system loads the application into memory and maps the process's virtual address space to it. When the second process invokes the editor, rather than load another editor image, the operating system maps the second process's virtual address space to the physical pages that contain the editor. Of course, both processes contain local storage for private data.

DLLs are loaded into memory and shared in this manner. When patches are added to a DLL, the size of the image increases. When this image is mapped to

physical memory (as per its preferred base load address), the larger image may overlap with another image having a base address within the new range. This image overlap can prevent the operating system from booting properly: some environment DLLs will conflict in memory because they perform calls directly into other DLLs at fixed offsets. To resolve this issue, we rebase[24] the preferred base load addresses of the patched DLLs, which modifies the base load addresses of each patched DLL to eliminate conflicts. Rebasing affects the address accuracy of the patched system, though we are able to readjust the addresses during reconstruction. An increase in the paging activity may also be observed since the additional code may cross page boundaries.

The original version of the PatchWrx toolset was developed on Microsoft Windows NT version 3.5. When versions 3.51 and 4.0 were released, several modifications were made to the image format. In completing the 3.51- and 4.0-compatible versions of PatchWrx, we had to address this issue. One change that affected how we patch was the placement of the Import Address Table (IAT) into the front of the initial code section of executable binary images. This table is used to look up the addresses of DLL procedures used (i.e., imported) by the executable binary. In developing the current generation of PatchWrx, we had to make modifications to use image header fields that had previously remained unused or reserved, indicating the executable code sections that contained data areas.

Another issue that we addressed in the recent modifications to PatchWrx was long branches. The original version of PatchWrx replaces all branch, jump, call, and return instructions with either BR or BSR instructions to the patch section. Since the PatchWrx tool has no information about machine state during the patching phase, it is impossible to utilize other branching instructions (e.g., JMP or JSR instructions) to provide this branch-to-patch transition. Register and register-indirect branching instructions would require perturbing the machine state. Therefore, the developers could use only program counter (PC)-based offset branching instructions.

As discussed previously, in replacing a control flow instruction with a patch branch, PatchWrx uses a BR or BSR instruction in which the offset field is set to branch to the corresponding patch within the image's patch section. The Alpha architecture branching instructions use the format shown in Figure 2. The branch target virtual address computation for this format is newPC = (oldPC + 4) + (4 * sign-extended(21-bit branch displacement)). The register field holds the return address for BSRs. With this branch format and target virtual address computation, the Alpha architecture provides a branch target range of ±4 MB from an instruction's current PC.

Several applications that run today on Microsoft Windows NT version 4.0 are sufficiently large that the displacement between a control flow instruction to be patched and the patch location within the patch section exceeds this 4-MB limit. (Recall that since we want to avoid moving code or data sections, the patch section is placed at the end of the image.) To address this problem, we developed two new branch instructions for use with PatchWrx. These new branches were not implemented in the instruction set architecture of the Alpha architecture; instead, we used PALcode to implement them. The two new branches are designated long branch (LBR) and long branch subroutine (LBSR). Figure 3 illustrates the format of these two instructions.

The computation of the target virtual address is newPC = (oldPC + 4) + (4 * sign-extended(25-bit branch displacement)) for LBR branches and newPC = (oldPC + 4) + (32 * zero-extended(20-bit branch displacement)) for LBSR branches. PatchWrx uses LBRs when patching any control flow instruction that has a displacement greater than 4 MB. PatchWrx uses LBSRs similarly for control flow instructions that must preserve the register field value.

When an LBR or LBSR instruction is executed within the image code section, a trap to PALcode occurs. Normally, CALL_PAL instructions have one of several defined function fields that cause a corresponding PAL routine to be executed. The two long branch instructions have function fields that do not belong to any of the defined CALL_PAL instructions and therefore force an illegal instruction exception within the PALcode. This PALcode flow has been modified to detect if a long branch has been encountered.

  OPCODE (bits 31-26) | REG (bits 25-21) | 21-BIT DISPLACEMENT (bits 20-0)

Figure 2 Alpha Branch Instruction Format

  LBR INSTRUCTION FORMAT:  PAL OPCODE | 25-BIT DISPLACEMENT
  LBSR INSTRUCTION FORMAT: PAL OPCODE | 20-BIT DISPLACEMENT

Figure 3 PALcode Long Branch Instruction Formats
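The three target-address computations described above can be sketched as follows (the helper names are ours, not part of the PatchWrx toolset):

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's complement number."""
    mask = 1 << (bits - 1)
    value &= (1 << bits) - 1
    return (value ^ mask) - mask

def branch_target(pc, disp21):
    """Standard Alpha branch: newPC = (oldPC + 4) + (4 * sign-extended disp)."""
    return (pc + 4) + 4 * sign_extend(disp21, 21)

def lbr_target(pc, disp25):
    """PatchWrx LBR: 25-bit signed displacement, giving a +/-64-MB range."""
    return (pc + 4) + 4 * sign_extend(disp25, 25)

def lbsr_target(pc, disp20):
    """PatchWrx LBSR: 20-bit zero-extended displacement scaled by 32,
    giving a forward-only +32-MB range."""
    return (pc + 4) + 32 * (disp20 & 0xFFFFF)
```

The multipliers make the ranges easy to check: 4 * 2^20 bytes is the standard ±4-MB range, 4 * 2^24 is the ±64-MB LBR range, and 32 * 2^20 is the +32-MB LBSR range.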

As shown in Figure 3, both long branch types have the same PALcode operation code (opcode) value of 000000. To distinguish between the two types, the least significant bit in the instruction word is set to 0 for LBRs and to 1 for LBSRs. This bit is not included as a usable bit for the displacement fields of either branch type. Consequently, each LBR has a 25-bit displacement field and each LBSR has a 20-bit field. With a 25-bit usable displacement field, the PALcode performs the LBR target address computation, allowing a ±64-MB range.

Since each LBSR instruction has a 20-bit displacement field, whereas the original Alpha architecture branch displacement field is 21 bits, the target instruction address computation for LBSR instructions is performed differently than for standard branches within the PALcode. As shown in the address computation equation, the 20-bit displacement is multiplied by 32 rather than by 4 (as for the LBR branch). Notice that the 20-bit displacement is always zero extended. The computation provides the LBSR instruction with a displacement range of +32 MB.

This computation procedure has two implications. First, LBSR instructions can only be used to branch from an image code section to an image's patch section. Second, branches into the patch section are either BR or BSR instructions (or their long-displacement counterparts). PatchWrx uses only BR or LBR instructions to return from the patch section to the original branch target within a code section; BSR and LBSR instructions are never used. Therefore, restricting LBSR instructions to use positive displacements does not present a problem.

The LBSR displacement multiplier value of 32 does present some restrictions, however. The multiplier value of 4 used in the original Alpha instruction set architecture represents the instruction word length of 4 bytes. Thus, normal branch instruction target addresses must be aligned on a 4-byte boundary. By using the multiplier value of 32 for LBSR instructions, LBSR target addresses are restricted to align on a 32-byte (i.e., eight-instruction) boundary. Since all LBSR targets reside within the patch section, this restriction does not pose a problem. If an LBSR is to be inserted into the image code section and the next available patch target address is not aligned properly, PatchWrx can insert no-operation (NOP) instruction words and advance the next available patch target address until the necessary alignment is achieved. PatchWrx never executes the NOPs; they are inserted for alignment purposes only. Although inserting these NOP instructions increases the image size, we have implemented several optimizations into the instrumentation algorithm to minimize this increase. For example, a queue is used to hold LBSRs that do not align. As LBR patches are committed, PatchWrx probes the queue to determine if any LBSRs align from their origin to the newly available patch target offset.

Trace Capture

The PatchWrx toolset allows the user to turn tracing on and off and thus capture any portion of workload execution. The tracing tool is also responsible for copying trace entries from the physical memory buffer to disk. Copying the trace buffer to disk is performed after tracing has stopped so that the time required to perform the copy does not introduce any overhead during trace capture.

PatchWrx logs a trace entry for each patch encountered during program execution. As it executes instructions within the code section, PatchWrx encounters an unconditional PatchWrx branch. Instead of branching to the original target, the patched branch transfers control to the image's patch section. Within the patch section, a PatchWrx PAL call traps to the PAL routine corresponding to the patch type and logs a trace entry to the trace buffer. The PAL routine then returns to the instruction following the CALL_PAL instruction. PatchWrx uses an unconditional branch to transfer control from the patch section back to the original target within an image code section. During the execution of the PatchWrx PAL routine, necessary machine state information is recorded and logged in the trace buffer. This allows for the capture of register contents, process ID information, etc., which are used later during trace reconstruction.

The trace capture facility captures the dynamic execution of a workload running on the system. To reconstruct the trace after it has been captured, the tracing tool must also capture a snapshot of the base load addresses of all active images on the system. This snapshot serves as the virtual address map used in reconstructing the trace. Each active process and its associated libraries is loaded into a separate address space, which may be different than the preferred load address as specified statically in the image header. If each image were loaded into memory at its preferred base address, the virtual address map would not be necessary to perform reconstruction. Instead, PatchWrx could map target addresses from the trace buffer using the base address values contained in the static image headers.

The type of trace record that PatchWrx logs into the trace buffer depends on the type of branch or low-level PAL function being traced. Figure 4 shows the trace record formats. The first three trace entry formats consist of an 8-bit opcode and a 24-bit time stamp. The time stamp is the low-order 24 bits of the CPU cycle counter. The 32-bit field of these three formats depends on the type of trace entry logged. The first format is used for target virtual addresses for all unconditional direct and indirect branches, jumps, calls, returns, interrupts, and returns from interrupts. The 32-bit field of the second format is used to record the base register value for traced load and store instructions and stack pointer values that are flushed into the trace buffer during system calls and returns. The 32-bit field of the third format is used for logging the current active process ID at a context swap.
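The 32-byte alignment rule for LBSR patch targets described earlier in this section can be sketched as a small padding calculation (a simplified illustration; the helper is ours):

```python
WORD = 4          # Alpha instructions are 4 bytes each
LBSR_ALIGN = 32   # LBSR targets must sit on a 32-byte boundary

def nop_words_for_alignment(next_patch_offset):
    """Number of NOP instruction words to insert before the next patch
    so that its address is usable as an LBSR target."""
    over = next_patch_offset % LBSR_ALIGN
    return 0 if over == 0 else (LBSR_ALIGN - over) // WORD
```

At worst this costs seven NOP words per misaligned patch, which is why the toolset queues unaligned LBSRs and retries them as other patches shift the next available offset.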

  FORMAT 1: OPCODE (8) | TIME STAMP (24) | TARGET PC (32)

  FORMAT 2: OPCODE (8) | TIME STAMP (24) | BASE REGISTER VALUE (32)

  FORMAT 3: OPCODE (8) | TIME STAMP (24) | NEW PROCESS ID (32)

  FORMAT 4: OPCODE (3) | START BIT (1) | VECTOR OF 60 TAKEN/FALL-THROUGH TWO-WAY BRANCH BITS (60)

(Field widths are in bits.)

Figure 4 Trace Entry Formats
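One way to picture the first three formats is as a packed 64-bit record. The bit placement below is an assumption for illustration only; the paper specifies the field widths but not their order within the word:

```python
def pack_trace_entry(opcode, timestamp, payload):
    """Pack an 8-bit opcode, the low 24 bits of the CPU cycle counter,
    and a 32-bit payload (target PC, base register value, or process
    ID) into one 64-bit trace entry."""
    return ((opcode & 0xFF) << 56) | ((timestamp & 0xFFFFFF) << 32) | (payload & 0xFFFFFFFF)

def unpack_trace_entry(entry):
    """Split a 64-bit trace entry back into its three fields."""
    return ((entry >> 56) & 0xFF,
            (entry >> 32) & 0xFFFFFF,
            entry & 0xFFFFFFFF)
```

Masking the timestamp to 24 bits mirrors the paper's use of only the low-order bits of the cycle counter, so timestamps wrap and must be interpreted relative to neighboring entries.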

The fourth trace entry type is used for tracing conditional branches. It uses a 3-bit opcode and up to 60 taken/fall-through bits. A start bit is used to determine how many bits are active. Each bit is set to 1 if a conditional branch is taken and to 0 if the branch is not taken. This recording scheme allows a compact encoding of conditional branch trace entries. During trace reconstruction, PatchWrx uses conditional branch trace entries to reconstruct the correct instruction flow when conditional branches are encountered and to provide concise information about when to deliver interrupts in loops.

Trace Reconstruction

The reconstruction phase is the final step in generating a full instruction stream of traced system activity. As shown in Figure 5, trace reconstruction requires several resources in order to generate an accurate instruction stream of all traced system activity.

Trace reconstruction reads and initializes the heading of the captured trace, which includes a time stamp, the name of the user who captured the trace, and any important system configuration information, e.g., the operating system version number. Next, reconstruction reads the first four raw trace records, which are automatically entered whenever tracing is turned on. These records contain the first target virtual address, the active process ID, the value of the stack pointer, and the first taken/fall-through record to be used (such records always precede the branches they represent). PatchWrx uses this information to initialize the necessary data structures of the reconstruction process.

Using the first target virtual address and process ID pair from the captured trace, trace reconstruction consults the virtual address map to determine in which image the instruction falls (based on its dynamic base load address) and where that image is physically located on the system. The tool consults the patched image to determine the actual instruction at the target address, records this instruction, and then reads the next instruction from the patched image. This process continues until reconstruction encounters either a conditional branch or an unconditional branch. A conditional branch causes the tool to check the first active bit of the current taken/fall-through entry to determine subsequent control flow; the process then continues at that address. If an unconditional branch is encountered, reconstruction records the entry and checks it against the next captured trace entry. If the two entries match, the tool outputs the recorded instructions to an instruction stream file, consults the captured trace entry for the next target instruction virtual address, and repeats the procedure until the entire captured trace has been processed.

Since PatchWrx captures interrupts and other low-level system activities (e.g., page faults) in the trace, these activities must also be reconstructed. When PatchWrx logs an interrupt into the trace buffer, the corresponding target virtual address in the captured record represents the address of the first instruction not executed when the interrupt was taken. PatchWrx flushes the currently active taken/fall-through entry to the memory buffer and initializes a new taken/fall-through entry. This new entry will be responsible for
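The compact conditional-branch encoding can be sketched with a start bit that marks how many outcome bits are live. The exact bit order is not given in the paper, so the layout below is a hypothetical one that is merely consistent with the description:

```python
MAX_BITS = 60  # a trace entry holds up to 60 taken/fall-through bits

def record_outcomes(outcomes):
    """Shift taken (True) / fall-through (False) bits into a word whose
    leading start bit marks how many outcome bits follow it."""
    assert len(outcomes) <= MAX_BITS
    word = 1  # the start bit
    for taken in outcomes:
        word = (word << 1) | int(taken)
    return word

def replay_outcomes(word):
    """Recover the outcomes, oldest first, as reconstruction would
    when resolving conditional branches."""
    outcomes = []
    while word > 1:  # stop when only the start bit remains
        outcomes.append(bool(word & 1))
        word >>= 1
    outcomes.reverse()
    return outcomes
```

Because the start bit always travels with the data, an entry needs no separate length field: scanning for the topmost set bit tells the decoder how many branch outcomes the entry holds.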

[Figure 5 shows the resources consumed by reconstruction: the patched images, the captured raw trace, and the virtual address map all feed the reconstruction tool, which produces the reconstructed instruction stream.]
Figure 5 Instruction Stream Reconstruction Resources

the conditional branches encountered beginning with the interrupt service routine. The address of the first instruction within the interrupt service routine is then logged in the trace.

During reconstruction, the reconstruction tool looks for the interrupt's first unexecuted instruction address to know which instruction to stop at when reconstructing the instruction stream. The tool then begins reconstructing the instruction stream, including the interrupt handler stream. If the unexecuted instruction is within a loop, trace reconstruction utilizes the taken/fall-through entry convention. On taking the interrupt, the active taken/fall-through record is flushed and another record is started. This process allows the tool to continue to reconstruct iterations of the loop until all the taken/fall-through bits are exhausted.

Operating System-Rich Workload Characterization

As presented in the study by Lee et al., desktop applications and benchmarks share some workload characteristics, but applications alone do not represent full system behavior. To investigate and address system design issues, computer architects should use operating system-rich traces.

To illustrate this point, we present a sample of the various workload characteristics that exist in a set of benchmark and desktop applications specially selected to study the differences in the use of the operating system and related services. The first characteristic we discuss is the amount of time each benchmark or desktop application spends within three domains:

1. Application-only domain (e.g., winword.exe and excel.exe)
2. DLL domain: Win32 user (e.g., kernel32.dll, user32.dll, and ntdll.dll)
3. Operating system domain: Win32 kernel, kernel, system processes, system idle process (e.g., Win32K.sys, ntoskrnl.exe, drivers, and the spooler)

Examining these times provides insight into a workload's use of each domain. We also examine DLL and system service usage on an image basis for each workload. This breakdown helps us more clearly identify the dependence between the workload and the system services provided by the Windows NT operating system. We also present the instruction mix of each workload with and without the inclusion of the operating system execution. Understanding the differences in instruction composition in the presence of system activity further highlights the behavior lacking in application-only traces, such as increases in branch and memory instructions, when compared to application-only workloads.

We present the average basic block lengths for each domain of execution (application-only, DLL, operating system) separately and then in combination. This metric reveals which workload domain dominates the branching behavior. Casmira's work provides a more complete description of these differences across a wider set of workload characteristics.

Workload Descriptions

We performed all the experiments reported on in this paper on a DIGITAL Alpha platform running the Microsoft Windows NT version 4.0 operating system. We captured the traces on a 150-megahertz Alpha 21064 processor. The system configuration included 80 MB of physical memory. Table 3 lists the workloads we examined.


Workload Description

fourier BYTEmark benchmark; a numerical analysis routine for calculating series approximations of waveforms neural BYTEmark benchmark; a small, functional back-propagation network simulator go SPEC95 Go! game benchmark li SPEC95 Lisp interpreter benchmark cdplay Microsoft CD Player playing a music CD fx !32 DIGITAL FX 132 V1.1 interpreting/translating included OpenGL sample x86 application ie Microsoft Internet Explorer V2.0 fo llowing a series of web page links vc50 Microsoft Visual C/C++ VS.O com piling a 3,000-line C program word Microsoft Wo rd97 V7.0, spell-checking a 15-page document

The fourier and neural workloads are from the BYTEmark benchmark test suite: the neural workload is a small array-based floating-point test; the fourier workload is designed to measure transcendental and trigonometric floating-point unit performance. The go and li workloads are from the SPEC95 integer benchmark suite: the go workload is a simulation of the game Go, with the computer playing against itself; the li workload is a Lisp interpreter. All the workloads use the standard inputs provided with the benchmarks and are compiled with the default optimization level using the native Alpha version of Microsoft C/C++ version 5.0.

The cdplay workload is the Microsoft CD Player application included in Microsoft Windows NT version 4.0. The device was traced while playing a music CD using default playing options (e.g., playing all the songs in order).

The fx!32 workload is the DIGITAL FX!32 version 1.1 emulator/translator provided by Compaq's DIGITAL Alpha Migration Tools Group. We ran the robot arm OpenGL sample Intel-based application in the foreground during trace capture.

The ie workload is the standard Microsoft Internet Explorer version 2.0 workload included in Microsoft Windows NT version 4.0. The ie workload was traced while traversing four links through the Sony home page, arriving finally at the Sony PlayStation Store web page. The trace was captured on May 4, 1998; pages may have changed since this date. The history cache and the web link cache were both empty when the trace was captured.

The vc50 workload is the Microsoft C/C++ version 5.0 compiler compiling a 3,000-line C source code file. We used the command line interface, and we used the default optimization levels and other parameters, which best represented the common usage of the compiler.

The word workload is Microsoft Word from the Microsoft Office97 desktop application suite for the Alpha processor, used to capture a manual spell check of a 15-page Microsoft Word document. The standard Microsoft Word dictionary was employed.

To provide a clear and representative comparison of workload behavior, we captured several traces. For all scenarios, full traces of each workload captured approximately 5 to 10 seconds of execution, filling the 45-MB trace buffer. To characterize workload behavior, each experiment was run with the benchmark or the application as the only activity on the system. Each workload was run in the foreground.

To ensure that the traces captured were representative of the overall workload behavior, we captured multiple traces. We chose different points during execution for tracing to allow comparison between different portions of the selected scenarios. To investigate the variability present in selected workloads, we traced additional scenarios. A second Microsoft Word trace was captured with the application performing an autoformat operation of the same document used in the first trace of the spell-check operation, and we captured a second Microsoft Internet Explorer trace, repeating the Sony links but with the links cached. We captured a second trace of FX!32 using the included boggle sample game (for comparison against using the OpenGL application input). Additionally, the FX!32 translator was traced while it optimized a native Intel x86 application's profile. To condense the number of memory pages occupied by an image, Microsoft designed the new linker to allow data to reside within the code regions. Hookway and Herdeg provide an explanation of the DIGITAL FX!32 emulation and translation/optimization procedures. Casmira discusses these scenarios and others.

Domain Mix

To illustrate the inherent differences between benchmark and desktop application behavior, we break down the captured trace in terms of three mutually exclusive domains. These domains are (1) application, (2) DLL, and (3) operating system. The application domain represents the set of executed instructions that are within the traced application's executable image.


[Figure 6 is a bar chart plotting percent composition of the trace (0 to 100 percent, y-axis) across the APP, DLL, and OS domains for each workload (x-axis): fourier, neural, go, li, cdplay, fx!32, ie, vc50, and word.]

Figure 6 Domain Execution Mix

although small, is present; all I/O must be accessed by means of a system service.

The Microsoft Word spell-checking service is provided by means of a DLL included with the application. Thus for the word workload, this DLL handles both the search through the document and the successive dictionary lookups. Operating system services are required for accessing portions of the file residing on disk (not in memory pages), for displaying the search and compare results to the user, and for performing the user-driven I/O associated with accepting/rejecting word replacement choices (prompted by the spell-checking tool).

Figure 6 shows the consistent pattern of instruction domains that the four benchmarks follow in contrast to the variability in the instruction domain mix of the desktop application workloads. Even though there is slight operating system activity for go and li (attributable to I/O services), the benchmarks spend practically all their execution within their application images; no DLL use is visible. Clearly these benchmarks do not utilize system services to the level observed in the commercial desktop workloads. With the exception of the CD player, the commercial desktop applications examined use DLLs more heavily than they do operating system services. This is especially true in the fx!32 and word workloads, which carry out the tasks captured in the trace by means of DLL routines.

Characterization of Image Usage

To investigate the domains present in the trace at the image level, we identified the top five most heavily used images, based on the number of instructions executed in each image. First, an explanation of some of the more frequently used system executables and DLLs is in order. Table 4 lists the names of the commonly used images and a brief description of each.

We present the image usage of the nine traces. This characterization includes all the images (e.g., executables, DLLs, services, and drivers) listed in Table 5. The data helps demonstrate several points. First, commercial desktop workloads spend a lot more time in DLLs than benchmarks do. Consequently, we can project that the number of procedure calls in desktop applications will be higher than the number of calls in benchmarks. Second, real applications depend not only on system DLLs but also on their local DLLs. We see this behavior explicitly with the Microsoft Word application.

Instruction Mix

Although understanding the domain mix and image usage helps identify differences between benchmarks and desktop applications, we would like to look deeper within each domain to see inherent differences that affect design decisions. Figure 7 shows the application-only instruction mix (i.e., the instruction mix for only the application and application-specific DLLs) for each workload. Each entry in the legend represents a class of instructions found within the application domain. The y-axis denotes the percent composition of the trace; the workloads are displayed on the x-axis.

Note that the instruction mix for the fx!32 workload is zero. This value is a result of the lack of execution within the application image itself. Referring back to Table 5 and the domain instruction mix, note that nearly all the workload execution is within DLLs (some execution is within ntoskrnl.exe). The remaining workloads consist mainly of load, store, conditional branch, and arithmetic and logic unit (ALU) logic operations. No overriding characteristic differentiates benchmarks and desktop applications. Note the significant variability in the instruction mix among the different benchmarks and among the different desktop applications.

Figure 8 shows the instruction mix of the entire trace. The first and most noticeable difference between the application domain and full-trace instruction mix figures is the increase in instruction types present in the trace. Nine instruction classes were present in the application domain instruction mixes, while 17 are present in the full-system traces. Worth noting is the presence of 6 CALL_PAL instruction types (all use the same opcode, but invoke 6 different PAL routines) in the full traces. Since each executed CALL_PAL instruction causes a trap that takes on the order of tens of cycles to complete, we can conclude that this is a

Table 4 Common System Images

Name          Description
ntoskrnl.exe  Windows NT operating system kernel core
hal.dll       Hardware Abstraction Library (HAL), which is responsible for the underlying hardware interface
kernel32.dll  Main kernel library
win32k.sys    Kernel-mode device driver
gdi32.dll     Graphics display interface library
ntdll.dll     Library routines provided to each client process on the Windows NT system
MSVCRT.dll    Microsoft C/C++ run-time library
s3.dll        Graphics adapter library for the test platform
qv.dll        Graphics adapter library for the test platform

Table 5 The Five Most Frequently Used Images in Each Application or Benchmark

Workload  Image Name (Percentage of Total Number of Instructions Executed within the Image)
fourier   bytecpu.exe (99.5%), winsrv.dll (0.2%), win32k.sys (0.1%), ntoskrnl.exe (0.1%), user32.dll (0.02%), Other (0.08%)
neural    bytecpu.exe (99.7%), winsrv.dll (0.2%), ntoskrnl.exe (0.03%), win32k.sys (0.03%), ntdll.dll (0.02%), Other (0.02%)
go        go.exe (95.5%), win32k.sys (2.0%), ntoskrnl.exe (1.0%), hal.dll (0.4%), qv.dll (0.1%), Other (1.0%)
li        li.exe (97.7%), win32k.sys (1.0%), ntoskrnl.exe (0.6%), user32.dll (0.1%), qv.dll (0.1%), Other (0.5%)
cdplay    ntoskrnl.exe (81.8%), hal.dll (14.7%), win32k.sys (1.1%), tcpip.sys (0.4%), winsrv.dll (0.3%), Other (1.7%)
fx!32     hal.dll (42.5%), s3.dll (24.6%), OPENGL32.DLL (12.2%), MSVCRT.dll (11.7%), GLU32.dll (2.7%), Other (6.3%)
ie        iexplore.exe (37.2%), win32k.sys (19.3%), ntoskrnl.exe (17.5%), Fastfat.sys (6.1%), ntdll.dll (6.0%), Other (13.9%)
vc50      c1.exe (83.1%), ntoskrnl.exe (10.5%), MSVCRT.dll (2.8%), Ntfs.sys (1.2%), win32k.sys (1.1%), Other (1.3%)
word      MSSP232.DLL (36.4%), MSGREN32.DLL (34.0%), ntoskrnl.exe (10.2%), win32k.sys (7.7%), hal.dll (4.0%), Other (7.7%)
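A per-image breakdown like Table 5 can be collapsed into the three-domain mix of Figure 6 roughly as follows; the image-to-domain classification here is our own simplification, not the paper's exact rule set:

```python
OS_IMAGES = {"ntoskrnl.exe", "hal.dll"}  # kernel images; *.sys drivers handled below

def domain_mix(image_percentages, app_image):
    """Fold per-image execution percentages into the application,
    DLL, and operating system domains."""
    mix = {"APP": 0.0, "DLL": 0.0, "OS": 0.0}
    for image, pct in image_percentages.items():
        if image == app_image:
            mix["APP"] += pct
        elif image in OS_IMAGES or image.lower().endswith(".sys"):
            mix["OS"] += pct
        else:
            mix["DLL"] += pct
    return mix
```

Applied to the go row of Table 5, for example, go.exe contributes to APP, win32k.sys and ntoskrnl.exe to OS, and qv.dll to DLL.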

significant insight into the system's inherent run-time latency, not visible with application-only workloads.

Next note the striking similarities in instruction mix for the four benchmarks in Figures 7 and 8. Benchmarks do not interact with the operating system in any significant manner. The desktop application workloads, however, show significant differences between the application domain and the complete trace instruction mixes.

The number of store instructions for the cdplay workload decreases from about 11 percent to approximately 1 percent. The number of BSR instructions increases from 1 percent to about 6 percent. Most interesting for this application is the decrease in the number of ALU operations from almost 30 percent to about 2 percent, while the number of CALL_PAL instructions increases from 0 to 21 percent. Referring to Figure 6, the domain execution mix plots clearly show why the differences for this workload are so large when the system activity is included: more than 95 percent of the workload trace is operating system execution. Considering the latency incurred by executing CALL_PAL instructions, clearly an optimization that concentrates on improving ALU operations based on the application domain instruction mixes would have a much smaller impact on the true system performance. The measured difference in instruction mix underscores the importance not only of using real workloads for trace-driven simulations but also of including the operating system behavior in order to see the full picture.

The fx!32 complete trace instruction mix is, of course, completely different from the application instruction mix of Figure 7, in which no instructions were executed within the fx!32 application image. Both the ie and the word workloads introduce CALL_PAL instructions when including the operating system. The ie instruction mix shows an increase in jumps, calls, and returns, which most likely reflects the increase in subroutine calls for system services. The word instruction mix experiences a reduction in load instructions from approximately 52 percent to 35 percent. This decrease can be attributed to the increase in ALU operations present when operating system activity is included.

The results presented in Figures 7 and 8 reinforce the points that benchmarks do not represent true desktop workloads and that the desktop workloads display significantly different characteristics when viewed in the presence of system activity.

Average Basic Block Length

Including the operating system activity in our traces yields an overall increase in the percentage of control flow instructions present. Figure 9 shows a consequence of this fact. In this figure, we present the average basic block length for each workload, on a per-domain basis. The ALL bar is the average basic block length across all domains; OS denotes the operating system instructions only; DLL denotes the workload's DLL instructions only; APPDLL denotes the combined application and DLL instructions; and APP denotes the application instructions only.

Inspecting the four benchmarks, we notice little difference between the application-only basic block length and the overall basic block length. Referring to our domain instruction mix figure, recall that the benchmarks spend about 95 percent of their execution

[Figure 7 is a bar chart of instruction composition (percent) per workload (FOURIER, NEURAL, GO, LI, CDPLAY, FX!32, IE, VC50, WORD); key: ALU/LOG, JSR, RET, LD, ST, BRXX, BR, BSR, JMP.]

Figure 7
Application-only Instruction Mix

[Figure 8 is a bar chart of instruction composition (percent) per workload (FOURIER, NEURAL, GO, LI, CDPLAY, FX!32, IE, VC50, WORD); key: ALU/LOG, PMISC, SWPIRQL, RETSYS, RDTHREAD, RDTEB, CALLSYS, MB, TRAPB, BSR, BR, BRXX, ST, LD, RET, JSR, JMP.]

Figure 8
Complete Trace Instruction Mix

[Figure 9 is a bar chart of the average basic block length (in instructions) for each workload (FOURIER, NEURAL, GO, LI, CDPLAY, FX!32, IE, VC50, WORD), broken out by domain; key: ALL, OS, DLL, APPDLL, APP.]

Figure 9
Average Basic Block Length

within their executable images. Therefore, including any operating system activity into a basic block length average has a minimal effect.

However, considering the large amount of operating system execution present in the cdplay trace, the overall basic block length is significantly less than the application-only length. The overall and operating system length values are almost the same. Not only does including the system activity in the trace influence the overall basic block length, but the amount of system activity determines to what degree the length is affected.

In a similar fashion, the overall basic block length of the fx!32 trace tracks that of its DLLs. The length is directly proportional to the amount of time the workload spends in its DLL domain. The execution of the ie workload is more evenly distributed among the three domains, which affects the overall basic block length, producing a more evenly weighted average of all its domain basic block lengths (no one domain dominates).

The vc50 workload spends a significant amount of time within its own executable image, which leads to an overall average basic block length similar to the application-only value. The word workload is similar, but the DLL behavior dominates. The cdplay and ie workloads experience a 50 percent decrease in average basic block length. This decrease can be attributed to an increase in the number of branches in the presence of operating system activity. With this increase in control flow instructions, we expect increased pressure to be placed upon the branch prediction hardware.

As observed in other characteristic categories, the four benchmarks do not exhibit noticeable deviations from application-only behavior when the operating system activity is introduced. Again this explains why simulation results using benchmark traces usually track the actual performance when the benchmarks are run on the real system. In contrast, four of the five desktop applications exhibit significantly different behavior in the presence of the operating system.

Summary

In this paper we described the PatchWrx toolset. We compared it to existing tools and demonstrated the need for operating system-rich traces by showing the amount of the total execution spent in the kernel and the DLLs. In addition, we showed that existing desktop benchmarks do not exercise the kernel and the DLLs sufficiently to provide meaningful indicators of desktop performance.

These results have reinforced our argument that researchers need to use traces with both application and operating system information, especially as new applications spend more time executing within the operating system. The goal is for computer architects to use operating system-rich traces of applications that dominate the desktop market.

Acknowledgments

We would like to acknowledge the help and advice of the following people: Richard Sites of Adobe Systems; Sharon Smith, Geoff Lowney, Joel Emer, Steve Thierauf, Tom Wenners, Paul Delvy, and Dan Lambalot, all from Compaq Computer Corporation; and Robert Davidson from Microsoft Research. Jason Casmira and David Kaeli have been supported by a National Science Foundation CAREER grant.

References and Notes

1. SPEC Newsletter (September 1995).

2. Information about the BYTEmark benchmark suite is available from BYTE Magazine at http://www.byte.com/bmark/bmark.htm.

3. S. Perl and R. Sites, "Studies of Windows NT Performance Using Dynamic Execution Traces," Proceedings of the Second USENIX Symposium on Operating System Design and Implementation (October 1996): 169-183.

4. D. Kaeli, "Issues in Trace-Driven Simulation," Lecture Notes in Computer Science, No. 729, Performance Evaluation of Computer and Communication Systems, L. Donatiello and R. Nelson, eds. (Springer-Verlag, 1993): 224-244.

5. R. Uhlig and T. Mudge, "Trace-Driven Memory Simulation: A Survey," ACM Computing Surveys, vol. 29, no. 2 (June 1997): 128-170.

6. J. Emer and D. Clark, "A Characterization of Processor Performance in the VAX-11/780," Proceedings of the Eleventh Symposium on Computer Architecture (June 1984): 126-135.

7. K. Flanagan, J. Archibald, B. Nelson, and K. Grimsrud, "BACH: BYU Address Collection Hardware; The Collection of Complete Traces," Proceedings of the Sixth International Conference on Modeling Techniques and Tools for Computer Evaluation (1992): 51-65.

8. D. Kaeli, O. LaMaire, W. White, P. Henner, and W. Starke, "Real-Time Trace Generation," International Journal on Computer Simulation, vol. 6, no. 1 (1996): 53-68.

9. D. Kaeli, L. Fong, D. Renfrew, K. Imming, and R. Booth, "Performance Analysis on a CC-NUMA Prototype," IBM Journal of Research and Development.

11. B. Chen and B. Bershad, "The Impact of Operating System Structure on Memory System Performance," Operating Systems Review, vol. 27, no. 5 (December 1993): 120-133.

12. J. Larus, "Abstract Execution: A Technique for Efficiently Tracing Programs," Technical Report CS-TR-90-912, University of Wisconsin-Madison, 1990.

13. A. Srivastava and A. Eustace, "ATOM: A System for Building Customized Program Analysis Tools," Proceedings of the ACM SIGPLAN '94 Conference on Programming Language Design and Implementation, Orlando, Fla. (June 1994): 196-205.

14. M. Rosenblum, S. Herrod, E. Witchel, and A. Gupta, "Complete Computer System Simulation: The SimOS Approach," IEEE Parallel and Distributed Technology, 1998, forthcoming.

15. M. Rosenblum, E. Bugnion, S. Devine, and S. Herrod, "Using the SimOS Machine Simulator to Study Complex Computer Systems," ACM Transactions on Modeling and Simulation, vol. 7, no. 1 (January 1997): 78-103.

16. A. Agarwal, Analysis of Cache Performance for Operating Systems and Multiprogramming (Kluwer Academic Publishers, 1989).

17. J. Larus and E. Schnarr, "EEL: Rewriting Executable Files to Measure Program Behavior," Proceedings of the ACM SIGPLAN '95 Conference on Programming Language Design and Implementation, La Jolla, Calif. (June 1995): 291-300.

18. D. Lee, P. Crowley, J.-L. Baer, T. Anderson, and B. Bershad, "Execution Characteristics of Desktop Applications on Windows NT," Proceedings of the Twenty-fifth International Symposium on Computer Architecture, Barcelona, Spain (June 1998).

19. E. Bem, D. Hunter, and S. Smith, "Moving ATOM to Windows NT for Alpha," Digital Technical Journal, vol. 10, no. 2, accepted for publication.

20. M. Smith, "Tracing with Pixie," Technical Report CSL-TR-91-497, Stanford University, November 1991.

21. R. Cmelik and D. Keppel, "Shade: A Fast Instruction-Set Simulator for Execution Profiling," Proceedings of ACM Sigmetrics (May 1994): 128-137.

22. Alpha AXP Architecture Handbook, Order No. EC-QD2KA-TE (Maynard, Mass.: Digital Equipment Corporation, October 1994).

23. H. Custer, Inside Windows NT (Redmond, Wash.: Microsoft Press, 1993).

24. Microsoft Software Developer's Toolkit. This toolkit is available at http://msdn.microsoft.com/developer/sdk/platform.htm.

25. J. Casmira, "Operating System Rich Workload Characterization," Master's thesis, ECE-CEG-98-018, Northeastern University, May 1998.

26. R. Hookway and M. Herdeg, "DIGITAL FX!32: Combining Emulation and Binary Translation," Digital Technical Journal, vol. 9, no. 1 (1997): 3-12.

Biographies

David R. Kaeli
David Kaeli received Ph.D. (1992) and B.S. (1981) degrees in electrical engineering from Rutgers University and an M.S. degree in computer engineering from Syracuse University in 1985. He joined the electrical and computer engineering faculty at Northeastern University in 1993 after spending 12 years at IBM, the last 7 of which were at the IBM T. J. Watson Research Center in Yorktown Heights, New York. David is the director of the Northeastern University Computer Architecture Research Laboratory (NUCAR), where he investigates the performance and design of high-performance computer systems and software. His current research topics include I/O workload characterization, branch prediction studies, memory hierarchy design, object-oriented code execution performance, 3-D microelectronics, and back-end compiler design. He frequently gives tutorials on the subject of trace-driven characterization.

Jason P. Casmira
Jason Casmira received B.S. and M.S. degrees in electrical engineering from Northeastern University in 1996 and 1998, respectively, and is pursuing a Ph.D. degree in computer science at the University of Colorado, Boulder. For the past two years, Jason was a member of the Northeastern University Computer Architecture Research Laboratory (NUCAR), where he focused on developing the current version of the PatchWrx tracing toolset. He also investigated issues related to studying operating system-rich traces.

David P. Hunter
David Hunter is the engineering manager of Compaq Computer Corporation's Advanced and Emerging Technologies Group. Prior to that he was the manager of DIGITAL's Software Partner Engineering Advanced Development Group, where he was involved in performance investigations of databases and their interactions with the UNIX and Windows NT operating systems. He has held positions in the Alpha Migration Organization, the ISV Porting Group, and the Government Group's Technical Program Management Office. David joined DIGITAL's Laboratory Data Products Group in 1983, where he developed the VAXlab User Management System. He was the project leader of the advanced development project, ITS, an executive information system, for which he designed hardware and software components. David has two patent applications pending in the area of software engineering. He holds a degree in electrical and computer engineering from Northeastern University in Boston, Massachusetts, and a diploma in National Security and Strategic Studies from the Naval War College in Newport, Rhode Island.

Avrum E. Itzkowitz
Lois D. Foltan

Automatic Template Instantiation in DIGITAL C++

Automatic template instantiation in DIGITAL C++ version 6.0 employs a compile-time scheme that generates instantiation object files into a repository. This paper provides an overview of the C++ template facility and the template instantiation process, including manual and automatic instantiation techniques. It reviews the features of template instantiation in DIGITAL C++ and focuses on the development and implementation of automatic template instantiation in DIGITAL C++ version 6.0.

The template facility within the C++ language allows the user to provide a template for a class or function and then apply specific arguments to the template to specify a type or function. The process of applying arguments to a template, referred to as template instantiation, causes specific code to be generated to implement the functions and static data members of the instantiated template as needed by the program. Automatic template instantiation relieves the user of determining which template entities need to be instantiated and where they should be instantiated.

In this paper, we review the C++ template facility and describe approaches to implementing automatic template instantiation. We follow that with a discussion of the facilities, rationale, and experience of the DIGITAL C++ automatic template instantiation support. We then describe the design of the DIGITAL C++ version 6.0 automatic template instantiation facility and indicate areas to be explored for further improvement.

C++ Template Facility

The C++ language provides a template facility that allows the user to create a family of classes or functions that are parameterized by type. For example, a user may provide a Stack template, which defines a stack class for its argument type. Consider the following template declaration:

template <class T> class Stack {
    T *top_of_stack;
public:
    void push (T *arg);
    void pop (T *arg);
};

The act of applying the arguments to the template is referred to as template instantiation. An instantiation of a template creates a new type or function that is defined for the specified types. Stack<int> creates a class that provides a stack of the type int. Stack<user_class> creates a class that provides a stack of user_class. The types int and user_class are the arguments for the template Stack.

In general, a template needs to be instantiated when it is referenced. When a class template is instantiated, only those member functions and static data members that are referenced are also instantiated. In the Stack example, the member function Push of the class Stack<int> needs to be instantiated only if it is used. Template functions and static data members have global scope; therefore, only one instantiation of each should be in a user's application. Since source files are compiled separately and combined later at link time to produce an executable, the compiler alone is not able to ensure that one and only one instance of a specific template is efficiently generated for any given executable. That is, the compiler by itself is not able to know whether the function or variable definition for a specific template is satisfied by code generated in another object module.

The C++ Standard provides facilities for the user to specify where a template entity should be instantiated. When the user explicitly specifies template instantiation, the user then becomes responsible for ensuring that there is only one instantiation of the template function or static data member per application. This responsibility can necessitate a considerable amount of work. However, the compiler and linker working together can provide effective template instantiation without specific user direction.

In the following section, we present the various approaches that can be used for template instantiation.

Template Instantiation Techniques

Template instantiation techniques can be broadly categorized as either manual or automatic. With manual instantiation, the compilation system responds to user directives to instantiate template entities. These directives can be in the source program, or they may be command-line options. With automatic instantiation, the compilation system, including the linker, decides which instantiations are required and attempts to provide them for the user's application.

Manual Instantiation

Manual template instantiation is the act of manually specifying that a template should be instantiated in the file that is being compiled. This instantiation is given global external linkage, so that references to the instantiation that are made in other files resolve to this template instantiation. Manual template instantiation includes explicit instantiation requests and pragmas as well as command-line options.

Explicit Instantiation Requests and Pragmas
The compilation system instantiates those template entities that the user specifies for instantiation. The specification can be made using the C++ explicit template instantiation syntax or may be made using implementation-defined directives or pragmas. Since instantiations are given global external linkage, the user must ensure that the specified template instantiations appear only once throughout all the modules that compose the program. When only this mode of instantiation is used, the user also must ensure that all required template instantiations are specified to avoid unresolved symbols at link time.

Command-line Instantiation
Command-line options can be used to specify template instantiation. They are similar in operation to the explicit instantiation requests, except they indicate groups of templates that should be instantiated, rather than naming specific templates to be instantiated. The command-line options include:

• Instantiate All Templates. A command-line option can direct the compiler to instantiate all template entities whose definitions are known during compilation and whose argument types are specified. This has the advantage of specifying many template instantiations at once. The user must still ensure that no template instantiation happens more than once in the program and that all required instantiations are satisfied. Due to these requirements, the user cannot usually specify this option on more than one source-file compilation in the program. This option can also cause the instantiation of templates that are not used by the program.

• Instantiate Used Templates. A command-line option can be used to direct the compiler to instantiate only those template entities that are used by the source code and whose definitions are known at compilation. As in the previous technique, the user must ensure that no template instantiation happens more than once in the program and that all required instantiations are satisfied. Due to these requirements, the user cannot usually specify this option on more than one source-file compilation in the program.

• Instantiate Used Templates Locally. This command-line option works like the instantiate used templates option, except that it defines each template instantiation locally in the current compilation. This option has the advantage of providing complete template instantiation coverage for the program, as long as the definitions of the used templates are available in each module. Since all template instantiations are given local scope, there is no potential problem with multiply defined instantiations when the program is linked. The major problem with this technique is that the user's application can be unnecessarily large, since the same template instantiations could appear within multiple object files used to link the application. This technique will fail if the instantiations must have global scope, such as a class's static data members.

Figure 1 shows an example of a template function, template_func, that contains a locally defined static variable. As shown in the figure, the object files of both A and B contain local copies of template_func instantiated with int. Each instance of template_func defines its own version of static variable x. In this case, directing the compiler to instantiate used templates locally yields a different result than instantiating all or used templates globally.

If we give the static data members global scope and ensure that they are properly defined and initialized by executable code rather than by static initialization, we can solve the static data members problem. The application, however, remains unnecessarily large, because multiple copies of the instantiated templates can be present in the executable.

Automatic Instantiation

Automatic template instantiation relieves the user of the burden of determining which templates must be instantiated and where in the application those instantiations should take place. Automatic template instantiation can be divided into two categories: compile-time instantiation, whereby the decision about what should be instantiated is made at compile time, and link-time instantiation, whereby decisions about template instantiation are made when the user's application is linked. In both cases, specific link-time support is needed to select the required instantiations for the executable.

Compile-time Instantiation
Two major techniques can be used to perform automatic template instantiation at compile time. The choice between the two depends upon the facilities available in the linker. Microsoft Visual C++ instantiates templates at compile time using a strategy similar to the instantiate used templates command-line option described previously. Each instantiation is placed in the communal data section (COMDAT) of the current compilation's object file. Each object file contains a copy of every template instantiation needed by that compilation unit. COMDATs are sections that have an attribute that tells the linker to accept, without issuing a warning, multiple definitions of a symbol defined in the section. If more than one object file defines that symbol, only the section from one object file is linked into the image and the rest are discarded, along with all symbols in the symbol table defined in the discarded section contribution. At link time, the linker resolves an instantiation reference by choosing one of the instantiations defined in an individual object file's COMDAT. The resulting user's application executable has a single copy of each requested instantiation.

When such linker support is not available, another mechanism must be used to control compile-time instantiation. One such approach is to use a repository to contain the generated instantiations. The compiler creates the instantiations in the repository instead of the current compilation's object file. At link time, the linker includes any requested instantiations from the repository. As a performance improvement, the compiler can also decide whether an instantiation needs to be generated from the state of the repository. If the requested instantiation is in the repository and can be determined to be up to date, the compiler does not need to regenerate the instantiation.

Link-time Instantiation
The decision to instantiate can be left until link time. The linker can find the instantiations that are needed and direct the compiler to generate those instantiations. McCluskey describes one link-time instantiation scheme. The compiler logs every class, union, struct, or enum in a name-mapping file in a repository. Every declared template is also logged in the name-

// template.hxx:
#include <iostream.h>
template <class T> void template_func (T p) {
    static T x = 0;
    cout << x + p;
    x++;
}

// A.cxx:
#include "template.hxx"
extern void b_func ();
int main () {
    template_func (10);
    b_func ();
    return 0;
}

// B.cxx:
#include "template.hxx"
void b_func (void)
{
    // ...
    template_func (20);
    // ...
}

Figure 1
Template Function Containing a Locally Defined Static Variable

mapping file. At link time, a prelinker determines which template instantiations are required. The prelinker builds temporary instantiation source files in the repository to satisfy the referenced instantiations, compiles them, and adds the resulting object files to the linker input. Consider the example in Figure 2.

During the compilation of main.cxx, a name-mapping file is built in the repository, and the locations of the user-defined class C and the function template, perform_some_function, are recorded. From the information stored in the name-mapping file, an instantiation source file is then created in the repository. Figure 3 shows the contents of the instantiation source file created to satisfy perform_some_function.

/* perform_some_function(C&) */
#include "template.hxx"
#include "template.cxx"
#include "C_class.hxx"

Figure 3
Example of an Instantiation Source File

The prelinker then compiles the instantiation source file by invoking the compiler in a special directed mode, which directs the compiler to generate code only for specific template instantiations that are listed on the command line. The compiler then generates the definition of perform_some_function in the resulting object file. The resulting object now satisfies the instantiation request and is included as part of the application's final link. To build the instantiation source files easily, the implementation of this scheme generally requires that template declarations, template definitions, and any argument types used to instantiate a class or function template appear in separate, related header files.

The Edison Design Group has developed another approach to link-time instantiation.7 In this approach, the compiler records where template instantiations are used and where they can be instantiated. At link time, a prelinker assigns template instantiations by recording the assignments in a specially generated file that corresponds to the particular source file that can successfully instantiate the user's request. Compiling and prelinking the program used in Figure 2 generates an instantiation assignment file for main.cxx. This file contains information concerning the command-line options specified, the user's current working directory, and a list of instantiations that should be instantiated. Main.cxx now owns the responsibility of instantiating perform_some_function. The prelinker recompiles the source files, such as main.cxx, that have changes in their template instantiation assignments. The process is repeated until there are no changes made to the instantiation assignments. Then the final link can be completed.

This approach has the advantage of requiring no special file structure to support automatic template instantiation. It is generally faster and simpler than McCluskey's approach, because fewer files are compiled in the generation of the needed instantiations and the instantiations are generated in the context of the user's source code. In addition, the assignment of instantiations to source files can be preserved between recompilations of the source code, so that unless the structure of the application changes, the needed instantiations will be available without additional recompilation.

// C_class.hxx:
class C {
public:
    // ...
};

// template.hxx
template <class T> void perform_some_function (T &param);

// template.cxx
template <class T> void perform_some_function (T &param) { }

// main.cxx
#include "C_class.hxx"
#include "template.hxx"

int main () {
    C c;
    perform_some_function (c);
    return 0;
}

Figure 2
Example of a Link-time Instantiation Scheme (McCluskey)

Comparison of Manual and Automatic Instantiation Techniques

The manual instantiation techniques require planning on the part of the user to ensure that needed instantiations are present, that no extraneous instantiations are generated, and that each needed instantiation appears exactly once within the application. With manual instantiation, the user has the advantage of gaining explicit control over all template instantiations. Although the strategy of instantiating used templates locally requires less planning, it does so at the cost of object file size and the restricted use of templates when static data members are present or when static data is defined locally within a function template instantiation.

Automatic template instantiation provides template instantiation with no explicit action on the part of the user. Compile-time instantiation requires either specific linker support to select a single template instantiation from potentially many candidates, or support by the compiler to generate instantiations in separate object files while compiling the user's source code. Relying on linker support allows the compiler to efficiently generate instantiations at the cost of larger object files; however, the user loses control over which instantiation is used in the executable file. Although the use of separate instantiation object files usually takes more time at compilation than the linker-support method, it results in more compact object files and can provide the user with more control over which instantiation is used in the executable file.

Link-time instantiation provides template instantiation that is tailored to the needs of the executable file. The primary cost is link-time performance, since generation of instantiations occurs at link time. Another disadvantage of link-time instantiation can be observed when building object-code libraries. Either the library must contain all the instantiations that it requires, or the user who wants to link with the library must have access to all the machinery to create instantiations. Creating a library's instantiations involves extra steps during library construction. All the object files to be included in the library must be prelinked, so that the needed instantiations are generated. If instantiations are included in the individual object files in the library, as in the Edison Design Group approach, unintended modules may be linked from the library to provide the needed instantiations. Consider the following scenario, in which object files A and B are included in the library. Both files require the instantiation of perform_some_function. When these files are prelinked, the instantiation of perform_some_function is assigned to one of the files, say A. If an application that is being linked against the library requires that the object file B be linked into the executable, then the object file A is also linked. Here the instantiation needed by B was instantiated in A even though the executable never referenced anything explicitly defined in file A. This can yield an unnecessarily large executable.

In the next section, we review the template instantiation support in earlier versions of DIGITAL C++ and then discuss the rationale and design of the automatic template instantiation facility in version 6.0 of DIGITAL C++.

DIGITAL C++ Template Instantiation Experience

As the use of C++ templates has grown, DIGITAL C++ has been enhanced to support the need for improved instantiation techniques. The initial release of DIGITAL C++ occurred before the C++ standardization process had matured, so that the language supported was based on The Annotated C++ Reference Manual, referred to as the ARM.8 The ARM defined template functionality, but it did not provide guidance for either manual or automatic template instantiation. Thus it was necessary to provide a DIGITAL C++-specific mechanism for template instantiation.

DIGITAL C++ Manual Template Instantiation

The #pragma define_template directive and the instantiate all command-line option, -define_templates, have been supported since the initial release of DIGITAL C++.

In Figure 4, the define_template pragma directs the compiler to instantiate the class template C with type int. When the compiler detects the use of the pragma, it creates an internal C<int> type node and traverses the list of static data members and member functions defined within the class. If the definitions of these members are present at the point the pragma is specified, the compiler materializes each with type int.

As the C++ language developed and template usage increased, users found manual template instantiation to be very labor intensive and requested an automated method.

DIGITAL C++ Version 5.3 Automatic Template Instantiation

Automatic template instantiation capability became a serious issue during the planning stages of DIGITAL C++ version 5.3. The use of templates was increasing rapidly, and many new third-party libraries, such as Rogue Wave Software's Tools.h++, contained a significant use of templates. Due to this growing need, the requirements were straightforward. The support had to be easy to use, have a short design phase, be quickly implementable on both the DIGITAL UNIX and the OpenVMS platforms, and provide reasonable performance. Because McCluskey's approach had been used in several implementations, it presented itself as our best option.

26 Digital Technical Journal Vol. 10 No. 1 1998

template <class T> class C {
public:
    void mem_func1(T p);
    void mem_func2(T p);
};

template <class T> void C<T>::mem_func1(T p)
{ // ...
}

template <class T> void C<T>::mem_func2(T p)
{ // ...
}

#pragma define_template C<int>
Figure 4 The define_template Pragma
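The define_template pragma is DIGITAL-specific, but its effect corresponds to what ISO C++ later standardized as an explicit instantiation definition. The following sketch is not the paper's Figure 4; the class and its member bodies are made trivial so the example is self-contained and runnable:

```cpp
#include <cassert>

template <class T>
class C {
public:
    void mem_func1(T p) { last = p; }
    void mem_func2(T p) { last = p * 2; }
    T last{};
};

// ISO C++ equivalent of "#pragma define_template C<int>": an explicit
// instantiation definition, which materializes every member of C<int>
// in this translation unit.
template class C<int>;
```

As with the pragma, exactly one translation unit in the program would contain such a directive for each specialization the program needs.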

DIGITAL made two major changes to McCluskey's approach to take advantage of the DIGITAL C++ compiler design. First, we allowed instantiation source files to be created at compile time instead of link time. This eliminated the need for McCluskey's name-mapping file and simplified the prelinking process considerably. Since the needed source files existed in the repository, there was no need to deconstruct the required template instantiations to determine their arguments and types.

The second change addressed the transitive closure problem. Figure 5 shows an example of the class template Buffer being instantiated with the user-defined type C. After compilation of app.cxx with the McCluskey approach, the name-mapping file contained definition locations of class B and class C. However, it did not contain any indication that class C had a data member that relied on the definition of class B. From the information in the name-mapping file, the prelinker then created an instantiation source file that included only C_class.hxx, Buffer.hxx, and Buffer.cxx. When this instantiation source file was compiled, an error resulted complaining that B is an undefined type whose size is unknown. We solved this problem in DIGITAL C++ version 5.3 by including all the top-level header files included by the current compilation unit in any instantiation source files created. This ensured that B_class.hxx would be included in the generated instantiation file.

// B_class.hxx
class B { // ...
};

// C_class.hxx
class C {
    B data_mem;
public:
    // ...
};

// Buffer.hxx
template <class T> class Buffer {
    T *buffer;
    int num_of_items;
public:
    void add_item(T *);
    // ...
};

// Buffer.cxx
template <class T> void Buffer<T>::add_item(T *p)
{
}

// app.cxx
#include "B_class.hxx"
#include "C_class.hxx"
#include "Buffer.hxx"

void f(void)
{
    C c;
    Buffer<C> c_buffer;
    c_buffer.add_item(&c);
}

Figure 5  Instantiation of the Class Template Buffer
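The failure can be reproduced outside the compiler's prelinking machinery: a translation unit that materializes Buffer<C> must see the complete definition of B, because C holds a B by value. Below is a self-contained sketch of what the corrected instantiation unit effectively contains, with the Figure 5 headers inlined; the constructor and count() member are our additions so the sketch is runnable:

```cpp
// From B_class.hxx — must come first, or C (and hence Buffer<C>)
// cannot be completed.
class B { /* ... */ };

// From C_class.hxx — holds a B by value, so B must be complete.
class C {
    B data_mem;
public:
    // ...
};

// From Buffer.hxx.
template <class T> class Buffer {
    T *buffer;
    int num_of_items;
public:
    Buffer() : buffer(nullptr), num_of_items(0) {}   // added for the sketch
    void add_item(T *);
    int count() const { return num_of_items; }       // added for the sketch
};

// From Buffer.cxx.
template <class T> void Buffer<T>::add_item(T *p) {
    buffer = p;
    ++num_of_items;
}

// The instantiation the prelinker was asked to generate.
template class Buffer<C>;
```

Dropping the definition of B from the top of this unit reproduces exactly the "undefined type whose size is unknown" diagnostic described above.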

Despite the fact that this type of automatic link-time instantiation scheme was being widely used in the industry, the results of using a modified McCluskey approach were mixed. Stroustrup has described the general problems with McCluskey's approach.9 We found that our implementation suffered particularly from poor link-time performance and so did not satisfy our users' needs.

DIGITAL C++ Version 6.0 Automatic Template Instantiation

DIGITAL C++ version 6.0 is a complete reimplementation of DIGITAL C++, with emphasis on ANSI C++ conformance. It is implemented using a completely new code base, which includes the industry-standard C++ front end from the Edison Design Group and a standard class library from Rogue Wave.

From our experience with template instantiation in DIGITAL C++ versions 5.3 through 5.6, we concluded that the most important issue that should be addressed in the design and implementation of the automatic template instantiation facility was the compile- and link-time performance. The primary goal was to have the performance of automatic template instantiation substantially exceed the performance of version 5.6. Another important goal was to remove the restriction of template declaration and definition placement in header files. In addition, the automatic template instantiation facility in version 6.0 had to be culturally compatible with the previous implementation. The user had to be able to move sources and objects to different directories, easily build archived and shared libraries, share instantiations between various applications, and have error diagnostics reported at the earliest possible moment in the instantiation process.

Design and Implementation

We decided to use a compile-time instantiation model as the basis for our implementation. Since we were using the Edison Design Group's front end, we seriously considered using their link-time model. However, the compile-time model seemed advantageous for several reasons. First, there are significant complications (as described in the section Comparison of Manual and Automatic Instantiation Techniques) when trying to build libraries with a compiler that uses the Edison Design Group link-time model. In addition, the link-time model requires recompilations that limit performance in many typical cases of template use. We recognized that the link-time model could provide better performance in some cases, but these would be in the minority. Finally, the implementation of the link-time model would require substantially more implementation effort on the OpenVMS platform: the version of the Edison Design Group front end being used relied on searching the user's object files for information concerning which modules could instantiate requested templates, and similar functionality would need to be implemented for the OpenVMS platform.

We preserved the concept of the template repository as a directory that contains the individual template instantiation object files. The repository stores one object file for each template function, member function, static data member, and virtual table that is generated by automatic template instantiation. The file name of the instantiation object file is derived from the name of the instantiation's external name. At compile time, the front end generates intermediate code for all templates that are needed in the compilation unit and can be instantiated. A tree walk is performed over the intermediate code to find all entities that are needed by each generated template instantiation. The code generator is called to generate code for the user-specified object file and is then called repeatedly for each template instantiation to generate the instantiation object files in the repository.

The compiler generally considers an instantiation to be needed when it is referenced from a context that is itself needed, such as in a function with global visibility or by the initialization of a variable that is needed. Virtual member functions are needed when a constructor for the class is needed. Thus, all virtual function definitions should be visible in a compilation unit that requires a constructor for the class. Each instantiation that is generated with automatic instantiation is marked as potentially being in its own object file in the repository.

The intermediate representation of each generated instantiation is walked to determine what other entities it references. At this point, the instantiation is a candidate to be generated in its own object file, but it can sometimes be generated as part of the user-specified object file. If the instantiation references an entity that is local to the compilation unit, such as a static function, and that local entity is nonconstant and statically initialized, the instantiation is merged into the user-specified object file rather than generated in its own object file. As an alternative, we could have chosen to change the local entity into a global entity with a unique name and generate the instantiation in its own object file. We chose not to do this in order to make it easier to share a repository between applications. With this alternative, the instantiation in the repository requires the object file containing the local entity's definition, which may be in another application. Note that any application that contains more than one definition of the same instantiation that references a nonconstant local entity is a nonstandard-conforming application. This is a violation of the one definition rule.10 Consider the following code fragment:

static int j;

template <class T> T time(T t)
{ // ... references j ...
}
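The hazard is easy to see in a small sketch. The paper's fragment shows only the static variable and a template function named time that refers to it; here the function is renamed and given an illustrative body (our assumptions, not the original figure) to avoid colliding with the C library's time():

```cpp
// Internal linkage: every translation unit that includes this gets
// its own distinct j.
static int j = 3;

// Any template that reads or writes j is welded to this translation
// unit's copy of j, so an instantiation generated from here cannot be
// placed in a shared repository object file without risking an ODR
// violation.
template <class T>
T scale(T v) {
    return v * j;
}
```

This is why the compiler merges such instantiations into the user-specified object file instead of generating them into the repository.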

The reference to the static variable j in the template function, time, prevents the template from being generated into its own object file in the repository.

When the individual instantiations are walked, we mark each global entity that is defined in the compilation unit so that the definition is replaced by an external reference when the instantiation object file is generated. Consider the following code fragment:

void print_count(const char *s, int ivar)
{
    cout << s << ": " << ivar;
}

template <class T> void func(T arg)
{
    static int count = 0;
    print_count("count", count++);
}

The function, print_count, is defined in the source file and generated as a defined function in the user-specified object file. The template function, func, references the function, print_count. When the code for func is generated in its own object file, the reference to print_count must be changed from a reference to a defined function to a reference to an external function.

By default, each needed instantiation is generated by every compilation that requires the instantiation. This is the safe default because it ensures that instantiations in the repository are up to date. However, there will probably be some compilation overhead from regenerating instantiations that may already be up to date. We believed that the overhead of regenerating instantiations would typically be relatively small. For applications with a high overhead of instantiation, such as a large number of source files using the same large number of template instantiations, we provided a compilation option to control the generation of template instantiations to improve compile-time performance.

The generation of instantiation object files only when they are actually required is a difficult problem. Fine-grain dependency information would have to be kept for each instantiation object file. Such dependency information would need to reflect those files that are required to successfully generate the instantiation and record which command-line options the user specified to the compiler. We suspected that the overhead involved with gathering and checking the information might be an appreciable percentage of the time it would take to do the instantiation, and thus it would not give us the performance improvement that we wanted.

Instead, we decided to provide an option that allows the user to decide when instantiations are generated. We refer to this as the template time-stamp option, -timestamp. When using the time-stamp option, the compiler looks in the repository for a file named TIMESTAMP. If the file is not found, it is created. The modification time of this file is referred to as the time stamp. When generating an instantiation, the compiler looks in the repository to see if the instantiation object file exists. If it does not exist, it is generated. If the file already exists, its modification time is compared to the time stamp. If the modification time is later than the time stamp, the instantiation is assumed to be up to date and is not regenerated. Otherwise, the instantiation is generated. The user can control the generation of instantiation object files by changing the modification time of the TIMESTAMP file.

The time-stamp option would typically be used in a makefile or a shell script that compiles and builds an entire application. Before invoking make or the shell script, the user would make certain that no TIMESTAMP file resided in the repository. This would ensure that each needed instantiation would be generated exactly once during all the compilations done by the build procedure.

Much of the C++ linker support in version 5.6 was reused with only minor modifications for version 6.0. The compiler is presented with a single repository into which the instantiation object files are written. Multiple repositories can be specified at link time, and each can be searched for instantiations that are needed by the executable file. The linker is used in a trial link mode to generate a list of all the unresolved external references. This list is then used to search the repositories to find the needed instantiation files, and the process is repeated until no more instantiations are needed or can be satisfied from the repository. The link then proceeds as any normal link, adding the list of instantiation object files to the list of object files and libraries as specified by the user.

If a vendor is creating a library rather than an executable file, the instantiations needed by the modules in the library can be provided in either of two ways: (1) The library vendor can put the needed instantiations in the library by adding the files in the repository to the library file. (2) The library vendor can provide the repository with the library and require that library users link with the repository as well. Note that instantiations placed in the library are fixed when the library is created. Since the library is included in the trial link of an application, any instantiation in the library takes precedence over the same named instantiation in a repository.

Results

In a number of tests, DIGITAL C++ version 6.0 showed improved performance over version 5.6. We tested a variety of user code samples that use templates to varying degrees and found that build times for version 6.0 decreased substantially compared to the version 5.6 compiler. Examples of two typical C++ applications used in our tests are the publicly available EON ray-tracing benchmark and a subset of tests from our Standard Template Library (STL) test suite. For
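The time-stamp rule is simple enough to restate in code. The sketch below is our reconstruction of the decision described above, not the compiler's source; the function name and the use of std::filesystem are assumptions:

```cpp
#include <filesystem>
#include <fstream>
#include <chrono>
#include <cassert>

namespace fs = std::filesystem;

// Reconstruction of the -timestamp decision rule: an instantiation
// object file is regenerated unless it already exists and is newer
// than the repository's TIMESTAMP file. (The real compiler creates
// TIMESTAMP when it is missing; here we simply report "regenerate".)
bool needs_regeneration(const fs::path& repository, const fs::path& obj_name) {
    const fs::path stamp = repository / "TIMESTAMP";
    const fs::path obj = repository / obj_name;
    if (!fs::exists(obj)) return true;    // never generated
    if (!fs::exists(stamp)) return true;  // no time stamp yet: play safe
    // Modification time later than the time stamp => assumed up to date.
    return fs::last_write_time(obj) <= fs::last_write_time(stamp);
}
```

Deleting TIMESTAMP before a full build, as the text recommends, makes every instantiation regenerate exactly once and then be reused for the rest of the build.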

the EON benchmark, the build time for version 6.0 was reduced to 28 percent of the build time for version 5.6. For the STL tests, the build time for version 6.0 was reduced to 19 percent of the build time for version 5.6. The number of files in the repository also decreased significantly because version 6.0 generates only instantiation object files instead of the instantiation source, command, dependency, and object files of version 5.6. For EON, the version 6.0 repository contained 88 files compared to 260 files in version 5.6.

Using the time-stamp option, build time for the EON benchmark was reduced by only 5 percent compared to the default instantiation strategy. The real benefit of the time-stamp option comes with applications that use the same template instantiations in many compilation units. For example, in one user's test case, build times dropped from roughly 18 hours with the default instantiation to 3 hours when using the time-stamp option.

In the next section, we conclude our paper with a discussion of further work that can improve the performance and usability of automatic template instantiation.

Future Research

We continue to investigate approaches and techniques to improve the usability and performance of the automatic template instantiation facility. Optimal usability and performance would seem to require a development environment completely integrated for C++. This environment would keep track of all entity definitions and usage, as well as the structure of the files used to generate the instantiation. For example, if the user specified an include directory of old_include on the initial compilation and later specified an include directory of new_include, this approach would not recognize that different files were being included.

Another approach to improving application build performance is to support a build facility that can make use of template information in determining dependency. Currently, each user-specified object file is dependent on all the included files necessary to create instantiation object files for template requests. When a change is made to a template definition, all the sources that reference the template need to be recompiled. A build facility designed to be sensitive to template instantiation could detect that a change in the template definition was limited to the instantiation object file. It could then instruct the compiler to suppress the regeneration of object files for source files that are only being recompiled due to the change in the template instantiation. Such a facility could also suppress the recompilation of any source file that would only reproduce the changes to instantiations that were already regenerated.

Because we recognize that link-time instantiation can perform better in some cases than the compile-time approach, we are investigating the link-time instantiation model as a user option.

Finally, we continue to look at ways to reduce the cost of generating each instantiation. For example, by default the compiler compresses the generated object files.

In addition, version 6.0 provides a substantial improvement in performance of template instantiation over version 5.6 and reduces the restrictions on the location of template declarations and definitions. We continue to investigate the template-instantiation implementation to further improve compile- and link-time performance and ease of use.

Acknowledgment

The authors wish to acknowledge Bevin Brett, who contributed substantially to the design and implementation of the needed walk and instantiation object file generation for DIGITAL C++ version 6.0, and Hemant Rotithor, who provided the performance measurements for DIGITAL C++ version 6.0 versus version 5.6. The authors also wish to acknowledge Charlie Mitchell, Coleen Phillimore, Rich Phillips, and Harold Seigel for their contributions to the design and implementation of the DIGITAL C++ automatic template instantiation.

References

1. ISO/IEC Standard 14882, C++, 1998.

2. B. Stroustrup, The C++ Programming Language, Third Edition (Reading, Mass.: Addison-Wesley, 1997).

3. Microsoft Visual C++ 5.0, On-line Help, "Templates, C++."

4. Microsoft Corporation, "Microsoft Portable Executable and Common Object File Format Specification," Revision 5.0, Section 5.5.6, Microsoft Developer's Network (October 1997).

5. G. McCluskey, "An Environment for Template Instantiation," The C++ Report, vol. 4, no. 2 (1992).

6. G. McCluskey and R. Murray, "Template Instantiation for C++," SIGPLAN Notices, vol. 27, no. 12 (1992): 47-56.

7. Edison Design Group, "Template Instantiation in the EDG C++ Front End," Note to the ANSI C++ Committee, X3J16/95-0163, WG21/N0763.

8. M. Ellis and B. Stroustrup, The Annotated C++ Reference Manual (Reading, Mass.: Addison-Wesley, 1990).

9. B. Stroustrup, The Design and Evolution of C++ (Reading, Mass.: Addison-Wesley, 1994): 366.

10. B. Stroustrup, The C++ Programming Language, Third Edition (Reading, Mass.: Addison-Wesley, 1997): 203-205.

Biographies

Avrum E. Itzkowitz
Avrum Itzkowitz was a contractor/consultant at DIGITAL from September 1995 through December 1997. During that time, he worked as part of the DIGITAL C++ development team, designing and implementing much of the support for the automatic template instantiation facility in DIGITAL C++ version 6.0. Avrum also designed and implemented template instantiation tests. He is currently a senior software architect engineer at GTE Internetworking. He holds a B.S. (1972) in electrical engineering from Northwestern University and M.S. (1976) and Ph.D. (1979) degrees in computer science from the University of Illinois. Avrum is a member of the ACM, the IEEE Computer Society, and SIGPLAN.

Lois D. Foltan
Lois Foltan is a principal software engineer at Compaq. Her areas of expertise include support for C++ automatic template instantiation and the DIGITAL C++ object model. She was a member of the DEC C/C++ compiler team for eight years. During that time, she contributed to the first GEM-based DEC C and DEC C++ compilers. Recently, she joined the Digital Java team. Lois received a B.S. in computer science from the University of Vermont in 1988.

Hemant G. Rotithor, Kevin W. Harris, and Mark W. Davis
Measurement and Analysis of C and C++ Performance

As computer languages and architectures evolve, many more challenges are being presented to compilers. Dealing with these issues in the context of the Alpha Architecture and the C and C++ languages has led Compaq's C and C++ compiler engineering teams to develop a systematic approach to monitor and improve compiler performance at both run time and compile time. This approach takes into account five major aspects of product quality: function, reliability, performance, time to market, and cost. The measurement framework defines a controlled test environment, criteria for selecting benchmarks, measurement frequency, and a method for discovering and prioritizing opportunities for improvement. Three case studies demonstrate the methodology, the use of measurement and analysis tools, and the resulting performance improvements.

Optimizing compilers are becoming ever more complex as languages, target architectures, and product features evolve. Languages contribute to compiler complexity with their increasing use of abstraction, modularity, delayed binding, polymorphism, and source reuse, especially when these attributes are used in combination. Modern processor architectures are evolving ever greater levels of internal parallelism in each successive generation of processor design. In addition, product feature demands such as support for fast threads and other forms of external parallelism, integration with smart debuggers, memory use analyzers, performance analyzers, smart editors, incremental builders, and feedback systems continue to add complexity. At the same time, traditional compiler requirements such as standards conformance, compatibility with previous versions and competitors' products, good compile speed, and reliability have not diminished.

All these issues arise in the engineering of Compaq's C and C++ compilers for the Alpha Architecture. Dealing with them requires a disciplined approach to performance measurement, analysis, and engineering of the compiler and libraries if consistent improvements in out-of-the-box and peak performance on Alpha processors are to be achieved. In response, several engineering groups working on Alpha software have established procedures for feature support, performance measurement, analysis, and regression testing. The operating system groups measure and improve overall system performance by providing system-level tuning features and a variety of performance analysis tools. The Digital Products Division (DPD) Performance Analysis Group is responsible for providing official performance statistics for each new processor measured against industry-standard benchmarks, such as SPECmarks published by the Standard Performance Evaluation Corporation and the TPC series of transaction processing benchmarks from the Transaction Processing Performance Council. The DPD Performance Analysis Group has established rigorous methods for analyzing these benchmarks and provides performance regression testing for new software versions.
performance measurement, analysis, and engineering of the compiler and libraries ifconsistent improvements in out-of-the-box and peak performance on Alpha proces­ sors are to be achieved. In response, several engineering groups working on Alpha software have established procedures fo r fe ature support, performance measure­ ment, analysis, and regression testing. The operating system groups measure and improve overall system performance by providing system-level tuning fe atures and a variety of performance analysis tools. The Digital Products Division (DPD) Performance Analysis Group is responsible fo r providing official performance statistics fo r each new processor mea­ sured against industry-standard benchmarks, such as SPECmarks published by the Standard Performance Evaluation Corporation and the TPC series of transac­ tion processing benchmarks fr om the Transaction Processing Performance Council . The DPD Performance Analysis Group has established rigorous methods fo r analyzing these benchmarks and provides perfor­ mance regression testing fo r new software versions.

Similarly, the Alpha compiler back-end development group (GEM) has established performance improvement and regression testing procedures for SPECmarks; it also performs extensive run-time performance analysis of new processors, in conjunction with refining and developing new optimization techniques. Finally, consultants working with independent software vendors (ISVs) help the ISVs port and tune their applications to work well on Alpha systems.

Although the effort from these groups does contribute to competitive performance, especially on industry-standard benchmarks, the DEC C and C++ compiler engineering teams have found it necessary to independently monitor and improve both run-time and compile-time performance. In many cases, ISV support consultants have discovered that their applications do not achieve the performance levels expected based on industry-standard benchmarks. We have seen a variety of causes: New language constructs and product features are slow to appear in industry benchmarks, thus these optimizations have not received sufficient attention. Obsolete or obsolescent source code remaining in the bulk of existing applications causes default options/switches to be selected that inhibit optimizations. Many of the most important optimizations used for exploiting internal parallelism make assumptions about code behavior that prove to be wrong. Bad experiences with compiler bugs induce users to avoid optimizations entirely. Configuration and source-code changes made just before a product is released can interfere with important optimizations.

For all these reasons, we have used a systematic approach to monitor, improve, and trade off five major aspects of product quality in the DEC C and DIGITAL C++ compilers. These aspects are function, reliability, performance, time to market, and cost. Each aspect is chosen because it is important in isolation and because it trades off against each of the other aspects. The objective of this paper is to show how the one characteristic of performance can be improved while minimizing the impact on the other four aspects of product quality.

In this paper, we do not discuss any individual optimization methods in detail; there is a plethora of literature devoted to these topics, including a paper published in this journal.5 Nor do we discuss specific compiler product features needed for competitive support on individual platforms. Instead, we show how the efforts to measure, monitor, and improve performance are organized to minimize cost and time to market while maximizing function and reliability. Since all these product aspects are managed in the context of a series of product releases rather than a single release, our goals are frequently expressed in terms of relationships between old and new product versions. For example, for the performance aspects, goals along the following lines are common:

• Optimizations should not impose a compile-speed penalty on programs for which they do not apply.
• The use of unrelated compiler features should not degrade optimizations.
• New optimizations should not degrade reliability.
• New optimizations should not degrade performance in any applications.
• Optimizations should not impose any nonlinear compile-speed penalty.
• No application should experience run-time speed regressions.
• Specific benchmarks or applications should achieve specific run-time speed improvements.
• The use of specific new language features should not introduce compile-speed or run-time regressions.

In the context of performance, the term measurement usually refers to crude metrics collected during an automated script, such as compile time, run time, or memory usage. The term analysis, in contrast, refers to the process of breaking down the crude measurement into components and discovering how the measurement responds to changing conditions. For example, we analyze how compile speed responds to an increase in available physical memory. Often, a comprehensive analysis of a particular issue may require a large number of crude measurements. The goal is usually to identify a particular product feature or optimization algorithm that is failing to obey one of the product goals, such as those listed above, and repair it, replace it, or amend the goal as appropriate. As always, individual instances of this approach are interesting in themselves, but the goal is to maximize the overall performance while minimizing the impact on development cost, new feature availability, reliability, and time to market for the new version.

Although some literature1-4 discusses specific aspects of analyzing and improving performance of C and C++ compilers, a comprehensive discussion of the practical issues involved in the measurement and analysis of compiler performance has not been presented in the literature to our knowledge. In this paper, we provide a concrete background for a practitioner in the field of compilation-related performance analysis.

In the next section, we describe the metrics associated with the compiler's performance. Following that, we discuss an environment for obtaining stable performance results, including appropriate benchmarks, measurement frequency, and management of the results. Finally, we discuss the tools used for performance measurement and analysis and give examples of the use of those tools to solve real problems.

Digital Technical Joumal Vo l. lO No. l 1998 33 Performance Metrics and fo r instantiation of templates. In addition, fo r debug versions of the re sult files, it is essential to In our experience, ISVs :m d end users are most inter­ finda way to suppress repeated descri ptions of the ested in the fo llowing performance metrics: type information fo r varia bles i n m ulti ple modu les .

• Function. Although function is not usually considered an aspect of performance, new language and product features are entirely appropriate to consider among potential performance improvements when trading off development resources. From the point of view of a user who needs a particular feature, the absence of that feature is indistinguishable from an unacceptably slow implementation of that feature.

• Reliability. Academic papers on performance seldom discuss reliability, but it is crucial. Not only is an unreliable optimization useless, often it prejudices programmers against using any optimizations, thus degrading rather than enhancing overall performance.

• Application absolute run time. Typically, the absolute run time of an application is measured for a benchmark with specific input data. It is important to realize, however, that a user-supplied benchmark is often only a surrogate for the maximum application size.

• Maximum application size. Often, the end user is not trying to solve a specific input set in the shortest time; instead, the user is trying to solve the largest possible real-world problem within a specific time. Thus, trends (e.g., memory bandwidth) are often more important than absolute timings. This also implies that specific benchmarks must be retired or upgraded when processor improvements moot their original rationale.

• Price/Performance ratio. Often, the most effective competitor is not the one who can match our product's performance, but the one who can give acceptable performance (see above) with the cheapest solution. Since compiler developers do not contribute directly to server or workstation pricing decisions, they must use the previous metrics as surrogates.

• Compile speed. This aspect is primarily of interest to application developers rather than end users. Compile speed is often given secondary consideration in academic papers on optimization; however, it can make or break the decision of an ISV considering a platform or a development environment. Also, for C++, there is an important distinction between ab initio build speed and incremental build speed, due to the need for template instantiation.

• Result file size. Both the object file and executable file sizes are important. This aspect was not a particular problem with C, but several language features of C++ and its optimizations can lead to explosive growth in result file size. The most obvious problems are the need for extensive function inlining

• Compiler dynamic memory use. Peak usage, average usage, and pattern of usage must be regulated to keep the cost of a minimum development configuration low. In addition, it is important to ensure that specific compiler algorithms or combinations of them do not violate the usage assumptions built into the paging system, which can make the system unusable during large compilations.

Crude measurements can be made for all or most of these metrics in a single script. When attempting to make a significant improvement in one or more metrics, however, the change often necessarily degrades others. This is acceptable, as long as the only cases that pay a penalty (e.g., in larger dynamic memory use) are the compilations that benefit from the improved run-time performance.

As the list of performance metrics indicates, the most important distinction is made between compile-time and run-time metrics. In practice, we use automated scripts to measure compile-time and run-time performance on a fairly frequent (daily or weekly during development) basis.

Compile-Time Performance Metrics

To measure compile-time performance, we use four metrics: compilation time, size of the generated objects, dynamic memory usage during compilation, and template instantiation time for C++.

Compilation Time The compilation time is measured as the time it takes to compile a given set of sources, typically excluding the link time. The link time is excluded so that only compiler performance is measured. This metric is important because it directly affects the productivity of a developer. In the C++ case, performance is measured ab initio, because our product set does not support incremental compilation below the granularity of a whole module. When optimization of the entire program is attempted, this may become a more interesting issue. The UNIX shell timing tools make a distinction between user and system time, but this is not a meaningful distinction for a compiler user. Since compilation is typically CPU intensive and system time is usually modest, tracking the sum of both the user and the system time gives the most realistic result. Slow compilation times can be caused by the use of O(n²) algorithms in the optimization phases, but they can also frequently be caused by excessive layering or modularity due to code reuse or excessive growth of the in-memory representation of the program during compilation (e.g., due to inlining).
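The user-plus-system timing convention described above is straightforward to implement on a POSIX system. The sketch below is an illustration, not the article's actual measurement harness: it sums ru_utime and ru_stime from getrusage for the current process; a real compile-time driver would fork/exec the compiler under test and read RUSAGE_CHILDREN instead of RUSAGE_SELF.

```cpp
// Sketch: "compile time" as user + system CPU seconds, the sum the
// article recommends tracking rather than user time alone.
#include <sys/resource.h>

// Combined user + system CPU seconds consumed so far.
// `who` is RUSAGE_SELF here; a driver timing child compilations
// would pass RUSAGE_CHILDREN after waiting for the compiler to exit.
inline double cpu_seconds(int who = RUSAGE_SELF) {
    rusage ru{};
    if (getrusage(who, &ru) != 0) return -1.0;
    auto sec = [](const timeval& tv) {
        return tv.tv_sec + tv.tv_usec / 1e6;
    };
    return sec(ru.ru_utime) + sec(ru.ru_stime);  // user + system
}
```

A measurement script would sample this before and after each of the repeated compilations and report the averaged difference.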

34 Digital Technical Journal Vol. 10 No. 1 1998

Size of Generated Objects Excessive size of generated objects is a direct contributor to slow compile and link times. In addition to the obvious issues of inlining and template instantiation, duplication of the type and naming information in the symbolic debugging support has been a particular problem with C++. Compression is possible and helps with disk space, but this increases link time and memory use even more. The current solution is to eliminate duplicate information present in multiple modules of an application. This work requires significant support in both the linker and the debugger. As a result, the implementation has been difficult.

Dynamic Memory Usage during Compilation Usually modern compilers have a multiphase design whereby the program is represented in several different forms in dynamic memory during the compilation process. For C and C++ optimized compilations, this involves at least the following processes:

• Retrieving the entire source code for a module from its various headers
• Preprocessing the source according to the C/C++ rules
• Parsing the source code and representing it in an abstract form with semantic information embedded
• For C++, expanding template classes and functions into their individual instances
• Simplifying high-level language constructs into a form acceptable to the optimization phases
• Converting the abstract representation to a different abstract form acceptable to an optimizer, usually called an intermediate language (IL)
• Expanding some low-level functions inline into the context of their callers
• Performing multiple optimization passes involving annotation and transformation of the IL
• Converting the IL to a form symbolically representing the target machine language, usually called code generation
• Performing scheduling and other optimizations on the symbolic machine language
• Converting the symbolic machine language to actual object code and writing it onto disk

In modern C and C++ compilers, these various intermediate forms are kept entirely in dynamic memory. Although some of these operations can be performed on a function-by-function basis within a module, it is sometimes necessary for at least one intermediate form of the module to reside in dynamic memory in its entirety. In some instances, it is necessary to keep multiple forms of the whole module simultaneously.

This presents a difficult design challenge: how do we compile large programs using an acceptable amount of virtual and physical memory? Trade-offs change constantly as memory prices decline and paging algorithms of operating systems change. Some optimizations even have the potential to expand one of the intermediate representations into a form that grows faster than the size of the program (O(n × log(n)), or even O(n²)). In these cases, optimization designers often limit the scope of the transformation to a subset of an individual function (e.g., a loop nest) or use some other means to artificially limit the dynamic memory and computation requirements. To allow additional headroom, upstream compiler phases are designed to eliminate unnecessary portions of the module as early as possible.

In addition, the memory management systems are designed to allow internal memory reuse as efficiently as possible. For this reason, compiler designers at Compaq have generally preferred a zone-based memory management approach rather than either a malloc-based or a garbage-collection approach. A zoned memory approach typically allows allocation of varying amounts of memory into one of a set of identified zones, followed by deallocation of the entire zone when all the individual allocations are no longer needed. Since the source program is represented by a succession of internal representations in an optimizing compiler, a zone-based memory management system is very appropriate.

The main goals of the design are to keep the peak memory use below any artificial limits on the virtual memory available for all the actual source modules that users care about, and to avoid algorithms that access memory in a way that causes excessive cache misses or page faults.

Template Instantiation Time for C++ Templates are a major new feature of the C++ language and are heavily used in the new Standard Library. Instantiation of templates can dominate the compile time of the modules that use them. For this reason, template instantiation is undergoing active study and improvement, both when compiling a module for the first time and when recompiling in response to a source change. An improved technique, now widely adopted, retains precompiled instantiations in a library to be used across compilations of multiple modules.

Template instantiation may be done at either compile time or during link time, or some combination. DIGITAL C++ has recently changed from a link-time to a compile-time model for improved instantiation performance. The instantiation time is generally proportional to the number of templates instantiated, which is based on a command-line switch specification and the time required to instantiate a typical template.
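A zone-based allocator of the kind described above can be sketched in a few lines. This is an illustrative arena, not the GEM implementation: allocations of varying size are carved out of large blocks belonging to a zone, and the entire zone is released in one step when the intermediate representation it holds is no longer needed.

```cpp
// Minimal zone (arena) allocator sketch: per-object frees are never
// issued; destroying the Zone releases everything at once, matching
// the lifetime of one intermediate representation.
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

class Zone {
    static constexpr std::size_t kBlockSize = 64 * 1024;
    std::vector<void*> blocks_;        // owned blocks, freed together
    std::size_t used_ = kBlockSize;    // offset into the current block

public:
    void* allocate(std::size_t n) {
        n = (n + 15) & ~std::size_t(15);           // 16-byte alignment
        if (used_ + n > kBlockSize) {              // need a fresh block
            std::size_t sz = (n > kBlockSize) ? n : kBlockSize;
            void* b = std::malloc(sz);
            if (!b) throw std::bad_alloc();
            blocks_.push_back(b);
            used_ = 0;
        }
        void* p = static_cast<char*>(blocks_.back()) + used_;
        used_ += n;
        return p;
    }

    // Deallocate the entire zone in one step.
    ~Zone() { for (void* b : blocks_) std::free(b); }
};
```

A compiler built this way would keep one zone per intermediate form (parse tree, IL, code-generation tables) and drop each zone as soon as the next form has been produced.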

Run-Time Performance Metrics

We use automated scripts to measure run-time performance for generated code, the debug image size, the production image size, and specific optimizations triggered.

Run Time for Generated Code The run time for generated code is measured as the sum of user and system time on UNIX required to run an executable image. This is the primary metric for the quality of generated code. Code correctness is also validated. Comparing run times for slightly differing versions of synthetic benchmarks allows us to test support for specific optimizations. Performance regression testing on both synthetic benchmarks and user applications, however, is the most cost-effective method of preventing performance degradations. Tracing a performance regression to a specific compiler change is often difficult, but the earlier a regression is detected, the easier and cheaper it is to correct.

The application developer must always consider the compile-time versus run-time trade-off. In a well-designed optimizing compiler, longer compile times are exchanged for shorter run times. This relationship, however, is far from linear and depends on the importance of performance to the application and the phase of development. During the initial code-development stage, a shorter compile time is useful because the code is compiled often. During the production stage, a shorter run time is more important.

Debug Image Size The size of an image compiled

Specific Optimizations Triggered In a multiphase optimizing compiler, a specific optimization usually requires preparatory contributions from several upstream phases and cleanup from several downstream phases, in addition to the actual transformation. In this environment, an unrelated change in one of the upstream or downstream phases may interfere with a data structure or violate an assumption exploited by a downstream phase and thus generate bad code or suppress the optimizations. The generation of bad code can be detected quickly with automated testing, but optimization regressions are much harder to find.

For some optimizations, however, it is possible to write test programs that are clearly representative and can show, either by some kind of dumping or by comparative performance tests, when an implemented optimization fails to work as expected. One commercially available test suite is called NULLSTONE,6 and custom-written tests are used as well.

In a collection of such tests, the total number of optimizations implemented as a percentage of the total tests can provide a useful metric. This metric can indicate if successive compiler versions have improved and can help in comparing optimizations implemented in compilers from different vendors. The optimizations that are indicated as not implemented provide useful data for guiding future development effort.

Performance measurements give the most reliable and consistent results in a controlled environment. A number of factors other than the compiler performance have the potential of affecting the observed results, and the effect of such perturbations must be minimized. The hardware and software components of the test environment used are discussed below.

Experience has shown that it helps to have a dedicated machine for performance analysis and measurement, because the results obtained on the same machine tend to be consistent and can be meaningfully compared with successive runs. In addition, the external influences can be closely controlled, and versions of system software, compilers, and benchmarks can be controlled without impacting other users.

Several aspects of the hardware configuration on the test machine can affect the resulting measurements. Even within a single family of CPU architectures at comparable clock speeds, differences in specific implementations can cause significant performance changes. The number of levels and the sizes of the on-chip and board-level caches can have a strong effect on performance in a way that depends on algorithms of the application and the size of the input data set. The size and the access speed of the main memory strongly affect performance, especially when the application code or data does not fit into the cache. The activity on a network connected to the test system can have an effect on performance; for example, if the test sources and the executable image are located on a remote disk and are fetched over a network.

Variations in the observed performance may be divided into two parts: (1) system-to-system variations in measurement when running the same benchmark and (2) run-to-run variation on the same system running the same benchmark. Variation due to hardware resource differences between systems is addressed by using a dedicated machine for performance measurement as indicated above. Variation due to network activity can be minimized by closing all the applications that make use of the network before the performance tests are started and by using a disk system local to the machine under test. The variations due to cache and main memory system effects can be kept consistent between runs by using similar setups for successive runs of performance measurement.

In addition to the hardware components of the setup described above, several aspects of the software environment can affect performance. The operating system version used on the test machine should correspond to the version that the users are likely to use on their machines, so that the users see comparable performance. The libraries used with the compiler are usually shipped with the operating system. Using different libraries can affect performance because newer libraries may have better optimizations or new features. The compiler switches used while compiling test sources can result in different optimization trade-offs. Due to the large number of compiler options supported on a modern compiler, it is impractical to test performance with all possible combinations.

To meet our requirements, we used the following small set of switch combinations:

1. Default Mode. The default mode represents the default combination of switches selected for the compiler when no user-selectable options are specified. The compiler designer chooses the default combination to provide a reasonable trade-off between compile speed and run speed. The use of this mode is very common, especially by novices, and thus is important to measure.

2. Debug Mode. In the debug mode, we test the option combination that the programmer would select when debugging. Optimizations are typically turned off, and full symbolic information is generated about the types and addresses of program variables. This mode is commonly specified during code development.

3. Optimize/Production Mode. In the optimize/production mode, we select the option combination for generating optimized code (-O compiler option) for a production image. This mode is most likely to be used in compiling applications before shipping to customers.

We prefer to measure compile speed for debug mode, run speed for production mode, and both speeds for the default mode. The default mode is expected to lose only modest run speed over optimize mode, have good compile speed, and provide usable debug information.

Criteria for Selecting Benchmarks

Specific benchmarks are selected for measuring performance based on the ease of measuring interesting properties and the relevance to the user community. The desirable characteristics of useful benchmarks are

• It should be possible to measure individual optimizations implemented in the compiler.
• It should be possible to test performance for commonly used language features.
• At least some of the benchmarks should be representative of widely used applications.
• The benchmarks should provide consistent results, and the correctness of a run should be verifiable.
• The benchmarks should be scalable to newer machines. As newer and faster machines are developed, the benchmark execution times diminish. It should be possible to scale the benchmarks on the machines, so that useful results can still be obtained without significant error in measurement.

To meet these diverse requirements, we selected a set of benchmarks, each of which meets some of the requirements. We grouped our benchmarks in accordance with the performance metrics, that is, as compile-time and run-time benchmarks. This distinction is necessary because it allows us to fine-tune the contents of the benchmarks under each category. The compile-time and run-time benchmarks may be further classified as (1) synthetic benchmarks for testing the performance of specific features or (2) real applications that indicate typical performance and combine the specific features.

Compile-Time Benchmarks Examples of synthetic compile-time benchmarks include the #define intensive preprocessing test, the array intensive test, the comment intensive test, the declaration processing intensive test, the hierarchical #include intensive test, the printf intensive test, the empty #include intensive test, the arithmetic intensive test, the function definition intensive test (needs a large memory), and the instantiation intensive test.
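Synthetic compile-time benchmarks like those named above can be generated mechanically. The following sketch is hypothetical (the article does not show its test sources): it emits a "#define intensive" translation unit of n macros, so that preprocessing dominates the measured compile time.

```cpp
// Emit the text of a synthetic "#define intensive" benchmark source.
// The generated file declares one dummy variable; all the compiler's
// work is in macro table management during preprocessing.
#include <sstream>
#include <string>

std::string make_define_intensive(int n) {
    std::ostringstream out;
    out << "#define M_BASE 1\n";
    for (int i = 0; i < n; ++i)
        out << "#define M" << i << " (" << i << " + M_BASE)\n";
    out << "int dummy;  // nothing but macros for the compiler proper\n";
    return out.str();
}
```

Scaling n up or down gives the benchmark the scalability property listed among the selection criteria above.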

Real applications used as compile-time benchmarks include selected sources from the C compiler, the DIGITAL UNIX operating system, UNIX utilities such as awk, the X window interface, and C++ class inheritance.

Run-Time Benchmarks Synthetic run-time benchmarks contain tests for individual optimizations for different data types, storage types, and operators. One run-time suite called NULLSTONE6 contains tests for C and C++ compiler optimizations; another test suite called Bench++7 has tests for C++ features such as virtual function calls, exception handling, and abstraction penalty (the Haney kernels test, the Stepanov benchmark, and the OOPACK benchmark8).

Run-time benchmarks of real applications for the C language include some of the SPEC tests that are closely tracked by the DPD Performance Group. For C++, the tests consist of the groff word processor processing a set of documents, the EON ray tracing benchmark, Odbsim (a database simulator from the University of Colorado), and tests that call functions from a search class library.

Acquiring and Maintaining Benchmarks

We have established methods of acquiring, maintaining, and updating benchmarks. Once the desirable characteristics of the benchmarks have been identified, useful benchmarks may be obtained from several sources, notably a standards organization such as SPEC or a vendor such as Nullstone Corporation. The public domain can provide benchmarks such as EON, groff, and Bench++. The use of a public-domain benchmark may require some level of porting to make the benchmark usable on the test platform if the original application was developed for use with a different language dialect, e.g., GNU's gcc.

Sometimes, customers encounter performance problems with a specific feature usage pattern not anticipated by the compiler developers. Customers can provide extracts of code that a vendor can use to reproduce these performance problems. These code extracts can form good benchmarks for use in future testing to avoid reoccurrence of the problem.

Application code such as extracts from the compiler sources can be acquired from within the organization. Code may also be obtained from other software development groups, e.g., the class library group, the debugger group, and the operating system group.

If none of these sources can yield a benchmark with a desirable characteristic, then one may be written solely to test the specific feature or combination.

In our tests of the DIGITAL C++ compiler, we needed to use all the sources discussed above to obtain C++ benchmarks that test the major features of the language. The public-domain benchmarks sometimes required a significant porting effort because of compatibility issues between different C++ dialects. We also reviewed the results published by other C++ compiler vendors.

Maintaining a good set of performance measurement benchmarks is necessary for evolving languages such as C and C++. New standards are being developed for these languages, and standards compatibility may make some of a benchmark's features obsolete. Updating the database of benchmarks used in testing involves

• Changing the source of existing benchmarks to accommodate system header and default behavior changes
• Adding new benchmarks to the set when new compiler features and optimizations are implemented
• Deleting outdated benchmarks that do not scale well to newer machines

In the following subsection, we discuss the frequency of our performance measurement.

Measurement Frequency

When deciding how often to measure compiler performance, we consider two major factors:

• It is costly to track down a specific performance regression amid a large number of changes. In fact, it sometimes becomes more economical to address a new opportunity instead.
• In spite of automation, it is still costly to run a suite of performance tests. In addition to the actual run time and the evaluation time, and even with significant efforts to filter out noise, the normal run-to-run variability can show phantom regressions or improvements.

These considerations naturally lead to two obvious approaches to test frequency:

• Measuring at regular intervals. During active development, measuring at regular intervals is the most appropriate policy. It allows pinpointing specific performance regressions most cheaply and permits easy scheduling and cost management. The interval selected depends on the amount of development (number of developers and frequency of new code check-ins) and the cost of the testing. In our tests, the intervals have been as frequent as three days and as infrequent as 30 days.
• Measuring on demand. Measurement is performed on demand when significant changes occur, for example, the delivery of a major new version of a component or a new version of the operating system. A full performance test is warranted to establish a new baseline when a competitor's product is released or to ensure that a problem has been corrected.

Both strategies, if implemented purely, have problems. Frequent measurement can catch problems early but is

resource intensive, whereas an on-demand strategy may not catch problems early enough and may not allow sufficient time to address discovered problems. In retrospect, we discovered that the time devoted to more frequent runs of existing tests could be better used to develop new tests or analyze known results more fully.

We concluded that a combination strategy is the best approach. In our case all the performance tests are run prior to product releases and after major component deliveries. Periodic testing is done during active development periods. The measurements can be used for analyzing existing problems, analyzing and comparing performance with a competing product, and finding new opportunities for performance improvement.

Managing Performance Measurement Results

Typically, the first time a new test or analysis method is used, a few obvious improvement opportunities are revealed that can be cheaply addressed. Long-term improvement, however, can only be achieved by going beyond this initial success and addressing the remaining issues, which are either costly to implement or which occur infrequently enough to make the effort seem unworthy. This effort involves systematically tracking the performance issues uncovered by the analysis and judging the trends to decide which improvement efforts are most worthwhile.

Our experience shows that rigorously tracking all the performance issues resulting from the analyses provides a long list of opportunities for improvement, far more than can be addressed during the development of a single release. It thus became obvious that, to deploy our development resources most effectively, we needed to devise a good prioritization scheme.

For each performance opportunity on our list, we keep crude estimates of three criteria: usage frequency, payoff from implementation, and difficulty of implementation. We then use the three criteria to divide the space of performance issues into equivalence classes. We define our criteria and estimates as follows:

• Usage frequency. The usage frequency is said to be common if the language feature or code pattern appears in a large fraction of source modules or uncommon if it appears in only a few modules. When the language feature or code pattern appears predominantly in modules for a particular application domain, the usage frequency is said to be skewed. The classic example of skewed usage is the complex data type.

• Payoff from implementation. Improvement in an implementation is estimated as high, moderate, or small. A high improvement would be the elimination of the language construct (e.g., removal of unnecessary constructors in C++) or a significant fraction of its overhead (e.g., inlining small functions). A moderate improvement would be a 10 to 50 percent increase in the speed of a language feature. A small improvement such as loop unrolling is worthwhile because it is common.

• Difficulty of implementation. We estimate the resource cost for implementing the suggested optimization as difficult, straightforward, or easy. Items are classified based on the complexity of design issues, total code required, level of risk, or number and size of testing requirements. An easy improvement requires little up-front design and no new programmer or user interfaces, introduces little breakage risk for existing code, and is typically limited to a single compiler phase, even if it involves a substantial amount of new code. A straightforward improvement would typically require a substantial design component with multiple options and a substantial amount of new coding and testing but would introduce little risk. A difficult improvement would be one that introduces substantial risk regardless of the design chosen, involves a new user interface, or requires substantial new coordination between components provided by different groups.

For each candidate improvement on our list, we assign a triple representing its priority, which is a Cartesian product of the three components above:

Priority = (frequency) x (payoff) x (difficulty)

This classification scheme, though crude and subjective, provides a useful base for resource allocation. Opportunities classified as common, high, and easy are likely to provide the best resource use, whereas those issues classified as uncommon, small, and difficult are the least attractive. This scheme also allows management to prioritize performance opportunities against functional improvements when allocating resources and schedule for a product release.

Further classification requires more judgment and consideration of external forces such as usage trends, hardware design trends, resource availability, and expertise in a given code base. Issues classified as common and high but difficult are appropriate for a major achievement of a given release, whereas an opportunity that is uncommon and moderate but easy might be an appropriate task for a novice compiler developer.

So-called "nonsense optimizations" are often controversial. These are opportunities that are almost nonexistent in human-written source code, for example, extensive operations on constants. Ordinarily they would be considered unattractive candidates; however, they can appear in hidden forms such as the result of macro expansion or as the result of optimizations performed by earlier phases. In addition, they often have high per-use payoff and are easy to implement, so it is usually worthwhile to implement new nonsense optimizations when they are discovered.
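The (frequency, payoff, difficulty) triple can be represented directly in a tracking tool. The sketch below is illustrative only: the enumerations mirror the categories defined above, but the numeric ordering and the additive score are assumptions of this example, not weights published in the article.

```cpp
// Priority triple for a candidate performance improvement.
// Higher enum value = more attractive on that axis, so the best
// class (common, high, easy) scores highest.
enum class Frequency  { Uncommon, Skewed, Common };
enum class Payoff     { Small, Moderate, High };
enum class Difficulty { Difficult, Straightforward, Easy };

struct Priority {
    Frequency  freq;
    Payoff     payoff;
    Difficulty difficulty;

    // Ad hoc ranking for sorting a worklist; any monotone
    // combination of the three axes would serve the same purpose.
    int score() const {
        return static_cast<int>(freq)
             + static_cast<int>(payoff)
             + static_cast<int>(difficulty);
    }
};
```

Sorting the opportunity list by such a score reproduces the resource-allocation ordering described above: common/high/easy items first, uncommon/small/difficult items last.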

Management control and resource allocation issues can arise when common, high, or easy opportunities involve software owned by groups not under the direct control of the compiler developers, such as headers or libraries.

Tools and Methodology

We begin this section with a discussion of performance evaluation tools and their application to problems. We then briefly present the results of three case studies.

Tools and Their Application to Problems

Tools for performance evaluation are used for either measurement or analysis. Tools for measurement are designed mainly for accurate, absolute timing. Low overhead, reproducibility, and stability are more important than high resolution. Measurement tools are primarily used in regression testing to identify the existence of new performance problems. Tools for analysis, on the other hand, are used to isolate the source code responsible for the problem. High, relative accuracy is more important than low overhead or stability here. Analysis tools tend to be intrusive: they add instrumentation to either the sources or the executable image in some manner, so that enough information about the execution can be captured to provide a detailed profile.

We have constructed adequate automated measurement tools using scripts layered over standard operating system timing packages. For compile-time measurement, a driver reads the compile commands from a file and, after compiling the source the specified number of times, writes the resulting timings to a file. Postprocessing scripts evaluate the usability of the results (average times, deviations, and file sizes) and compare the new results against a set of reference results. For compile-time measurement, the default, debug, and optimize compilation modes are all tested, as previously discussed.

These summarized results indicate if the test version has suffered performance regressions, the magnitude of these regressions, and which benchmark source is exhibiting a regression. Analysis of the problem can then begin.

The tools we use for compile-speed and run-time analysis are considerably more sophisticated than the measurement tools. They are generally provided by the CPU design or operating system tools development groups and are widely used for application tuning as well as compiler improvements. We have used the following compile-speed analysis tools:

• The compiler's internal -show statistics feature gives a crude measure of the time required for each compiler phase.

• The gprof and hiprof tools are supplied in the development suites for DIGITAL UNIX. Both operate by building an instrumented version of the test software (the compiler itself in our case). The gprof tool works with the compiler, the linker, and the loader; it is available from several UNIX vendors. Hiprof is an Atom tool available only on DIGITAL UNIX; it does not require compiler or linker support. The benchmark exhibiting the performance problem can then be compiled with the profiling version of the compiler, and the compilation profile can be captured. Using the display facilities of the tool, we can analyze the relevant portions of the execution profile. We can then compare this profile with that of the reference version to localize the problem to a specific area of compiler source. Once this information is available, a specific edit can be identified as the cause and a solution can be identified and implemented. Another round of measurement is needed to verify the repair is effective, similar to the procedure for addressing a functional regression.

• When the problem needs to be pinpointed more accurately than is possible with these profiling tools, we use the IPROBE tool, which can provide instruction-by-instruction details about the execution of a function.14

We have used the following tools or processes for run-time analysis:

• We apply hiprof and gprof in combination, and the IPROBE tool as described above, to the run-time behavior of the test program rather than to its compilation.

• We analyze the NULLSTONE results by examining the detailed log file. This log identifies the problem and the machine code generated. This analysis is usually adequate since the tests are generally quite simple.

• If more detailed analysis is needed, e.g., to pinpoint cache misses, we use the highly detailed results generated by the Digital Continuous Profiling Infrastructure (DCPI) tool. DCPI can display detailed (average) hardware behavior on an instruction-by-instruction basis. Any scheduling problems that may be responsible for frequent cache misses can be identified from the DCPI output, whereas they may not always be obvious from casually observing the machine code.

• Finally, we use the estimated schedule dump and statistical data optionally generated by the GEM back end.1 This dump tells us how instructions are scheduled and issued based on the processor architecture selected. It may also provide information about ways to improve the schedule.

40 Digital Technical Journal Vol. 10 No. 1 1998

In the rest of this section, we discuss three examples of applying analysis tools to problems identified by the performance measurement scripts.

Compile-Time Test Case

Compile-time regression occurred after a new optimization called base components was added to the GEM back end to improve the run-time performance of structure references. Table 1 gives compile-time test results that compare the ratios of compile times using the new optimized back end to those obtained with the older back end. The results for the iostream test indicate a significant degradation of 25 percent in the compile speed for optimize mode, whereas the performance in the other two modes is unchanged.

To analyze this problem, we built hiprof versions of the two compilers and compiled the iostream benchmark to obtain its compilation profile. Figures 1a and 1b show the top contributions in the flat hiprof profiles from the two compilers. These profiles indicate that the number of calls made to cse and gem_il_peep in the new version is greater than that of the old one and that these calls are responsible for performance degradation. Figures 2a and 2b show the call graph profiles for cse for the two compilers and show the calls made by cse and the contributions of each component called by cse. Since these components are included in the GEM back end, the problem was fixed there.

Run-Time Test Cases

For the run-time analysis, we used two different test environments, the Haney kernels benchmark and the NULLSTONE test run against gcc.

Haney Kernels
The Haney kernels benchmark is a synthetic test written to examine the performance of specific C++ language features. In this run-time test case, an older C++ compiler (version 5.5) was compared with a new compiler under development (version 6.0). The Haney kernels results showed that the version 6.0 development compiler experienced an overall performance regression of 40 percent. We isolated the problem to the real matrix multiplication function. Figure 3 shows the execution profile for this function.

We then used the DCPI tool to analyze performance of the inner loop instructions exercised on version 6.0 and version 5.5 of the C++ compiler. The resulting counts in Figures 4a and 4b show that the version 6.0 development compiler suffered a code scheduling regression. The leftmost column shows the average cycle counts for each instruction executed. The reason for this regression proved to be that a test

Table 1
Ratios of CPU (User and System) Compile Times (Seconds) of the New Compiler to Those of the Old Compiler

File Name              Debug Mode   Default Mode   Optimize Mode
Options                -O0 -g                      -O4 -g0
a1amch2                0.970        0.970          0.930
collevol               0.910        0.780          0.740
d_inh                  0.970        0.960          0.960
e_rvirt_yes            0.970        0.980          0.960
interfaceparticle      0.880        0.790          0.730
iostream               0.990        0.980          1.250
pistream               0.890        0.760          0.790
t202                   0.970        0.970          1.130
t300                   0.980        0.960          1.040
t601                   1.010        1.020          1.010
t606                   1.000        1.020          1.020
t643                   1.020        1.010          1.000
test_complex_excepti   0.960        0.890          0.830
test_complex_math      0.970        0.950          0.950
test_demo              0.950        0.830          0.780
test_generic           1.000        1.020          1.100
test_task_queue6       0.970        0.920          0.960
test_task_rand1        0.950        0.890          0.890
test_vector            0.970        0.920          1.120
vectorf                0.890        0.790          0.850

Averages               0.961        0.920          0.952

granularity: cycles; units: seconds; total: 48.96 seconds

 %     cumulative   self
time    seconds    seconds    calls    ms/call  ms/call  name
 2.8      1.37       1.37     10195      0.13     0.13   cse [12]
 2.6      2.66       1.29    219607      0.01     0.01   gem_il_peep [31]
 2.6      3.93       1.27    515566      0.00     0.00   gem_fi_ud_access_resource [67]
 2.4      5.09       1.17    481891      0.00     0.00   gem_vm_get_nz [37]
 2.3      6.23       1.14    713176      0.00     0.00   _OtsZero [75]

(a) Hiprof Profile Showing Instructions Executed with the New Compiler

granularity: cycles; units: seconds; total: 27.49 seconds

 %     cumulative   self
time    seconds    seconds    calls    ms/call  ms/call  name
 3.0      0.83       0.83    143483      0.01     0.01   gem_il_peep [40]
 2.7      1.58       0.75    614350      0.00     0.00   _OtsZero [64]
 2.5      2.26       0.68      8664      0.08     0.08   cse [16]
 1.7      2.71       0.45    465634      0.00     0.00   gem_fi_ud_access_resource [86]
 1.6      3.14       0.43    423144      0.00     0.00   gem_vm_get_nz [36]

(b) Hiprof Profile Showing Instructions Executed with the Old Compiler

Figure 1
Hiprof Profiles of Compilers

for pointer disambiguation outside the loop code was not performed properly in the version 6.0 compiler. The test would have ensured that the pointers a and t were not overlapping.

We traced the origin of this regression back to the intermediate code generated by the two compilers. Here we found that the version 6.0 compiler used a more modern form of array address computation in the intermediate language for which the scheduler had not yet been tuned properly. The problem was fixed in the scheduler, and the regression was eliminated.

Initial NULLSTONE Test Run against gcc
We measured the performance of the DEC C compiler in compiling the NULLSTONE tests and repeated the performance measurement of the gcc 2.7.2 compiler and libraries on the same tests. Figures 5a and 5b show the results of our tests. This comparison is of interest because gcc is in the public domain and is widely used, being the primary compiler available on the public-domain operating system. Figure 5a shows the tests in which the DEC C compiler performs at least 10 percent better than gcc. Figure 5b indicates the optimiza-

[12]   14.1   1.37   5.55   10195+995    cse [12]
              2.63   134485/134485      test_for_cse [42]
              0.63   134485/134485      update_operands [92]
              0.59   102760/102760      test_for_induction [97]
              0.34   121243/121243      gem_df_move [136]
              0.32   12127/12127        push_effect [149]

(a) Hierarchical Profile for cse with the New Compiler

[16]   10.5   0.68   2.19   8664+7593    cse [16]
              1.04   96554/96554        test_for_cse [56]
              0.30   66850/66850        test_for_induction [104]
              0.29   96554/96554        update_operands [106]
              0.12   87176/87176        move [215]
              0.09   7863/7863          pop_effect [267]

(b) Hierarchical Profile for cse with the Old Compiler

Figure 2
Hierarchical Call Graph Profiles for cse
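The missing pointer-disambiguation test behind the Haney regression can be sketched in source form. The check below illustrates the idea of verifying at run time that two array operands do not overlap, so that an aggressively scheduled loop body is safe; it is an illustration of the technique, not the GEM compiler's actual generated code.

```c
/* A sketch of a pointer-disambiguation (overlap) test of the kind
 * discussed above: before executing an optimized loop, generated code
 * can verify that the destination and source regions are disjoint and
 * fall back to a conservative loop otherwise.  Illustrative assumption,
 * not GEM output. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero when [p, p+n) and [q, q+n) overlap.  Address comparison of
 * possibly unrelated objects is done through uintptr_t, as generated
 * code (rather than portable C) would do. */
int regions_overlap(const double *p, const double *q, size_t n)
{
    uintptr_t pb = (uintptr_t)p, qb = (uintptr_t)q;
    uintptr_t pe = pb + n * sizeof(double);
    uintptr_t qe = qb + n * sizeof(double);
    return pb < qe && qb < pe;
}
```

When the test proves the regions disjoint, loads from the source may be hoisted above stores to the destination, which is exactly the scheduling freedom the version 6.0 compiler failed to establish.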

tion tests in which the DEC C compiler shows 10 percent or more regression compared to gcc.

We investigated the individual regressions by looking at the detailed log of the run and then examining the machine code generated for those test cases. In this case, the alias optimization portion showed that the regressions were caused by the use of an outmoded standard15 as the default language dialect (-std0) for DEC C in the DIGITAL UNIX environment. After we retested with the -ansi_alias option, these regressions disappeared.

We also investigated and fixed regressions in instruction combining and if optimizations. Other regressions, which were too difficult to fix within the existing schedule for the current release, were added to the issues list with appropriate priorities.

void matMulHC(Real *t,
              const Real *a, const Real *b,
              const int M, const int N, const int K)
{
    int i, j, k;
    Real temp;

    memset(t, 0, M * N * sizeof(Real));
    for (j = 1; j <= N; j++) {
        for (k = 1; k <= K; k++) {
            temp = b[k - 1 + K * (j - 1)];
            if (temp != 0.0) {
                for (i = 1; i <= M; i++)
                    t[i - 1 + M * (j - 1)] +=
                        temp * a[i - 1 + M * (k - 1)];
            }
        }
    }
}

Figure 3
Haney Loop for Real Matrix Multiplication

Conclusions

The measurement and analysis of compiler performance has become an important and demanding field. The increasing complexity of CPU architectures and the addition of new features to languages require the development and implementation of new strategies for testing the performance of C and C++ compilers. By employing enhanced measurement and analysis techniques, tools, and benchmarks, we were able to address these challenges. Our systematic framework for compiler performance measurement, analysis, and prioritization of improvement opportunities should serve as an excellent starting point for the practitioner in a situation in which similar requirements are imposed.

References and Notes

1. D. Blickstein et al., "The GEM Optimizing Compiler System," Digital Technical Journal, vol. 4, no. 4 (Special Issue, 1992): 121-136.

2. B. Calder, D. Grunwald, and B. Zorn, "Quantifying Behavioral Differences Between C and C++ Programs," Journal of Programming Languages, 2 (1994): 313-351.

3. D. Detlefs, A. Dosser, and B. Zorn, "Memory Allocation Costs in Large C and C++ Programs," Software Practice and Experience, vol. 24, no. 6 (1994): 527-542.

4. P. Wu and F. Wang, "On the Efficiency and Optimization of C++ Programs," Software Practice and Experience, vol. 26, no. 4 (1996): 453-465.

5. A. Itzkowitz and L. Foltan, "Automatic Template Instantiation in DIGITAL C++," Digital Technical Journal, vol. 10, no. 1 (this issue, 1998): 22-31.

6. NULLSTONE Optimization Categories, URL: http://www.nullstone.com/htmls/category.htm, Nullstone Corporation, 1990-1998.

7. J. Orost, "The Bench++ Benchmark Suite," December 12, 1995. A draft paper is available at http://www.research.att.com/~orost/bench_plus_plus/paper.html.

8. C++ Benchmarks, Comparing Compiler Performance, URL: http://www.bi.com/index.html, Kuck and Associates, Inc. (KAI), 1998.

9. ATOM User Manual (Maynard, Mass.: Digital Equipment Corporation, 1995).

10. A. Eustace and A. Srivastava, "ATOM: A Flexible Interface for Building High Performance Program Analysis Tools," Western Research Lab Technical Note TN-44, Digital Equipment Corporation, July 1994.

11. A. Eustace, "Using Atom in Computer Architecture Teaching and Research," Computer Architecture Technical Committee Newsletter, IEEE Computer Society, Spring 1995: 28-35.

12. J. Anderson et al., "Continuous Profiling: Where Have All the Cycles Gone?" SRC Technical Note 1997-016, Digital Equipment Corporation, July 1997; also in ACM Transactions on Computer Systems, vol. 15, no. 4 (1997): 357-390.

13. J. Dean, J. Hicks, C. Waldspurger, W. Weihl, and G. Chrysos, "ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors," 30th Symposium on Microarchitecture (Micro-30), Raleigh, N.C., December 1997.

14. Guide to IPROBE, Installing and Using (Maynard, Mass.: Digital Equipment Corporation, 1994).

15. B. Kernighan and D. Ritchie, The C Programming Language (Englewood Cliffs, N.J.: Prentice-Hall, 1978).

Figure 4
DCPI Profiles of the Inner Loop
(a) DCPI Profile for This Execution with Version 6.0
(b) DCPI Profile with Counts with Version 5.5
(The annotated Alpha instruction listings for matMulHC, which show the average cycle count for each inner-loop instruction under the two compilers, are too garbled in this scan to reproduce.)
NULLSTONE SUMMARY PERFORMANCE IMPROVEMENT REPORT
Nullstone Release 3.9b2

Threshold: Nullstone Ratio Increase by at least 10%

               Baseline Compiler   Comparison Compiler
Compiler       GCC 2.7.2           DEC Alpha C 5.7-123 bl36, no restrict
Architecture   DEC Alpha           DEC Alpha
Model          3000/300            3000/300

Optimization                            Sample Size   Improvements
Alias Optimization (by type)            102 tests
Alias Optimization (const-qualified)     11 tests        0 tests
Alias Optimization (by address)          57 tests       19 tests
Bitfield Optimization                     3 tests        3 tests
Branch Elimination                       15 tests       15 tests
Instruction Combining                  2510 tests     2026 tests
Constant Folding                         56 tests       56 tests
Constant Propagation                     15 tests        8 tests
CSE Elimination                        2600 tests     2353 tests
Dead Code Elimination                   306 tests      278 tests
Integer Divide Optimization              92 tests       15 tests
Expression Simplification               181 tests      120 tests
If Optimization                          69 tests       13 tests
Function Inlining                        39 tests       39 tests
Induction Variable Elimination            4 tests        3 tests
Strength Reduction                        2 tests        1 tests
Hoisting                                 38 tests       18 tests
Loop Unrolling                           16 tests       11 tests
Loop Collapsing                           3 tests        3 tests
Loop Fusion                               2 tests        2 tests
Unswitching                               2 tests        1 tests
Block Merging                             1 tests        1 tests
Cross Jumping                             4 tests        2 tests
Integer Modulus Optimization             92 tests       26 tests
Integer Multiply Optimization            99 tests        3 tests
Address Optimization                     26 tests       20 tests
Pointer Optimization                     15 tests        9 tests
Printf Optimization                       3 tests        3 tests
Forward Store                             3 tests        3 tests
Value Range Optimization                 30 tests        0 tests
Tail Recursion                            4 tests        2 tests
Register Allocation                       4 tests        1 tests
Narrowing                                 3 tests        0 tests
SPEC Conformance                          2 tests        0 tests
Static Declarations                       1 tests        1 tests
String Optimization                       4 tests        4 tests
Volatile Conformance                     90 tests        0 tests

Total Performance Improvements >= 10%  6499 tests     5065 tests

Figure 5a
NULLSTONE Results Comparing gcc with the DEC C Compiler, Showing All Improvements of Magnitude 10% or More
NULLSTONE SUMMARY PERFORMANCE REGRESSION REPORT
Nullstone Release 3.9b2

Threshold: Nullstone Ratio Decreased by at least 10%

               Baseline Compiler   Comparison Compiler
Compiler       GCC 2.7.2           DEC Alpha C 5.7-123 bl36, no restrict
Architecture   DEC Alpha           DEC Alpha
Model          3000/300            3000/300

Optimization                            Sample Size   Regressions
Alias Optimization (by type)            102 tests       64 tests
Alias Optimization (const-qualified)     11 tests        5 tests
Alias Optimization (by address)          57 tests        7 tests
Instruction Combining                  2510 tests      204 tests
Constant Propagation                     15 tests        1 tests
CSE Elimination                        2600 tests       32 tests
Integer Divide Optimization              92 tests       32 tests
Expression Simplification               181 tests       34 tests
If Optimization                          69 tests       14 tests
Hoisting                                 38 tests        4 tests
Unswitching                               2 tests        1 tests
Integer Modulus Optimization             92 tests       40 tests
Integer Multiply Optimization            99 tests       95 tests
Pointer Optimization                     15 tests        1 tests
Tail Recursion                            4 tests        2 tests
Narrowing                                 3 tests        2 tests

Total Performance Regressions >= 10%   6499 tests      542 tests

Figure 5b
NULLSTONE Results Comparing gcc with the DEC C Compiler, Showing All Regressions of 10% or Worse
Biographies

Kevin W. Harris
Kevin Harris is a consulting software engineer at Compaq, currently working in the DEC C and C++ Development Group. He has 21 years of experience working on high-performance compilers, optimization, and parallel processing. Kevin graduated Phi Beta Kappa in mathematics from the University of Maryland and joined Digital Equipment Corporation after earning an M.S. in computer science from the Pennsylvania State University. He has made major contributions to the DIGITAL Fortran, C, and C++ product families. He holds patents for techniques for exploiting performance of shared memory multiprocessors and register allocation. He is currently responsible for performance issues in the DEC C and DIGITAL C++ product families. He is interested in CPU architecture, compiler design, large- and small-scale parallelism and its exploitation, and software quality issues.

Hemant G. Rotithor
Hemant Rotithor received B.S., M.S., and Ph.D. degrees in electrical engineering in 1979, 1981, and 1989, respectively. He worked on C and C++ compiler performance issues in the Core Technology Group at Digital Equipment Corporation for three years. Prior to that, he was an assistant professor at Worcester Polytechnic Institute and a development engineer at Philips. Hemant is a member of the program committee of The 10th International Conference on Parallel and Distributed Computing and Systems (PDCS '98). He is a senior member of the IEEE and a member of Eta Kappa Nu, Tau Beta Pi, and Sigma Xi. His interests include computer architecture, performance analysis, digital design, and networking. Hemant is currently employed at Intel Corporation.

Mark W. Davis
Mark Davis is a senior consulting engineer in the Core Technology Group at Compaq. He is a member of Compaq's GEM Compiler Back End team, focusing on performance issues. He also chairs the DIGITAL UNIX Calling Standard Committee. He joined Digital Equipment Corporation in 1991 after working as Director of Compilers at Stardent Computer Corporation. Mark graduated Phi Beta Kappa in mathematics from Amherst College and earned a Ph.D. in computer science from Harvard University. He is co-inventor on a pending patent concerning 64-bit software on OpenVMS.


August G. Reinig

Alias Analysis in the DEC C and DIGITAL C++ Compilers

During alias analysis, the DEC C and DIGITAL C++ compilers use source-level type information to improve the quality of code generated. Without the use of type information, the compilers would have to assume that any assignment through a pointer expression could modify any pointer-aliased object. In contrast, through the use of type information, the compilers can assume that such an assignment can modify only those objects whose type matches that referenced by the pointer.

When two or more address expressions reference the same memory location, these address expressions are aliases for each other. A compiler performs alias analysis to detect which address expressions do not reference the same memory locations. Good alias analysis is essential to the generation of efficient code. Code motion out of loops, common subexpression elimination, allocation of variables to registers, and detection of uninitialized variables all depend upon the compiler knowing which objects a load or a store operation could reference.

Address expressions may be symbol expressions or pointer expressions. In the C and C++ languages, a compiler always knows what object a symbol expression references. The same is not true with pointer expressions. Determining which objects a pointer expression may reference is an ongoing topic of research.

Most of the research in this area focuses on the use of techniques that track which object a pointer expression might point to.1,2 When these techniques cannot make this determination, they assume that the pointer expression points to any object whose address has been taken. These techniques generally ignore the type information available to the source program. The best techniques perform interprocedural analysis to improve their accuracy. Although effective, the cost of analyzing a complete program can make this analysis impractical.

In contrast, the DEC C and DIGITAL C++ compilers use high-level type information as they perform

alias analysis on a routine-by-routine basis. Limiting alias analysis to within a routine reduces its cost, albeit at the cost of reducing its effectiveness.

The use of this type information results in slight improvements in the performance of some standard-conforming C and C++ programs. These improvements come at little expense in terms of compilation time. There is, however, a risk that the use of this type information on nonstandard-conforming C or C++ programs may result in the compiler producing code that exhibits unexpected behavior.
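The assumption described in the abstract can be made concrete with a small example. This is a hypothetical fragment written for this illustration, not code from the compilers themselves.

```c
/* A minimal illustration of type-based alias analysis: under the ISO C
 * aliasing rules, the store through fp cannot modify the int that ip
 * points to, so a conforming compiler may keep *ip in a register across
 * the store instead of reloading it.  Hypothetical example code. */
#include <assert.h>

int load_after_float_store(int *ip, float *fp)
{
    *ip = 41;
    *fp = 1.0f;      /* by the type rules, may not alias *ip */
    return *ip + 1;  /* the compiler may assume *ip is still 41 */
}
```

Without type information, the compiler would have to assume the store through fp could change *ip and would reload it from memory; with the type rules it can simply return the constant 42.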

The C and C++ Type Systems

Research available on the use of type information during alias analysis involves languages other than C and C++. Traditionally, C is a weakly typed language. A pointer that references one type may actually point to an object of a different type. For this reason, most alias-analysis techniques ignore type information when analyzing programs written in C.

The ISO Standard for C defines a much stronger typing system. In ISO Standard C, a pointer expression can access an object only if the type referenced by the pointer meets the following criteria:

• It is compatible with the type of the object, ignoring type qualifiers and signedness.
• It is compatible with the type of a member of an aggregate or union or submembers thereof, ignoring type qualifiers and signedness.
• It is the char type.

Thus, in Figure 1, the pointer p can point to A, B, C, or S (through S.sub.m) but not to T or F. The pointer q, being a pointer to char, can refer to any of A, B, C, S, T, or F.

The proposed ISO Standard for C++ defines a similar typing system for C++. The strength of the Standard C and C++ type systems allows the DEC C and DIGITAL C++ compilers to use type information during alias analysis.

Many existing C applications do not conform to the Standard C typing rules. They use cast expressions to circumvent the Standard C type system. To support these applications, the DEC C compiler has a mode whereby it ignores type information during alias analysis. The DIGITAL C++ compiler also has such a mode. This mode exists to support those C++ programmers who circumvent the C++ type system.

The Side-effects Package

The DEC C and DIGITAL C++ compilers are GEM compilers. The GEM compiler system includes a highly optimizing back end. This back end uses the GEM data access model to determine which objects a load or a store may access. GEM compiler front ends augment the GEM data access model with a side-effects package, i.e., an alias-analysis package. The side-effects package provides the GEM optimizer additional information about loads and stores using language-specific information otherwise unavailable to the GEM optimizer.

The DEC C and DIGITAL C++ compilers share a common side-effects package. The DEC C and C++ side-effects package

• Determines which symbols, types, and parts thereof a routine references
• Determines the possible side effects of these references
• Answers queries from the GEM optimizer regarding the effects and dependencies of memory accesses

Preserving Memory Reference Information

The DEC C and DIGITAL C++ front ends perform lexical analysis and parsing of the source program, generating a GEM intermediate language (GEM IL) graph representation of the source program.6 A tuple is a node in the GEM IL and represents an operation in the source program.

As the DEC C and DIGITAL C++ front ends generate GEM IL, they annotate each fetch (read) and store (write) tuple with information describing the object being read or written. The front ends annotate fetches and stores of symbols with information about the symbol. They annotate fetches and stores through pointers with information about the type the pointer references. The annotation information includes information describing exactly which bytes of the symbol or type the tuple accesses. This allows the side-effects package

to differentiate between access to two different members of a structure.

int                    A;
signed int const       B;
unsigned int volatile  C;
struct {
    struct {
        int m;
    } sub;
} S;
struct {
    short z;
} T;
float F;
int  *p;
char *q;

Figure 1
Code Fragment Associated with the Explanation of the Standard C Aliasing Rules

Arrays
Neither the DEC C nor the DIGITAL C++ front end differentiates between accesses to different elements of an array. Both assume that all array accesses are to the first element of the array. The GEM optimizer does extensive analysis of array references. Being flow insensitive, the DEC C and C++ side-effects package can, at best, differentiate between two array references that both use constant indices. The GEM optimizer can do much more.

What the GEM optimizer cannot do, however, is determine that an assignment through a pointer to an int does not change any value in an array of doubles. This is the purpose of the DEC C and C++ side-effects package. Mapping all array accesses to access the first

element of an array does not hinder this purpose and simplifies alias analysis of arrays.

Tuple Annotation Example
For the program fragment in Figure 2, the DEC C and DIGITAL C++ front ends generate the annotated tuples displayed in Table 1.

The side-effects package processes a routine by:

• Examining each tuple within the routine that references (reads or writes) memory, allocating effects classes that represent the memory that the tuple references
• Performing type-based alias analysis
• Responding to alias-analysis queries from the GEM optimizer

To determine the possible side effects of a memory access, the side-effects package partitions memory into effects classes. An effects class represents all or part of an object. To minimize the number of effects classes under consideration, the side-effects package creates effects classes for only those object regions referenced within the current routine.

Allocating Effects Classes
There are two kinds of effects classes. The first kind represents a region of an individual object. The second kind represents a region of all allocated objects of a particular type. Allocated objects are those created by the malloc() function and its relatives or the C++ new operator.

As it processes the tuples within a routine, the side-effects package examines the memory reference information associated with the tuple. The side-effects package creates an effects class for each different set of memory reference information it encounters.

Table 1
Tuple Annotations

C/C++ Source                                          Annotation   Annotation
Expression    Tuple        Symbol   Type              Start Byte   End Byte
p->x = 3      Fetch p      p        struct S *        0            7
              Store p->x   none     struct S          0            3
v1.y = 3      Store v1.y   v1       struct S          4            7
v2 = v1       Fetch v1     v1       struct S          0            7
              Store v2     v2       struct S          0            7
d[i] = d[0]   Fetch d[0]   d        double            0            7
              Fetch i      i        int               0            3
              Store d[i]   d        double            0            7
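The program fragment that Table 1 annotates (the article's Figure 2) is not reproduced legibly in this scan. The declarations and statements below are a reconstruction inferred from the table's rows, and are therefore an assumption rather than the author's original fragment.

```c
/* A reconstruction of the annotated program fragment implied by
 * Table 1: a store through a struct pointer, member and whole-struct
 * accesses of symbols, and array element accesses.  Assumes 4-byte int
 * (bytes 0-3 and 4-7 of struct S) and 8-byte double, matching the
 * table's byte ranges. */
#include <assert.h>

struct S { int x; int y; };

int annotated_fragment(struct S *p, double d[], int i)
{
    struct S v1 = {0, 0}, v2;
    p->x = 3;     /* Store p->x: bytes 0-3 of struct S through p */
    v1.y = 3;     /* Store v1.y: bytes 4-7 of symbol v1 */
    v2 = v1;      /* Fetch v1, Store v2: bytes 0-7 */
    d[i] = d[0];  /* Fetch d[0], Fetch i, Store d[i] */
    return v2.y + p->x;
}
```

Each statement produces exactly the fetch and store tuples listed in the corresponding rows of Table 1.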

If two members occupy exactly the same memory locations, a single effects class represents both members.

For the program fragment in Figure 3, the side-effects package creates the effects classes displayed in Table 2. There is only one effects class for *uip and *ip since uip and ip may point to the same object. There are no effects classes for bytes 0 through 3 of s and struct S as there are no references to s.x or sp->x. By allocating effects classes for only those object regions referenced within the routine, the side-effects package greatly reduces both the number of effects classes and the time required to perform alias analysis.

In the traditional C type system, a pointer expression may point to anything, regardless of type. To represent this, the side-effects package creates exactly one effects class to represent allocated objects. It ignores the type and the start- and end-offset information.

Using the traditional C type system, for the program fragment shown in Figure 3, the side-effects package creates the effects classes displayed in Table 3. Here, effects class 7 replaces effects classes 7 through 11 in Table 2. All the differentiation by types disappears.

Effects-class Signatures
Having created the effects classes, the side-effects package associates a signature with each effects class. In addition, it associates an effects-class signature with each tuple within the routine and each symbol referenced within the routine. An effects-class signature records the possible side effects of referencing an effects class. A reference to one effects class may reference another effects class. The effects class for a load through a pointer to an int indicates that the load references an allocated int object. The pointer to an int may actually reference a pointer-aliased int symbol or an int member of a structure or union. An effects-class signature is a subset of all the effects

classes that might be referenced by a tuple. There is only one requirement for an effects-class signature: If two tuples may refer to the same part of memory, the intersection of their respective effects-class signatures must be non-null. If two tuples cannot refer to the same part of memory, it is desirable that the intersection of their effects-class signatures is null. An empty intersection leads to more optimization opportunities.

The most obvious rule for building an effects-class signature is to include in it all the effects classes that might be touched by a reference to the effects class. This leads to suboptimal code in cases such as that shown in Figure 4. There are three effects classes for this code, s<0,3>, s<4,7>, and s<0,7>, generated by references to s.x, s.y, and s, respectively. If the effects-class signature for s<0,3> includes both s<0,3> and s<0,7> and the effects-class signature for s<4,7> includes both s<4,7> and s<0,7>, then the intersection of these two effects-

struct S {
    int x;
    struct T {
        int y;
        float z;
    } t;
} s;
struct S     *sp;
signed int   *ip;
unsigned int *uip;
float        *fp;

*uip = *ip;
*fp = 2;
sp->t = s.t;
sp->t.y = 2;
s = *sp;

Figure 3
Code Fragment Associated with Allocating Effects Classes

    Effects   Type or     Start    End      Source Generating
    Class     Symbol      Offset   Offset   Effects Class
    1         s           0        11       s
    2         s           4        11       s.t
    3         sp          0        7        sp
    4         fp          0        7        fp
    5         ip          0        7        ip
    6         uip         0        7        uip
    7         struct S    0        11       *sp
    8         struct S    4        11       sp->t
    9         struct S    4        7        sp->t.y
    10        float       0        3        *fp
    11        int         0        3        *uip and *ip

Table 3
Effects Classes Using the Traditional C Type Rules

    Effects   Type or     Start    End      Source Generating
    Class     Symbol      Offset   Offset   Effects Class
    1         s           0        11       s
    2         s           4        11       s.t
    3         sp          0        7        sp
    4         fp          0        7        fp
    5         ip          0        7        ip
    6         uip         0        7        uip
    7         char        0        —        *sp, sp->t, *uip, sp->t.y, *fp, *ip

class signatures is non-null. This falsely indicates that s.x and s.y may refer to the same memory location. This forces GEM to generate code that stores s.y after storing to s.x.

The DEC C and C++ side-effects package uses more effective rules for building effects-class signatures. These rules offer more optimization opportunities while preserving necessary dependency information.

    struct S {
        int x;
        int y;
    } s;

    s.x = ...;
    s.y = ...;
    return s;

Figure 4
Example of Problematic Code for the Naïve Rule for Building Effects-class Signatures

Effects-class Signatures for Symbols  If an effects class represents a region A of a symbol, its signature includes itself. Its signature also includes all effects classes representing regions of the symbol wholly contained within A. Finally, it includes any effects class representing a region of the symbol that partially overlaps A. It does not include effects classes representing regions of the symbol that do not overlap A or that wholly contain A. Table 4 gives the symbol effects-class signatures for the three effects classes under discussion.

Table 4
Symbol Effects-class Signatures

    Effects Class   Effects-class Signature
    s<0,3>          s<0,3>
    s<4,7>          s<4,7>
    s<0,7>          s<0,3>, s<4,7>, s<0,7>

The inclusion of subregions in an effects-class signature means that references to symbols interfere with references to members therein and vice versa. Excluding super-regions in an effects-class signature means that references to two separate members of a symbol do not interfere with each other. In Table 4, the effects-class signatures for s<0,3> and s<4,7> do not interfere with each other. Both signatures interfere with the effects-class signature for s<0,7>.

The inclusion of effects classes representing partially overlapping regions of a symbol allows for the correct representation of the side effects of referencing submembers of complex unions.

Effects-class Signatures for Types  If an effects class represents a region of a type, the contents of its signature depends upon the type. If the type is the char type, the effects-class signature contains all the effects classes representing regions of other types or pointer-aliased symbols. This reflects the C and C++ type rules, which state that a pointer to a char can point to anything.

If the type is some type T other than char, the effects-class signature contains effects classes representing:

• Those regions of T that overlap the region of T the effects class represents, using the same overlap rules as for symbols
• Any region of a pointer-aliased symbol whose type is compatible to T, ignoring type qualifiers and signedness
• A region of a pointer-aliased aggregate or union symbol that contains a member or submember whose type is compatible to T, ignoring type qualifiers and signedness
• A region of an aggregate or union type that contains a member or submember whose type is compatible to T, ignoring type qualifiers and signedness

Table 5 gives the signatures for the effects classes in Table 2, assuming that the symbol s is pointer aliased.

Table 5
Type Effects-class Signatures

    Number   Effects Class     Effects-class Signature
    1        s<0,11>           1, 2
    2        s<4,11>           2
    3        sp<0,7>           3
    4        fp<0,7>           4
    5        ip<0,7>           5
    6        uip<0,7>          6
    7        struct S<0,11>    1, 2, 7, 8, 9
    8        struct S<4,11>    1, 2, 8, 9
    9        struct S<4,7>     1, 2, 9
    10       float<0,3>        1, 2, 7, 8, 10
    11       int<0,3>          1, 2, 7, 8, 9, 11

Including the effects classes of symbols in the effects-class signatures of types records the interference of references through pointers with references to pointer-aliased symbols. In Figure 3, the pointer uip points to an unsigned int. The member s.t.y has type int. Thus, uip may point to s.t.y. The member s.t contains s.t.y. Thus, the signature for the effects-class int<0,3> contains the effects-class s<4,11>. This means that the load of s.t depends upon the store through uip.

Including the effects classes of types in the signatures of the effects classes of other types records the interference of references through a pointer with references through pointers to other types. In Figure 3, the pointer fp points to a float object. The member sp->t.z has type float. Thus, fp may point to sp->t.z. The member sp->t contains sp->t.z. Thus, the signature for the effects-class float<0,3> contains the effects-class struct S<4,11>. This reflects the fact that the store to sp->t.y depends upon the store through fp, i.e., it must occur after the store through fp.

Even though the signature for the effects-class float<0,3> contains the effects-class struct S<4,11> (sp->t), it does not contain the effects-class struct S<4,7> (sp->t.y). There is no float member of struct S whose position within struct S overlaps bytes 4 through 7 of struct S. There is a float member of struct S, namely z, whose position within struct S overlaps bytes 4 through 11 of struct S. The signature for the effects-class float<0,3> would not contain the effects-class s<0,3> if it existed. There is no float member of s whose position overlaps bytes 0 through 3 of s.

Additional Effects-class Signatures  The side-effects package creates a special effects-class signature representing the side effects of a call. A called procedure may reference the following:

• Any pointer-aliased symbol (by means of a reference through a pointer)
• Any allocated object (by means of a reference through a pointer)
• Any nonlocal symbol (by means of direct access)
• Any local static symbol (by means of recursion)

The effects signature for a call includes all the effects classes representing these objects.

Responding to Optimizer Queries  During optimization, the optimizer makes two types of queries to the side-effects analysis routines: dominator-based queries and nondominator-based queries.

When doing nondominator-based optimizations, the optimizer uses a bit vector to represent those objects a write may change (its effects). A similar bit vector represents those objects whose value a read may fetch (its dependencies). Each bit in the bit vector represents an effects class. If a tuple's effects-class signature contains an effects class, that effects class's bit is set in the tuple's bit vector. The optimizer uses the union of the bit vectors associated with a set of tuples to represent the combined effects or dependencies of those tuples.

Dominator-based queries involve finding the nearest dominating tuple that might write to the same memory location as the tuple in question. Tuple A dominates tuple B if every path from the start of the routine to B goes through A.8 If both tuples A and C dominate B, tuple A is the nearer dominator if C dominates A.

When doing dominator-based optimizations, the side-effects package represents the tuples in the current dominator chain as a stack, adding and removing tuples from the stack as GEM moves from one path in the routine's dominator tree to another. Searching a single stack for the nearest dominating tuple that might write the same memory as the tuple in question could lead to O(N²) performance, where N is the number of tuples in the dominator chain. This worst-case behavior occurs when none of the tuples in a dominator chain affects any subsequent tuple in the chain. Each time the side-effects package searches the stack, it examines all the tuples in the stack.

To avoid this, the DEC C and C++ side-effects package creates a stack for each effects class. When pushing a tuple, the side-effects package pushes the tuple on each stack associated with an effects class in the tuple's effects-class signature. When the GEM optimizer tells the side-effects package to find the nearest dominating write for a tuple, the side-effects package need only choose the nearest of those tuples that are on the top of the stacks associated with the tuple's effects-class signature. It need only look at the top of each stack, because a tuple would not be in the stack unless it might affect objects in the effects class associated with the stack.

The multistack worst-case behavior is O(NC). There are C separate stacks, one for each effects class. The effects-class signature for each effects class may contain all the other effects classes. This would mean that each of the N tuples in the dominator chain would appear in each of the stacks. Although the worst-case behavior for the multistack case is no better than the single-stack case (C may be equal to N), in practice there are often more tuples within a routine than effects classes. Furthermore, effects-class signatures often contain a small number of effects classes. A small number of effects classes in an effects-class signature means that there are a small number of stacks to consider. Choosing the nearest dominator from among the top tuples on these stacks requires examining only a small number of tuples.

Cost of Using Type Information

When compiling all of the SPECint95 test suite9 using high optimization, alias analysis accounts for approximately 5 percent of the compilation time. The use of Standard C type rules during alias analysis increases compilation time by less than 0.2 percent (time measured in number of cycles consumed by the compiler as reported by the Digital Continuous Profiling Infrastructure [DCPI]10). The increase in compilation time varies from program to program but never exceeds 0.5 percent. Handling the extra effects classes generated by using Standard C type aliasing information accounted for most of the increase.

Potentially, the cost of including type-aliasing information could be huge. Calculating which effects classes a reference through a char * pointer could touch is straightforward, as shown by the algorithm in Figure 5. A much more complicated process is required to calculate which effects classes could be touched by a reference through a pointer to a type other than char. The algorithm in Figure 6 performs this process. Fortunately, the innermost section of this loop is rarely executed. The innermost section executes only if a routine references a structure either through a pointer or a pointer-aliased symbol, that structure contains a substructure, and the routine references the substructure through a pointer.

Effectiveness

The benchmark programs from the SPECint95 suite offer some convenient test cases for measuring the effectiveness of type-based alias analysis. The sources are readily available and portable. The programs conform to alias rules established by the American National Standards Institute (ANSI) and are compute intensive. Unfortunately, they do not contain floating-point calculations. This reduces the number of different types used in the programs. Type-based alias analysis works best when there are many different types in use.

Three of the SPECint95 programs show no improvement when compiled using the Standard C typing rules as opposed to using the traditional C typing rules. These programs, namely compress, go, and li, do not use many different types and pointers to them. When all the pointers in a program are pointers to ints (go), there is only one effects class for all pointer accesses. Because the compiler has no way to differentiate among the objects touched by a dereference of a pointer expression, it generates identical code for these programs, regardless of the type rules used. The generated code for li differs only slightly and only for infrequently executed routines.

Changes in generated code for the remaining five benchmarks are more prevalent. Two benchmarks, ijpeg and perl, show a small reduction in the number of loads executed but no meaningful reduction in the total number of instructions executed. The other three SPECint95 benchmarks show varying degrees of reduction in both the number of loads executed (see Table 6) and the total number of instructions executed (see Table 7).

    foreach pointer-aliased symbol
        foreach effects class representing a region of the symbol
            add that effects class to the effects class signature for char *

Figure 5 Calculation of the Effects-class Signature of the Type char *

    foreach pointer-aliased symbol or type referenced through a pointer
        foreach member therein
            if the member's type is referenced through a pointer
                foreach effects class representing a region of the member's type
                    foreach effects class representing a region of the symbol
                            or type referenced through a pointer
                        if the two effects class regions overlap
                            add the symbol's or pointer's effects class to the
                                effects class signature associated with the
                                effects class representing the member's type

Figure 6 Calculation of the Effects-class Signature for Types Other Than char

Table 6
Number of Loads Executed by the Select SPECint95 Benchmarks

    SPEC          Millions of Loads        Millions of Loads          Percent
    Benchmark     Using Type Information   without Type Information   Reduction
    gcc           10,268                   10,365                     0.9
    ijpeg         16,853                   16,888                     0.2
    m88ksim       13,889                   14,157                     1.9
    perl          11,260                   11,296                     0.3
    vortex        18,994                   19,207                     1.1

Table 7
Number of Instructions Executed by the Select SPECint95 Benchmarks

    SPEC          Millions of Instructions   Millions of Instructions   Percent
    Benchmark     Using Type Information     without Type Information   Reduction
    gcc           42,830                     42,935                     0.2
    ijpeg         82,844                     82,834                     0.0
    m88ksim       72,490                     73,155                     0.9
    perl          45,219                     45,252                     0.1
    vortex        80,093                     80,607                     0.6

The load and instruction counts are those reported by using Atom's pixie tool on the SPECint95 binaries to generate pixstat data.11,12 The compiler used was a development C compiler. All compilations used the following switches: -fast, -O4, -arch ev56, and -inline speed. The compilations using the Standard C type system used the -ansi_alias switch. The compilations using the traditional C type system used the -noansi_alias switch. The benchmark binaries were run using the reference data set.

DCPI measurements of the reduction in the number of cycles consumed by these SPECint95 benchmarks showed no consistent reductions. Run-to-run variability in the data collected swamped any cycle-time reductions that might have occurred. Similarly, measurements of gains in SPECint95 results due to the use of type information during alias analysis showed no significant changes.

Changes in Generated Code

The code-generation changes one sees in the SPECint95 benchmarks are exactly what one would expect. The use of type information during alias analysis reduces the number of redundant loads. An example of this occurs in ijpeg, which contains the code sequence

    main->rowgroup_ctr =
        (JDIMENSION) (cinfo->min_DCT_scaled_size + 1);
    main->rowgroups_avail =
        (JDIMENSION) (cinfo->min_DCT_scaled_size + 2);

in process_data_context. Using the traditional C type system, the compiler must assume that main->rowgroup_ctr is an alias for cinfo->min_DCT_scaled_size. Thus, it must generate code that loads cinfo->min_DCT_scaled_size twice. The Standard C type system allows the compiler to generate only one load of cinfo->min_DCT_scaled_size.

Several of the benchmarks contain code similar to the following from conversion_recipe in gcc:

    curr.next->list->opcode = -1;
    curr.next->list->to = from;
    curr.next->list->cost = 0;
    curr.next->list->prev = 0;

Using traditional C type rules, the compiler must generate four loads of curr.next->list. The compiler must assume that the pointer curr.next->list may point to itself, making curr.next->list->member an alias for curr.next->list. The Standard C type rules allow the compiler to assume that curr.next->list does not point to itself. This allows the compiler to generate code that reuses the result of the first load of curr.next->list, eliminating three redundant loads.

In another example in gcc, the use of Standard C type rules allows the compiler to move a load outside a loop. The following loop occurs in fixup_gotos:

    for (; lists; lists = TREE_CHAIN (lists))
        if (TREE_CHAIN (lists)
                == thisblock->data.block.outer_cleanups)
            TREE_ADDRESSABLE (lists) = 1;

Standard C type rules tell the compiler that the store generated by TREE_ADDRESSABLE (lists) = 1 cannot modify thisblock->data.block.outer_cleanups. This allows the compiler to generate code that fetches thisblock->data.block.outer_cleanups once before entering the loop. Using traditional C type rules, the compiler must generate code that fetches

thisblock->data.block.outer_cleanups each time it traverses the loop.

Not only can type information reduce the number of redundant loads, it can reduce the number of redundant stores. In m88ksim, there are many routines similar to the following:

    int ffirst( ... )
    {
        ptr->gen.opc1 = 0x3c;
        ptr->gen.dest = operands.value[0];
        ptr->gen.opc2 = cmd->opc.rrr;
        ptr->gen.src2 = operands.value[1];
        return(0);
    }

opc1, dest, opc2, and src2 are bit fields sharing the same 32 bits (longword). Using traditional C typing rules, ptr->gen and cmd->opc may be aliases for each other. Thus to implement the above routine, the compiler must generate code that performs the following actions:

• Load ptr->gen
• Update bit fields ptr->gen.opc1 and ptr->gen.dest
• Store ptr->gen
• Load cmd->opc.rrr
• Update bit fields ptr->gen.opc2 and ptr->gen.src2
• Store ptr->gen

Using Standard C typing rules, the compiler does not have to generate the first store of ptr->gen. The assignments to ptr->gen.opc1 and ptr->gen.dest cannot change cmd->opc.rrr. In this case, alias analysis that is not type based would have a difficult time detecting that ptr->gen and cmd->opc do not alias each other. M88ksim never calls ffirst directly. It calls it by means of an array-indexed function pointer.

A Note of Caution

Many C programs do not adhere to the Standard C aliasing rules. Through the use of explicit casting and implicit casting, they access objects of one type by means of pointers to other types. More aggressive optimization by GEM combined with more detailed alias-analysis information from the DEC C and C++ side-effects package increasingly results in these programs exhibiting unexpected behavior when the compiler uses Standard C aliasing rules.

Passing a pointer to one type to a routine that expects a pointer to another type works as expected, until the GEM optimizer inlines the called procedure. If the procedure is not inlined, the DEC C and C++ side-effects package must assume that the call conflicts with all pointer accesses before and after the call. Once GEM inlines the routine, the side-effects package is free to assume that references using the inlined pointer do not conflict with references using the pointer at the call site. The two pointers point to two different types.

A recent example of this problem occurred in the gcc program in the SPECint95 benchmark suite. All programs in this suite are supposed to conform to the Standard C type-aliasing rules. Because of an improvement to the GEM optimizer, this benchmark started to give unexpected results. In rtx_alloc, gcc clears a structure. After clearing this structure, gcc assigns a value to one of the fields in the structure. Through a series of valid optimizations (given the incorrect type information), the resulting code did not clear all the fields in the structure. This left uninitialized data in the structure, resulting in gcc behaving in an unexpected manner.

To avoid potential problems, the DEC C compiler, by default, does not use the Standard C type rules when performing alias analysis. The user of the compiler has to explicitly assert that the program does follow the Standard C type rules through the use of a command-line switch.

The DIGITAL C++ compiler does assume that the C++ program it is compiling adheres to the Standard C++ type rules. A user of the DIGITAL C++ compiler can use a command-line switch to inform the compiler that it should use traditional C type rules when performing alias analysis.

Summary

Using Standard C type information during alias analysis does improve the generated code for some C and C++ programs. The compilation cost of using type information is small. Except for rare cases, performance gains resulting from these code improvements are small. Any programs compiled using type information during alias analysis must strictly adhere to the Standard C and C++ aliasing rules. If not, the optimizer may generate code that produces unexpected results.

Acknowledgments

The author would like to thank Dave Blickstein, Mark Davis, Neil Faiman, Steve Hobbs, and Bill Noyce of the GEM team for their advice and reviews of this work. Dave Blickstein and Neil Faiman also did work in the GEM optimizer to ensure that the DEC C and C++ side-effects package had all the information it needed to do alias analysis correctly and to ensure that the GEM optimizer effectively used the information the side-effects package provided. Thanks also to John Henning of the CSD Performance Group and Jeannie Lieb of the GEM team for their help using the SPECint95 benchmark suite. A final word of thanks goes to Bob Morgan for suggesting that I write this paper and to my management for supporting my doing so.

References and Notes

1. R. Wilson and M. Lam, "Efficient Context-Sensitive Pointer Analysis for C Programs," Proceedings of the ACM SIGPLAN '95 Conference on Programming Language Design and Implementation, La Jolla, Calif. (June 1995): 1–12.

2. D. Coutant, "Retargetable High-Level Alias Analysis," Proceedings of the 13th Annual Symposium on Principles of Programming Languages, St. Petersburg Beach, Fla. (January 1986): 110–118.

3. A. Diwan et al., "Type-Based Alias Analysis," Proceedings of the 1998 ACM SIGPLAN Conference on Programming Language Design and Implementation, Montreal, Canada (June 1998): 106–117.

4. Joint Technical Committee ISO/IEC JTC 1, "The C Programming Language," International Standard ISO/IEC 9899:1990, section 6.3 Expressions.

5. "Working Paper for Draft Proposed International Standard for Information Systems—Programming Language C++," WG21/N1146, November 1997, section 3.10.

6. D. Blickstein et al., "The GEM Optimizing Compiler System," Digital Technical Journal, vol. 4, no. 4 (Special Issue, 1992): 121–136.

7. R. Crowell et al., "The GEM Loop Transformer," Digital Technical Journal, vol. 10, no. 2, accepted for publication.

8. A. Aho, R. Sethi, and J. Ullman, Compilers: Principles, Techniques, and Tools (Reading, Mass.: Addison-Wesley, 1986): 104.

9. Information about the SPEC benchmarks is available from the Standard Performance Evaluation Corporation at http://www.specbench.org/.

10. J. Anderson et al., "Continuous Profiling: Where Have All the Cycles Gone?" Proceedings of the Sixteenth ACM Symposium on Operating System Principles, Saint-Malo, France (October 1997): 15–26.

11. A. Srivastava and A. Eustace, "ATOM: A System for Building Customized Program Analysis Tools," Proceedings of the ACM SIGPLAN '94 Conference on Programming Language Design and Implementation, Orlando, Fla. (June 1994): 196–205.

12. UMIPS-V Reference Manual (pixie and pixstats) (Sunnyvale, Calif.: MIPS Computer Systems, 1990).

Biography

August G. Reinig
August Reinig is a principal software engineer, currently working on debugger support in the DIGITAL C++ compiler. In addition to his work on the DEC C and C++ side-effects package, August implemented a Java-based distributed test system for the DEC C and DIGITAL C++ compilers and a parallel build system for the DEC C and DIGITAL C++ compilers. The distributed test system simultaneously runs multiple tests on different machines and is fault tolerant. Before joining the DEC C and C++ team, he contributed to an advanced development incremental compiler project, which led to two patents, "Method and Apparatus for Software Testing Using a Testing Technique to Test Compilers" and "Method and Apparatus for Testing Software." He earned a B.S. in mathematics (magna cum laude) from Dartmouth College in 1980 and an M.S. in computer science from Harvard University in 1997. He is a member of Phi Beta Kappa.

Philip H. Sweany
Steven M. Carr
Brett L. Huber

Compiler Optimization for Superscalar Systems: Global Instruction Scheduling without Copies

The performance of instruction-level parallel Many oftoday's computer applications require compu­ systems can be improved by compiler programs tation power not easily achieved by computer architec­ that order machine operations to increase tures that provide little or no parallelism. A promising alternative is the parallel architecture, more specifically, system parallelism and reduce execution time. the instruction-level parallel (ILP) architecture, which The optimization, called instruction scheduling, increases computation during each machine cycle. ILP is typically classified as local scheduling if only computers allow parallel computation of the lowest basic-block context is considered, or as global level machine operations within a single instruction scheduling if a larger context is used. Global cycle, including such operations as memory loads and scheduling is generally thought to give better stores, integer additions, and floating-point multiplic:�­ tions. ILP architectures, like conventional architectures, results. One global method, dominator-path contain multiple fu nctional units and pipclined fi.mc­ scheduling, schedules paths in a function's tional units; but, they have a singJc progr:�m counter dominator tree. Unlike many other global and operate on a single instruction stream. Compaq scheduling methods, dominator-path schedul­ Computer Corporation's AlphaServer system, based on ing does not require copying of operations the Alpha 21164 microprocessor, is :�n example of an to preserve program semantics, making this ILP machine. To effectively usc parallel hardware and obtain method attractive for superscalar architectures performance ad van tagcs, compiler programs must that provide a limited amount of instruction­ idcntif)r the appropriate level of parallelism. For ILP level parallelism. 
In a small test suite for the architectures, the compiler must order the single Alpha 21164 superscalar architecture, dominator­ instruction stream such that multiple, low-level opera­ path scheduling produced schedules requiring tions execute simultaneously whenever possi ble. This 7.3 percent less execution time than those pro­ ordering by the compiler of machine operations to effectively use an I LP architecture's increased paral­ duced by local scheduling alone. lelism is called instruction schedulin,r, . It is an opti­ mization not usuallv ro und in compilers fo r non-ILP arch i tcctu res. Instruction scheduling is classified as local if it considers code only within a basic block and ,r, loha! if it schedules code across multiple bJsic blocks. A dis­ advantage to local instruction scheduling is its inability to consider context from surrounding blocks. \Vhile local scheduling can find parallelism within a basic block, it can do nothing to exploit parallelism bel:\veen basic blocks. Generally, global scheduling is preferred because it can take advantage of added program parJl­ lelism available when the compiler is :�!lowed to move code across basic block bmmdJries. Tj aden and Flynn,' to r example, fo und parallelism within a basic block quite limited . Using a test suite of scientific programs, they measured an average parallelism of 1.8 within basic blocks. In similar experiments on scientific pro-

58 Digital Tc chtlical journal Vo l. 10 No. I 1998 grams in which the compiler moved code across basic later than Y. These DOD edges are b;�sedon the fo rmal­ block boundaries, Nicolau and Fisher' ro und paral­ ism of data dependence analysis. There are tl1ree basic lelism that ranged from4 to a virtuallyunlim ited num­ types of data dependence, as described by Padua et al .'' ber, with an average of90 fo r the entire test suite. • Flow dependence, also called b· ue dependence or Trace scheduling'' is a global scheduling technique data dependence. A DDD node M, is flow depen­ that attempts to optimize fi:equently executed paths of dent on DDD node M, ifM, executes before M, and a program, possibly at t11e expense of less frequently lvLwrites to some memory location read by M,. executed pat11s . Trace scheduling exploits parallclis� • Antidependence, also called fa lse dependence. A within sequential code by allowing massive migration of DDD node M2 is antidependent on DDD node M, operationsacross basic block bounda.ties during schedul­ if M, executes before Mz and M2 writes to a mem­ ing. By addressing thisla rger scheduling context (many ory location read by M , , thereby destroying the basic blocks), trace scheduling can produce better sched­ ules tlun teclmiques that address the smallercontext of a value needed by M,. single block. To ensure the program sema.t1tics are not • Output dependence. A DDD node M, is output changed by interblock motion, trace scheduling inserts dependent on ODD node M, ifM, executes before copies of operations that move across block boundaties. M2 and M1 and M, both write to the same location. Such copies, necessary to ensure program semantics, are To fa cilitate determination <1 11d manipulation of called wmpm1sation copies. 
data dependence, the compiler maintains, fo r each The research described here is driven by a desire to DDD node, a set of all memory locations used (read) develop a global instruction scheduling technique and all memory locations defined (written) by that that, like trace scheduling, allows operations to cross particular DDD node. block boundaries to find good schedules and that, Once the DDD is constructed, the second phase unlike trace scheduling, does not require insertion of begins when list scheduling orders the graph's nodes compensation copies. Like trace scheduling, DPS first into the shortest sequence of insb· uctions, subject to defines a multiblock context fo r scheduling and then ( 1) the constraints in the graph, and (2) the resource uses a local instruction scheduler to treat the larger limitations in the machine (i.e., a machine is typically context like a single basic block. Such a techniq ue pro­ umited to holding only a single value at any ti me). I� vides effective schedules and avoids the performance genera! list scheduling, :.1.nordered Jist of tasks, called a cost of executing compensation copies. The global pnoriz)l list, is constructed . The priority list takes its scheduling technique described here is based on the name fr om the ta ct that tasks are r:mked such that those dominator relation * among the basic blocks of a fu nc­ with the highest priority are chosen first. In the context tion and is called dominator-path scheduling (DPS). oflocal instruction scheduling, the priority list contains DDD nodes, all of whose predecessors have already Local Instruction Scheduling been included in the schedule being constructed .

Since DPS relies on a local instruction scheduler we Expressions, Statements, and Operations begin with a brief discussion of the local schedt;ling problem. As the name implies, local instruction sched­ Within the context of this paper, we discuss algorithms uling attempts to maximize parallelism within each fo r code motion. Before going further, we need to basic block of a fu nction's control rl ow graph. In gen­ ensure common understanding among our readers tor eral, this optimization problem is NP-complete.' our use of terms such <�S expressions. statements. and However, in practice, heuristics achieve good results. operations. To start, we consider a computer program ( L..1.ndskov et al.'' give a good survey of early instruction to be a ltst of operations, each of which (possibly) scheduling algorithms. Al lan et aF describe how one computes a right-hand side (rhs) v;�lue and assigns the might build a retargetable local instruction scheduler.) rhs value to a memory location represented by a left­ L1st schedulinp, " is a general method often used to r hand side (lhs) variable. This can be expressed as local instruction scheduling. Briefly, list scheduling _ typtc::d ly requtres two phases. The first phase builds A�E a directed acyclic graph (DAG), c<�lled the d:J.tJ. depen­ where A represents a single memory location and E dence DAG (DDD), to r each basic block in the represents an expression with one or more operators fu nction. DDD nodes represent operations to be and an appropri:ue number of oper;�nds. During dif­ scheduled . The DDD's directed edges indicate that a fe rent phases of a compiler, operations might be repre­ node X preceding a node Y constrains X to occur no sented <�s

• Source code, a high-level language such as C

• Intermediate statements, a linear form of three-address code such as quads or n-tuples

*A basic block, D, dominates another block, B, if every path from the root of the control-flow graph of a function to B must pass through D.

• DDD nodes, nodes in a DDD, ready to be scheduled by the instruction scheduler

Important to note about operations, whether represented as intermediate statements, source code, or DDD nodes, is that operations include both a set of definitions and a set of uses. Expressions, in contrast, represent the rhs of an operation and, as such, include uses but not definitions. Throughout this paper, we use the terms statement, intermediate statement, operation, and DDD node interchangeably, because they all represent an operation, with both uses and definitions, albeit generally at different stages of the compilation process. When we use the term expression, however, we mean an rhs with uses only and no definition.

Dominator Analysis Used in Code Motion

In order to determine which operations can move across basic block boundaries, we need to analyze the source program. Although there are some choices as to the exact analysis to perform, dominator-path scheduling is based upon a formalism first described by Reif and Tarjan. We summarize Reif and Tarjan's work here and then discuss the enhancements needed to allow interblock movement of operations.

In their 1981 paper, Reif and Tarjan provide a fast algorithm for determining the approximate birthpoints of expressions in a program's flow graph. An expression's birthpoint is the first block in the control flow graph at which the expression can be computed, and the value computed is guaranteed to be the same as in the original program. Their technique is based upon fast computation of the idef set for each basic block of the control flow graph. The idef set for a block B is that set of variables defined on a path between B's immediate dominator and B. Given that the dominator relation for the basic blocks of a function can be represented as a dominator tree, the immediate dominator, IDOM, of a basic block B is B's parent in the dominator tree.

Expression birthpoints are not sufficient to allow us to safely move entire operations from a block to one of its dominators because birthpoints address only the movement of expressions, not definitions. Operations in general include not only a computation of some expression but the assignment of the value computed to a program variable. Ensuring a "safe" motion for an expression requires only that no expression operand move above any possible definition of that operand, which would change the program semantics. A similar requirement is necessary, but not sufficient, for the variable to which the value is being assigned. In addition to not moving A above any previous definition of A, A cannot move above any possible use of A. Otherwise, we run the risk of changing A's value for that previous use. Thus, dominator analysis computes the iuse set for each basic block as well as the idef set. The iuse set for a block, B, is that set of variables used on some path between B's immediate dominator and B. Using the idef and iuse sets, dominator analysis computes an approximate birthpoint for each operation.

In this paper, we use the term dominator analysis to mean the analysis necessary to allow code motion of operations while disallowing compensation copies. Additionally, we use the term dominator motion for the general optimization of code motion based upon dominator analysis.

Enhancing the Reif and Tarjan Algorithm

By enhancing Reif and Tarjan's algorithm to compute birthpoints of operations instead of expressions, we make several issues important that previously had no effect upon Reif and Tarjan's algorithm. This section motivates and describes the information needed to allow dominator motion, including the use, def, iuse, and idef sets for each basic block. An algorithmic description of this dominator analysis information is included in the section Overview of Dominator-Path Scheduling and the Algorithm for Interblock Motion.

When we allow code motion to move intermediate statements (or just expressions) from a block to one of its dominators, we run the risk that the statement (expression) will be executed a different number of times in the dominator block than it would have been in its original location. When we move only expressions, the risk is acceptable (although it may not be efficient to move a statement into a loop) since the value needed at the original point of computation is preserved. Relative to program semantics, the number of times the same value is computed has no effect as long as the correct value is computed the last time. This accuracy is guaranteed by expression birthpoints. Consider also the consequences of moving an expression from a block that is never executed for some particular input data. Again, it may not be efficient to compute a value never used, but the computation does not alter program semantics.

When dominator motion moves entire statements, however, the issue becomes more complex. If the statement moved assigns a new value to an induction variable, as in the following example,

n = n + 1

dominator motion would change n's final value if it moved the statement to a block where the execution frequency differed from that of its original block. We could alleviate this problem by prohibiting motion of any statement for which the use and def sets are not disjoint, but the possibility remains that a statement may define a variable based indirectly upon that variable's previous value. To remedy the more general problem, we disallow motion of any statement S whose def set intersects with those variables that are used-before-defined in the basic block in which S resides.

Suppose the optimizer moves an intermediate statement that defines a global variable from a block that may never be executed for some set of input data into a dominator block that is executed at least once for the same input data. Then the optimized version has defined a variable that the unoptimized function did not, possibly changing program semantics. We can be sure that such motion does not change the semantics of the function being compiled; but there is no mechanism, short of compiling the entire program as a single unit, to ensure that defining a global variable in this function will not change the value used in another function. Thus, to be conservative and ensure that it does not change program semantics, dominator motion prohibits interblock movement of any statement that defines a global variable. At first glance, it may seem that this prohibition cripples dominator motion's ability to move any intermediate statements at all; but we shall see that such is not the case.

One final addition to Reif and Tarjan information is required to take care of a subtle problem. As discussed above, dominator analysis uses the idef and iuse sets to prevent illegal code motion. The use of these sets was assumed to be sufficient to ensure the legality of code motion into a dominator block; unfortunately, this is not the case. The problem is that a definition might pass through the immediate dominator of B to reach a use in a sibling of B in the dominator tree. If there were a definition of this variable in B, but the variable was not defined on any path from the immediate dominator, there would be nothing in dominator analysis to prevent the definition from being moved into the dominator. But that would change the program's semantics.

Figure 1 shows the control-flow graph for a function called findmax(), with only the statements referring to register r7. Register r7 is defined in blocks B3 and B7, and referenced in B9. This means that r7 is live-out of B5 and live-in to B8, but not live-in to B7; there is a definition of r7 in B3 that reaches B8. Because there is no definition or use between B7 and its immediate dominator B5, the idef and iuse sets of B7 are empty; thus, dominator analysis, as described above, would allow the assignment of r7 to move upward to block B5. This motion is illegal; it changes the definition in B3. Moving the operation from B7 to B5 changes the conditional assignment of r7 to an unconditional one.

Figure 1
Control Flow Graph for the Function findmax()

To prevent this from happening, we can insert the variable into the iuse set of the block B in which we wish the statement to remain. We do not, however, want to add to the iuse set unnecessarily. The solution is to add each variable, V, that is live-in to any of B's siblings in the dominator tree, but not live-in to B, to B's iuse set. This will prevent any definition of V that might exist in B from moving up. If there is a definition of V in B, but V is live-in to B, there must be some use of V in B before the definition, so it could not move upward in any case.

Measurement of Dominator Motion

To measure the motion possible in C programs, Sweany defined dominator motion as the movement of each intermediate statement to its birthpoint as defined by dominator analysis, and measured the number of dominator blocks each statement jumps during such movement. Sweany's choice of intermediate statements (as contrasted with source code, assembly language, or DDD nodes) is attributed to the lack of machine resource constraints at that level of program abstraction. He envisioned dominator motion as an upper bound on the motion available in C programs when compensation copies are not permitted. In the test suite of 12 C programs compiled, more than 25 percent of all intermediate statements moved at least one dominator block upwards toward the root of the dominator tree. One function allowed more than 50 percent of the statements to be hoisted an average of nearly eight dominator blocks. The considerable amount of motion (without copies) available at the intermediate statement level of program abstraction provided us with the motivation to use similar analysis techniques to facilitate global instruction scheduling.

Overview of Dominator-Path Scheduling and the Algorithm for Interblock Motion

Since experiments show that dominator analysis allows considerable code motion without copies, we chose to use dominator analysis as the basis for the instruction scheduling algorithm described here, namely dominator-path scheduling. As noted above, DPS is a global instruction scheduling method that does not require copies of operations that move from one basic block to another. DPS performs global instruction scheduling by treating a group of basic blocks found on a dominator tree path as a single block, scheduling the group as a whole. In this regard, it resembles trace scheduling, which schedules adjacent basic blocks as a single block. DPS's foundation is scheduling instructions while moving operations among blocks according to both the opportunities provided by and the restrictions imposed by dominator analysis.

The question arises as to how to exploit dominator analysis information to permit code motion at the instruction level during scheduling. DPS is based on the observation that we can use idef and iuse sets to allow operations to move from a block to one of its dominators during instruction scheduling. Instruction scheduling can then choose the most advantageous position for an operation that is placed in any one of several blocks. Because machine operations are incorporated in nodes of the DDD used in scheduling and, like intermediate statements, DDD nodes are represented by def and use sets, the same analysis performed on intermediate statements can also be applied to a basic block's DDD nodes.

The same motivation that drives trace scheduling, namely that scheduling one large block allows better use of machine resources than scheduling the same code as several smaller blocks, also applies to DPS. In contrast to trace scheduling, DPS does not allow motion of DDD nodes when a copy of a node is required and does not incur the code explosion due to copying that trace scheduling can potentially produce. For architectures with moderate instruction-level parallelism, DPS may produce better results than trace scheduling, because the more limited motion may be sufficient to make good use of machine resources, and unlike trace scheduling, no machine resources are devoted to executing semantic-preserving operation copies.

Much like traces,* the dominator path's blocks can be chosen by any of several methods. One method is a heuristic choice of a path based on length, nesting depth, or some other program characteristic. Another is programmer specification of the most important paths. A third is actual profiling of the running program. We visit this issue again in the section Choosing Dominator Paths. First, however, we need to discuss the algorithmic details of DPS.

Once DPS selects a dominator path to schedule, it requires a method to combine the blocks' DDDs into a single DDD for the entire dominator path. In our compiler, this task is performed by a DDD coupler, which is designed for the purpose. Given the DDD coupler, DPS proceeds by repeatedly

• Choosing a dominator path to schedule

• Using the DDD coupler to combine each block's DDD on the chosen dominator path

• Scheduling the combined DDD as a single block

The dominator-path scheduling algorithm, detailed in this section, is summarized in Figures 2 and 3.

A significant aspect of the DPS process is to ensure "appropriate" interblock motion of DDD nodes and to prohibit "illegal" motion. As noted earlier, the combined DDD for a dominator path includes control flow. Therefore, when DPS schedules a group of blocks represented by a single DDD, it needs a mechanism to map correctly the scheduled instructions to the basic blocks. The mechanism is easily accomplished by the addition of two special nodes to each block's DDD. Called BlockStart and BlockEnd, these special nodes represent the basic block boundaries. Since dominator-path scheduling does not allow branches to move across block boundaries, each BlockStart and BlockEnd node is initially "tied" (with DDD arcs) to the branch statement of the block, if any.

Because BlockStart and BlockEnd are nodes in the eventually combined DDD, they are scheduled like all other nodes of the combined DDD. After scheduling, all instructions between the instruction containing the BlockStart node for a block and the instruction containing the BlockEnd node for that block are considered instructions for that block. Next, DPS must ensure that the BlockStart and BlockEnd DDD nodes remain ordered (in the scheduled instructions) relative to one another and to the BlockStart and BlockEnd nodes for any other block. To do so, DPS adds use and def information to the nodes to represent a pseudoresource, BlockBoundary. Because each BlockStart node defines BlockBoundary and each BlockEnd node uses BlockBoundary, no BlockEnd node can be scheduled ahead of its associated BlockStart node (because of flow dependence). Also, a BlockStart node cannot be scheduled before its dominator block's BlockEnd node (because of antidependence). By establishing these imaginary dependencies, DPS ensures that the DDD coupler adds arcs between all BlockStart and BlockEnd nodes.

*Groups of blocks to be scheduled together in trace scheduling.
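The BlockBoundary pseudoresource described above can be mimicked in a few lines: BlockStart nodes define it, BlockEnd nodes use it, and ordinary dependence-drawing code then serializes block boundaries with no special cases. The node and set representations here are assumptions for illustration, not the compiler's actual data structures.

```python
# Build BlockStart/BlockEnd marker nodes for a sequence of blocks.
# Each node is (name, defs, uses); only the pseudoresource is tracked.
def boundary_nodes(block_ids):
    nodes = []
    for b in block_ids:
        nodes.append((f"start_{b}", {"BlockBoundary"}, set()))  # defines it
        nodes.append((f"end_{b}",   set(), {"BlockBoundary"}))  # uses it
    return nodes

# Standard dependence drawing over the node sequence:
# def-then-use is a flow dependence, use-then-def an antidependence.
def dependence_arcs(nodes):
    arcs = []
    for i, (n1, d1, u1) in enumerate(nodes):
        for n2, d2, u2 in nodes[i + 1:]:
            if d1 & u2:        # flow dependence
                arcs.append((n1, n2))
            if u1 & d2:        # antidependence
                arcs.append((n1, n2))
    return arcs
```

Running this over two blocks yields exactly the orderings the text argues for: each BlockEnd is tied after its BlockStart, and the dominator block's BlockEnd precedes the next BlockStart.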

Algorithm Dominator-Path Scheduling
Input:
  Function Control Flow Graph
  Dominator Tree
  Post-Dominator Tree
Output:
  Scheduled instructions for the function
Algorithm:
  While at least one Basic Block is unscheduled
    Heuristically choose a path B1, B2, ..., Bn in the Dominator Tree
      that includes only unscheduled Basic Blocks.
    Perform dominator analysis to compute IDef and IUse sets
    /* Build one DDD for the entire dominator path */
    CombinedDDD = B1
    For i = 2 to n
      T = InitializeTransitionDDD(Bi-1, Bi)
      CombinedDDD = Couple(CombinedDDD, T)
      CombinedDDD = Couple(CombinedDDD, Bi)
    Perform list scheduling on CombinedDDD
    Mark each block of DP scheduled
    Copy scheduled instructions to the Blocks of the path
      (instructions between the BlockStart and BlockEnd nodes
       for a Block are "written" to that Block)
  End While

Figure 2
Dominator-path Scheduling Algorithm
Looking back to dominator analysis, we see that interblock motion is prohibited if the operation being moved

• Defines something that is included in either the idef or iuse set

• Uses something included in the idef set for the block in which the operation currently resides

To obtain the same prohibitions in the combined DDD, we add the idef set for a basic block, B, to the def set of B's BlockStart node. Similarly, we add the iuse set for B to the use set of B's BlockStart node. Thus we enforce the same restriction on movement that dominator analysis imposed upon intermediate statements and ensure that any interblock motion preserves program semantics. In a similar manner, DPS includes the restrictions on movement of operations that define either global variables or induction variables. Figure 3 gives an algorithmic description of the process of "doping" the BlockStart and BlockEnd nodes to prevent disallowed code motion.

DPS is complicated by factors not relevant for dominator motion of intermediate statements. Foremost is the complexity imposed by the bidirectional motion of operations that instruction scheduling allows. In dominator motion, intermediate statements move in only one direction, i.e., toward the top of the function's control flow graph, not from a dominator block to a dominated one. This one-directional motion is reasonable when attempting to move intermediate statements because one statement's movement will likely open possibilities for more motion in the same direction by other statements. When statements move in different directions, one statement's motion might inhibit another's movement in the opposite direction. The goal of dominator motion is to move statements as far as possible in the control flow graph. In contrast, the goal of DPS is not to maximize code motion, but rather to find, for each operation, O, the location for O that will yield the shortest schedule. Thus our goal has changed from that of dominator motion. To gain the full benefit from DPS, we wish to allow operations to move past block boundaries in either direction.
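The two prohibitions on upward motion listed above reduce to a pair of set intersections. A minimal illustrative predicate (all names assumed, with the idef/iuse sets supplied by the dominator analysis described earlier):

```python
# An operation may move from its block up into a dominator only if it
# neither defines anything in idef | iuse nor uses anything in idef.
def may_move_up(op_defs, op_uses, idef, iuse):
    if op_defs & (idef | iuse):   # its definition would interfere
        return False
    if op_uses & idef:            # an operand could be redefined on the way up
        return False
    return True
```

Adding idef to the BlockStart node's def set and iuse to its use set, as the text describes, makes the ordinary dependence-drawing machinery enforce this same predicate automatically.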
To permit bidirectional motion, we use the post-dominator relation, which says that a basic block, PD, is a post-dominator of a basic block B if all paths from B to the function's exit must pass through PD. Using this strategy, we similarly define post-idef and post-iuse sets. In

Algorithm InitializeTransitionDDD(B1, B2)
Input:
  A Transition DDD template, with a Dummy DDDNode for B1's block end
    and one for B2's block start
  Two basic blocks, B1 and B2, that we wish to couple
  Dominator Tree
  Post-Dominator Tree
  The following dataflow information:
    Def, Use, IDef, and IUse sets for B1 and B2
    Used-Before-Defined set for B2
    Post-IDef and Post-IUse sets for B1 and B2
    B2's "sibling" set, defined to include any variable live-in to a
      dominator-tree sibling of B2, but not live-in to B2
  A basic block DDD for each of B1 and B2
Output:
  An initialized Transition DDD, T
Algorithm:
  T = TransitionDDD
  /* "Fix" set for global and induction variables. */
  Add set of global variables to B2's IUse
  Add B2's Used-Before-Defined to B2's IUse
  Add B2's sibling set to B2's IUse

  If B2 does not post-dominate B1
    Add B1's Use set to T's BlockEnd Def set
    Add B1's Def set to T's BlockEnd Use set
  Else
    Add B1's Post-IDef set to T's BlockEnd Def set
    Add B1's Post-IUse set to T's BlockEnd Use set
  Add B2's IDef set to T's BlockStart Def set
  Add B2's IUse set to T's BlockStart Use set
  Return T

Figure 3
Initialize Transition DDD Algorithm
fact, it is not difficult to compute all these quantities for a function. The simplest way is to logically reverse the direction of all the control flow graph arcs and perform dominator analysis on the resulting graph. Having computed the post-dominator tree, DPS chooses dominator paths such that the dominated node is a post-dominator of its immediate predecessor in a dominator path. This choice allows operations to move "freely" in both directions. Of course, this may be too limiting on the choice of dominator paths. To allow for the possibility that nodes in a dominator path will not form a post-dominator relation, DPS needs a mechanism to limit bidirectional motion when needed. Again, we rely on the technique of adding dependencies to the combined DDD. In this case (assuming that DPS is scheduling paths in the forward dominator tree), for any basic block, B, whose successor, S, in the forward dominator path does not post-dominate B, DPS adds B's def set to the use set of the BlockEnd node associated with B. In similar fashion, we add B's use set to B's BlockEnd node's def set. This technique prevents any DDD node originally in B from moving downward in the dominator path.

Choosing Dominator Paths

DPS allows code movement along any dominator path, but there are many ways to select these paths. An investigation of the effects of dominator-path choice on the efficiency of generated schedules tells us that the choice of path is too important to be left to arbitrary selection; twice the average percent speedup* for several functions can often be achieved with a simple,

*(unoptimized_speed - optimized_speed)/unoptimized_speed

well-chosen heuristic. Some functions have a potential percent speedup almost four times the average. Thus, it is important to find a good, generally applicable heuristic to select the dominator paths.

Unfortunately, it is not practical to schedule all of the possible partitionings for large functions. If we allow a basic block to be included in only one dominator path, the formula for the number of distinct partitionings of the dominator tree is

    ∏ n∈N [outdeg(n) + 1]

where N is the set of nodes of the dominator tree. Although the number of possible paths is not prohibitive for small dominator trees, larger trees have a prohibitively large number. For example, whetstone's main(), with 49 basic blocks, has almost two trillion distinct partitionings.

To evaluate differences in dominator-path choices, we scheduled a group of small functions with DPS using every possible choice of dominator path. The target architecture for this study was a hypothetical 6-wide long-instruction-word (LIW) machine, which was simulated and in which it was assumed that all cache accesses were hits.

The results of exhaustive dominator-path testing show, as expected, that varying the choice of dominator paths significantly affects the performance of scheduling. For all functions of at least two basic blocks, DPS showed improvement over local scheduling for at least one of the possible choices of dominator paths. Table 1 shows the best, average, and worst percent speedup over local scheduling found for all functions that had a "best" speedup of over 2 percent; it also shows the speedup of the original implementation of DPS and the number of distinct dominator tree partitionings.

The original implementation of DPS included a single, simple heuristic to choose dominator paths. More specifically, to choose dominator paths within a group, G, of contiguous blocks at the same nesting level, the compiler continues to choose a block, B, to "expand." Expansion of B initializes a new dominator path to include B and adds B's dominators until no more can be added. The algorithm then starts another dominator path by expanding another (as yet unexpanded) block of G. The first block of G chosen to expand is the tail block, T, in an attempt to obtain as long a dominator path as possible.

Unfortunately, not all functions are small enough to be tested by performing DPS for each possible partitioning of the dominator tree. Therefore, we defined 37 different heuristic methods of choosing dominator trees, based upon groupings of six key heuristic factors. The maximum path lengths of the basic guidelines were adjusted to produce actual heuristics. We used the following heuristic factors, from which the individual heuristics were constructed; each seemed likely either to mimic the observed characteristics of the best path selection or to allow more freedom of code motion and, therefore, more flexibility in filling "gaps."

• One nesting level: Group blocks from the same nesting level of a loop. Each block is in the same strongly connected component, so the blocks tend to have similar restrictions to code motion. For a group of blocks to be a strongly connected component, there must be some path in the control flow graph from each node in the component to all the other nodes in the component. Since the function will probably repeat the loop, it seems likely that the scheduler will be able to overlap blocks in it.
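The partitioning-count formula above, the product over all dominator-tree nodes of (outdegree + 1), is easy to check mechanically for a small tree; the child-list encoding here is an illustrative assumption.

```python
from math import prod

# Number of distinct partitionings of a dominator tree into paths,
# per the formula in the text: product of (outdegree + 1) over all nodes.
def partition_count(children):
    return prod(len(kids) + 1 for kids in children.values())
```

A four-node tree with root outdegree 2 and one internal node of outdegree 1 gives 3 * 1 * 2 * 1 = 6 partitionings.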

Table 1
Percent of Function Speedup Improvement Using DPS Path Choices over Local Scheduling

                 Percent Speedup                       No. Dominator
Function Name    Best    Average    Worst    Original  Tree Partitions

bubble           39.2    10.6       -0.1     11.7       72
readm            32.5     9.3       -0.2     32.5       48
solve            27.8     9.9       -0.2     27.8       96
queens           25.4     8.3       -0.4     -0.4       96
swaprow          23.1     5.8       -3.7     19.5       24
print(g)         22.0     9.1       -0.2     22.0        8
findmax          21.3     6.2       -0.3      8.7       18
copycol          18.5     5.6       -5.0     19.9        8
elim             14.3     2.3       -3.8     10.2      576
mult             13.7     2.1       -3.8     10.3       96
subst            12.9     2.4       -4.9      4.9       96
print(8)         12.5     6.2        0.0     12.5        8

• Longest path: Schedule the longest available path. This heuristic class allows the maximum distance for code motion.

• Postdominator: Follow the postdominator relation in the dominator tree. When a dominator block, P, is succeeded by a non-postdominator block, S, our compiler adds P's def set to the use set of P's BlockEnd node and the use set to the def set to prevent any code motion from P to S. If P is instead succeeded by its postdominator block, no such modification is necessary, and code would be allowed to move in both directions. Intuitively, the postdominator relation is the exact inverse of the dominator relation, so code can move down, into a postdominator, as it moves up into a dominator. Further, the simple act of adding nodes to the DDD will complicate list scheduling.

• idef size: Group by idef set size. The larger the idef size, the more interference there is to code motion. A small idef size will probably allow more code motion, so we try to add blocks with small idef sizes.

• Density: Group by operation density. We define the density of each basic block as the number of nodes in the DDD divided by the number of instructions required for local scheduling. A dense block already has close to its maximum number of operations; adding or removing operations will probably not improve the schedule. For this reason, we want to avoid scheduling dense blocks together. Two methods are tried: scheduling dense blocks with sparse blocks and putting sparse blocks together.

The heuristic factors were used to make individual heuristics by changing the limit on the possible number of blocks in a path. It was reasonable to set limits for four factors: postdominator, non-postdominator, idef size, and density. We tried path length limits in blocks of 2, 3, 4, 5, and unlimited, making a total of five heuristics from each heuristic factor.

Running DPS using each of the heuristic methods and comparing the efficiency of the resulting code leads to several conclusions about effective heuristics for choosing DPS's dominator paths. For some heuristics, we can achieve the best schedules for DPS by using paths that rarely exceed three blocks. For any particular class of heuristics, we can achieve the best schedule with paths limited to five blocks or fewer. Consequently, path lengths can be limited without lowering the efficiency of generated code, and longer paths, which increase scheduling time, can be avoided.

Since no one heuristic performed well for all functions, we advise using a combination of heuristics, i.e., schedule by using each of three heuristics and taking the best schedule. The "combined" heuristic includes the following:

• Instruction density, limit to five blocks

• One nesting level on path, limit to five blocks

• Non-postdominator, unlimited length

Frequency-based List Scheduling

Like some other global schedulers, DPS uses a local scheduler to schedule its meta-blocks. In doing so, the scheduler must recognize that instructions added to blocks with higher nesting levels are more costly than those added to blocks with lower nesting levels. Even within a loop, there exists the potential for considerable variation in the execution frequencies of different blocks in the meta-block due to control flow. Of course, variable execution frequency is not an issue in traditional local scheduling because, within the context of a single basic block, each DDD node is executed the same number of times, namely, once each time execution enters the block.

To address the issue of differing execution frequencies within meta-blocks scheduled as a single block by DPS, we investigated frequency-based list scheduling (FBLS), an extension of list scheduling that provides an answer to this difficulty by considering that execution frequencies differ within sections of the meta-blocks. FBLS uses a greedy method to place DDD nodes in the lowest-cost instruction possible. FBLS amends the basic list-scheduling algorithm by revising only the DDD node placement policy in an attempt to reduce the run-time cycles required to execute a meta-block. Unfortunately, although FBLS makes intuitive sense, we found that DPS produced worse schedules with FBLS than it produced with a naive local scheduling algorithm that ignored frequency differences within DPS's meta-blocks. Therefore, the current implementation of DPS ignores the execution frequency differences between basic blocks, both in choosing dominator paths to schedule and in scheduling those dominator-path meta-blocks.
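The greedy placement rule attributed to FBLS above can be sketched as choosing, among the slots where a node could legally be placed, the one whose block executes least often, breaking ties toward earlier cycles. The slot model and names are illustrative assumptions, not the paper's implementation.

```python
# FBLS placement policy in miniature: legal_slots is a list of
# (cycle, block) pairs where the DDD node may legally go; pick the slot
# in the cheapest (least frequently executed) block, then the earliest
# cycle among equals.
def fbls_place(legal_slots, block_freq):
    return min(legal_slots, key=lambda slot: (block_freq[slot[1]], slot[0]))
```

For example, a node that could sit in a loop body at cycle 0 or in the loop's preheader at cycle 2 is hoisted to the preheader, since the added instruction there executes far less often.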

Digital Technical Journal Vol. 10 No. 1 1998

Evaluation of Dominator-path Scheduling

To measure the potential of DPS to generate more efficient schedules than local scheduling for commercial superscalar architectures, we ran a small test suite of C programs on an Alpha 21164 server. The Alpha server is a superscalar architecture capable of issuing two integer and two floating-point instructions each cycle. Our compiler estimates the effectiveness of a schedule by modeling the 21164 as an LIW architecture with all operation latencies known at compile time. Of course, this model was used only within the compiler itself. Our results measured changes in 21164 execution time (measured with the UNIX "time" command) required for each program.

Our test suite of 14 C programs includes 8 programs that use integer computation only and 6 programs that include floating-point computation. We separated those groups because we see dramatic differences in DPS's performance when viewing integer and floating-point programs. To choose dominator paths, we used the combined heuristic recommended by Huber.14

Table 2 summarizes the results of tests we conducted to compare the execution times of programs using DPS scheduling with those using local scheduling only. The table lists the programs used in the test suite and the percent improvement in execution times for DPS-scheduled programs. The execution time measurements were made on an Alpha 21164 server running at 250 megahertz with data cache sizes of 8 kilobytes, 96 kilobytes, and 4 megabytes.

Looking at Table 2, we see that, in general, DPS improved the integer programs less than it improved the floating-point programs. The range of improvements for integer programs was from 0.7 percent for Dhrystone to 7.3 percent each for 8-Queens and for SymbolTable. Summing all the improvements and dividing by eight (the number of integer programs) gives an "average" of 4.7 percent improvement for the integer programs. DPS improved some of the floating-point programs even more significantly than the integer programs. The range of improvements for the six floating-point programs was from 3.7 percent for Dice (a simulation of rolling a pair of dice 10,000,000 times using a uniform random number generator) to 17.6 percent improvement for the finite difference program. The average for the six floating-point programs was 10.8 percent. This suggests, not surprisingly, that the Alpha 21164 provides more opportunities for global scheduling improvement when floating-point programs are being compiled.

Even within the six floating-point programs, however, we see a distinct bi-modal behavior in terms of execution-time improvement. Three of the programs range from 12.3 percent to 17.6 percent improvement, whereas three are below 10 percent (and two of those significantly below 10 percent). A reason for this wide range is the use of global variables. Remember that DPS forbids the motion of global variable definitions across block boundaries. This is necessary to ensure correct program semantics. It is hardly a coincidence that both Dice and Whetstone include only global floating-point variables, whereas Livermore's floating-point variables are mixed about half local and half global, and the three better performers use almost no global variables. Thus we conclude that, for floating-point programs with few global variables, we can expect improvements of roughly 12 to 15 percent in execution time. Inclusion of global variables and exclusion of floating-point values will, however, decrease DPS's ability to improve execution time for the Alpha 21164.

Table 2
Percent DPS Scheduling Improvements over Local Scheduling of Programs

Program                    Percent Execution Time Improvement
8-Queens                    7.3
SymbolTable                 7.3
BubbleSort                  5.0
Nsieve                      6.1
Heapsort                    6.0
Killcache                   2.6
TSP                         2.4
Dhrystone                   0.7
C integer average           4.7
Dice                        3.7
Whetstone                   5.4
Matrix Multiply            16.2
Gauss                      12.3
Finite Difference          17.6
Livermore                   9.3
C floating-point average   10.8
Overall average             7.3

Related Work

As we have discussed, local instruction scheduling can find parallelism within a basic block but cannot exploit parallelism between basic blocks. Several global scheduling techniques are available, however, that extract parallelism from a program by moving operations across block boundaries and subsequently inserting compensation copies to maintain program semantics. Trace scheduling3 was the first of these techniques to be defined. As previously mentioned, trace scheduling

requires compensation copies. Other "early" global scheduling algorithms that require compensation copies include Nicolau's percolation scheduling16,17 and Gupta's region scheduling.18 A recent and quite popular extension of trace scheduling is Hwu's SuperBlock scheduling.19,20 In addition to these more general, global scheduling methods, significant results have been obtained by software pipelining, a technique that overlaps iterations of loops to exploit available ILP. Allan et al.21 provide a good summary, and Rau22 provides an excellent tutorial on how modulo scheduling, a popular software pipelining technique, should be implemented. Promising recent techniques have focused on defining a meta-environment that includes both global scheduling and software pipelining. Moon and Ebcioglu23 present an aggressive technique that combines software pipelining and global code motion (with copies) into a single framework. Novak and Nicolau24 describe a sophisticated scheduling framework in which to place software pipelining, including alternatives to modulo scheduling. While providing a significant number of excellent global scheduling alternatives, none of these techniques provides global scheduling without the possibility of code expansion (copy code) as DPS does.

To address the issue of producing schedules without operation copies, Bernstein25-27 defined a technique he calls global instruction scheduling (GPS) that allows movement of instructions beyond block boundaries based upon the program dependence graph (PDG).28 In a test suite of four programs run on IBM's RS/6000, Bernstein's method showed improvement of roughly 7 percent over local scheduling for two of the programs, with no significant difference for the others.

Comparing DPS to Bernstein's method, we see that both allow for interblock motion without copies. Bernstein also allows for interblock movement requiring duplicates that DPS does not. Interestingly, Bernstein's later work27 does not make use of this ability to allow motion that requires duplication of operations, suggesting that, to date, he has not found such motion advisable for the RS/6000 architecture to which his techniques have been applied. Bernstein allows operation movement in only one direction, whereas DPS allows operations to move from a dominator block to a postdominator. This added flexibility is an advantage of DPS. Of possibly greater significance, DPS uses the local instruction scheduler to place operations. Bernstein uses a separate set of heuristics to move operations in the PDG and then uses a subsequent local scheduling pass to order operations within each block. Fisher3 argues that incorporating movement of operations into the scheduling phase itself provides better scheduling than dividing the interblock motion and scheduling phases. Based on that criterion alone, DPS has some advantages over Bernstein's method.

Conclusions

It is commonly accepted that to exploit the performance benefits of ILP, global instruction scheduling is required. Several varieties of global instruction scheduling exist, most requiring compensation copies to ensure proper program semantics when operations cross block boundaries during instruction scheduling. Although such global scheduling with compensation copies may be an effective strategy for architectures with large degrees of ILP, another approach seems reasonable for more limited architectures, such as currently available superscalar computers.

This paper outlines DPS, a global instruction scheduling technique that does not require compensation copies. Based on the fact that more than 25 percent of intermediate statements can be moved upward at least one dominator block in the control flow graph without changing program semantics, DPS schedules paths in a function's dominator tree as meta-blocks, making use of an extended local instruction scheduler to schedule dominator paths.

Experimental evidence shows that DPS does indeed produce more efficient schedules than local scheduling for Compaq's Alpha 21164 server system, particularly for floating-point programs that avoid the use of global variables. This work has demonstrated that considerable flexibility in placement of code is possible even when compensation copies are not allowed. Although more research is required to look into possible uses for this flexibility, the global instruction scheduling method described here (DPS) shows promise for ILP architectures.

Acknowledgments

This research was supported in part by an External Research Program grant from Digital Equipment Corporation and by the National Science Foundation under grant CCR-9308348.

References

1. G. Tjaden and M. Flynn, "Detection of Parallel Execution of Independent Instructions," IEEE Transactions on Computers, C-19(10) (October 1970): 889-895.

2. A. Nicolau and J. Fisher, "Measuring the Parallelism Available for Very Long Instruction Word Architectures," IEEE Transactions on Computers, 33(11) (November 1984): 968-976.

3. J. Fisher, "Trace Scheduling: A Technique for Global Microcode Compaction," IEEE Transactions on Computers, C-30(7) (July 1981): 478-490.

4. J. Ellis, Bulldog: A Compiler for VLIW Architectures (Cambridge, MA: MIT Press, 1985), Ph.D. thesis, Yale University (1984).

5. D. DeWitt, "A Machine-Independent Approach to the Production of Optimal Horizontal Microcode," Ph.D. thesis, University of Michigan, Ann Arbor, Mich. (1976).

6. D. Landskov, S. Davidson, B. Shriver, and P. Mallett, "Local Microcode Compaction Techniques," ACM Computing Surveys, 12(3) (September 1980): 261-294.

7. V. Allan, S. Beaty, B. Su, and P. Sweany, "Building a Retargetable Local Instruction Scheduler," Software--Practice & Experience, 28(3) (March 1998): 249-284.

8. E. Coffman, Computer and Job-Shop Scheduling Theory (New York: John Wiley & Sons, 1976).

9. D. Padua, D. Kuck, and D. Lawrie, "High-Speed Multiprocessors and Compilation Techniques," IEEE Transactions on Computers, C-29(9) (September 1980): 763-776.

10. A. Aho, R. Sethi, and J. Ullman, Compilers: Principles, Techniques, and Tools (Reading, MA: Addison-Wesley, 1986).

11. H. Reif and R. Tarjan, "Symbolic Program Analysis in Almost-Linear Time," SIAM Journal on Computing, 11(1) (February 1981): 81-93.

12. P. Sweany, "Interblock Code Motion without Copies," Ph.D. thesis, Computer Science Department, Colorado State University (1992).

13. R. Mueller, M. Duda, P. Sweany, and J. Walicki, "Horizon: A Retargetable Compiler for Horizontal Microarchitectures," IEEE Transactions on Software Engineering: Special Issue on Microprogramming, 14(5) (May 1988): 575-583.

14. B. Huber, "Path-Selection Heuristics for Dominator-Path Scheduling," Master's thesis, Department of Computer Science, Michigan Technological University (1995).

15. M. Bourke, P. Sweany, and S. Beaty, "Extending List Scheduling to Consider Execution Frequency," Proceedings of the 28th Hawaii International Conference on System Sciences (January 1996).

16. A. Nicolau, "Percolation Scheduling: A Parallel Compilation Technique," Technical Report TR85-678, Department of Computer Science, Cornell University (May 1985).

17. A. Aiken and A. Nicolau, "A Development Environment for Horizontal Microcode," IEEE Transactions on Software Engineering, 14(5) (May 1988): 584-594.

18. R. Gupta and M. Soffa, "Region Scheduling: An Approach for Detecting and Redistributing Parallelism," IEEE Transactions on Software Engineering, 16(4) (April 1990): 421-431.

19. S. Mahlke, W. Chen, W.-M. Hwu, B. Rau, and M. Schlansker, "Sentinel Scheduling for VLIW and Superscalar Processors," Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, Mass. (October 1992): 238-247.

20. C. Chekuri, R. Johnson, R. Motwani, B. Natarajan, B. Rau, and M. Schlansker, "Profile-Driven Instruction-Level-Parallel Scheduling with Application to Super Blocks," Proceedings of the 29th International Symposium on Microarchitecture (MICRO-29), Paris, France (December 1996): 58-67.

21. V. Allan, R. Jones, R. Lee, and S. Allan, "Software Pipelining," ACM Computing Surveys, 27(3) (September 1995).

22. B. Rau, "Iterative Modulo Scheduling: An Algorithm for Software Pipelining Loops," Proceedings of the 27th International Symposium on Microarchitecture (MICRO-27), San Jose, Calif. (December 1994): 63-74.

23. S.-M. Moon and K. Ebcioglu, "Parallelizing Nonnumerical Code with Selective Scheduling and Software Pipelining," ACM Transactions on Programming Languages and Systems, 18(6) (November 1996): 853-898.

24. S. Novak and A. Nicolau, "An Efficient Global Resource-Directed Approach to Exploiting Instruction-Level Parallelism," Proceedings of the 1996 International Conference on Parallel Architectures and Compilation Techniques (PACT '96), Boston, Mass. (October 1996): 87-96.

25. D. Bernstein and M. Rodeh, "Global Instruction Scheduling for Superscalar Machines," Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation, Toronto, Canada (June 1991): 241-255.

26. D. Bernstein, D. Cohen, and H. Krawczyk, "Code Duplication: An Assist for Global Instruction Scheduling," Proceedings of the 24th International Symposium on Microarchitecture (MICRO-24), Albuquerque, N. Mex. (November 1991): 103-113.

27. D. Bernstein, D. Cohen, Y. Lavon, and V. Rainish, "Performance Evaluation of Instruction Scheduling on the IBM RS/6000," Proceedings of the 25th International Symposium on Microarchitecture (MICRO-25), Portland, Oreg. (December 1992): 226-235.

28. J. Ferrante, K. Ottenstein, and J. Warren, "The Program Dependence Graph and Its Use in Optimization," ACM Transactions on Programming Languages and Systems, 9(3) (July 1987): 319-349.

Biographies

Brett L. Huber
Raised in Hope, Michigan, Brett earned B.S. and M.S. degrees in computer science at Michigan Technological University in Michigan's historic Keweenaw Peninsula. He is an engineer in the Software Development Systems group at Texas Instruments, Inc., and is currently developing an optimizing compiler for the TMS320C6x family of VLIW digital signal processors. Brett is a member of the ACM and an IEEE Computer Society Affiliate.

Philip H. Sweany
Associate Professor Phil Sweany has been a member of Michigan Technological University's Computer Science faculty since 1991. He has been investigating compiler techniques for instruction-level parallel (ILP) architectures, co-authoring several papers on instruction scheduling, register assignment, and the interaction between these two optimizations. Phil has been the primary designer and implementer of Rocket, a highly optimizing compiler that is easily retargetable for a wide range of ILP architectures. His research has been significantly assisted by grants from Digital Equipment Corporation and the National Science Foundation. Phil received a B.S. in computer science in 1983 from Washington State University, and M.S. and Ph.D. degrees in computer science from Colorado State University in 1986 and 1992, respectively.

Steven M. Carr
Steve Carr is an assistant professor in the Department of Computer Science at Michigan Technological University. The focus of his research at the university is memory-hierarchy management and optimization of instruction-level parallel architectures. Steve's research has been supported by both the National Science Foundation and Digital Equipment Corporation. He received a B.S. in computer science from Michigan Technological University in 1987 and M.S. and Ph.D. degrees from Rice University in 1990 and 1993, respectively. Steve is a member of ACM and an IEEE Computer Society Affiliate.


Maximizing Multiprocessor Performance with the SUIF Compiler

Mary W. Hall
Jennifer M. Anderson
Saman P. Amarasinghe
Brian R. Murphy
Shih-Wei Liao
Edouard Bugnion
Monica S. Lam

Parallelizing compilers for multiprocessors face many hurdles. However, SUIF's robust analysis and memory optimization techniques enabled speedups on three fourths of the NAS and SPECfp95 benchmark programs.

The affordability of shared memory multiprocessors offers the potential of supercomputer-class performance to the general public. Typically used in a multiprogramming mode, these machines increase throughput by running several independent applications in parallel. But multiple processors can also work together to speed up single applications. This requires that ordinary sequential programs be rewritten to take advantage of

the extra processors.1-4 Automatic parallelization with a compiler offers a way to do this.

Parallelizing compilers face more difficult challenges from multiprocessors than from vector machines, which were their initial target. Using a vector architecture effectively involves parallelizing repeated arithmetic operations on large data streams (for example, the innermost loops in array-oriented programs). On a multiprocessor, however, this approach typically does not provide sufficient granularity of parallelism: Not enough work is performed in parallel to overcome processor synchronization and communication overhead. To use a multiprocessor effectively, the compiler must exploit coarse-grain parallelism, locating large computations that can execute independently in parallel.

Locating parallelism is just the first step in producing efficient multiprocessor code. Achieving high performance also requires effective use of the memory hierarchy, and multiprocessor systems have more complex memory hierarchies than typical vector machines: They contain not only shared memory but also multiple levels of cache memory.

These added challenges often limited the effectiveness of early parallelizing compilers for multiprocessors, so programmers developed their applications from scratch, without assistance from tools. But explicitly managing an application's parallelism and memory use requires a great deal of programming knowledge, and the work is tedious and error-prone. Moreover, the resulting programs are optimized for only a specific machine. Thus, the effort required to develop efficient parallel programs restricts the user base for multiprocessors.

This article describes automatic parallelization techniques in the SUIF (Stanford University Intermediate

© 1996 IEEE. Reprinted, with permission, from Computer, December 1996, pages 84-89. This paper has been modified for publication here with the addition of the section The Status and Future of SUIF.

Format) compiler that result in good multiprocessor performance for array-based numerical programs. We provide SUIF performance measurements for the complete NAS and SPECfp95 benchmark suites. Overall, the results for these scientific programs are promising. The compiler yields speedups on three fourths of the programs and has obtained the highest ever performance on the SPECfp95 benchmark, indicating that the compiler can also achieve efficient absolute performance.

Finding Coarse-grain Parallelism

Multiprocessors work best when the individual processors have large units of independent computation, but it is not easy to find such coarse-grain parallelism. First the compiler must find available parallelism across procedure boundaries. Furthermore, the original computations may not be parallelizable as given and may first require some transformations. For example, experience in parallelizing by hand suggests that we must often replace global arrays with private versions on different processors. In other cases, the computation may need to be restructured; for example, we may have to replace a sequential accumulation with a parallel reduction operation.

It takes a large suite of robust analysis techniques to successfully locate coarse-grain parallelism. General and uniform frameworks helped us manage the complexity involved in building such a system into SUIF. We automated the analysis to privatize arrays and to recognize reductions to both scalar and array variables. Our compiler's analysis techniques all operate seamlessly across procedure boundaries.

Scalar Analyses

An initial phase analyzes scalar variables in the programs. It uses techniques such as data dependence analysis, scalar privatization analysis, and reduction recognition to detect parallelism among operations with scalar variables. It also derives symbolic information on these scalar variables that is useful in the array analysis phase. Such information includes constant propagation, induction variable recognition and elimination, recognition of loop-invariant computations, and symbolic relation propagation.

Array Analyses

An array analysis phase uses a unified mathematical framework based on linear algebra and integer linear programming. The analysis applies the basic data dependence test to determine if accesses to an array can refer to the same location. To support array privatization, it also finds array dataflow information that determines whether array elements used in an iteration refer to the values produced in a previous iteration. Moreover, it recognizes commutative operations on sections of an array and transforms them into parallel reductions. The reduction analysis is powerful enough to recognize commutative updates of even indirectly accessed array locations, allowing parallelization of sparse computations.

All these analyses are formulated in terms of integer programming problems on systems of linear inequalities that represent the data accessed. These inequalities are derived from loop bounds and array access functions. Implementing optimizations to speed up common cases reduces the compilation time.

Interprocedural Analysis Framework

All the analyses are implemented using a uniform interprocedural analysis framework, which helps manage the software engineering complexity. The framework uses interprocedural dataflow analysis, which is more efficient than the more common technique of inline substitution. Inline substitution replaces each procedure call with a copy of the called procedure, then analyzes the expanded code in the usual intraprocedural manner. Inline substitution is not practical for large programs, because it can make the program too large to analyze.

Our technique analyzes only a single copy of each procedure, capturing its side effects in a function. This function is then applied at each call site to produce precise results. When different calling contexts make it necessary, the algorithm selectively clones a procedure so that code can be analyzed and possibly parallelized under different calling contexts (as when different constant values are passed to the same formal parameter). In this way the full advantages of inlining are achieved without expanding the code indiscriminately.

In Figure 1 the boxes represent procedure bodies, and the lines connecting them represent procedure calls. The main computation is a series of four loops to compute three-dimensional fast Fourier transforms. Using interprocedural scalar and array analyses, the SUIF compiler determines that these loops are parallelizable. Each loop contains more than 500 lines of code spanning up to nine procedures with up to 42 procedure calls. If this program had been fully inlined, the loops presented to the compiler for analysis would have each contained more than 86,000 lines of code.

Memory Optimization

Numerical applications on high-performance microprocessors are often memory bound. Even with one or more levels of cache to bridge the gap between processor and memory speeds, a processor may still waste half its time stalled on memory accesses because it frequently references an item not in the cache (a cache miss).


Figure 1
The compiler discovers parallelism through interprocedural array analysis. Each of the four parallelized loops at left consists of more than 500 lines of code spanning up to nine procedures (boxes) with up to 42 procedure calls (lines).
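The framework described above can be illustrated with a toy sketch. This is hypothetical code, not SUIF's API: it reduces a procedure's side effects to the set of global variables it may write and propagates summaries over the call graph to a fixed point, rather than inlining bodies.

```python
# Hypothetical sketch of summary-based interprocedural analysis: analyze each
# procedure once, record which globals it may write, then propagate callee
# summaries to callers until nothing changes. One summary is reused at every
# call site, so the analyzed program never grows the way an inlined one would.

def side_effect_summaries(procs):
    """procs: name -> {'writes': set of global names, 'calls': [callee names]}"""
    summaries = {name: set(info['writes']) for name, info in procs.items()}
    changed = True
    while changed:                      # fixed-point iteration over the call graph
        changed = False
        for name, info in procs.items():
            for callee in info['calls']:
                extra = summaries[callee] - summaries[name]
                if extra:
                    summaries[name] |= extra
                    changed = True
    return summaries
```

A real system also specializes summaries by calling context (the selective cloning described above); this sketch keeps a single context-independent summary per procedure.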

This memory bottleneck is further exacerbated on multiprocessors by their greater need for memory traffic, resulting in more contention on the memory bus.

An effective compiler must address four issues that affect cache behavior:

• Communication: Processors in a multiprocessor system communicate through accesses to the same memory location. Coherent caches typically keep the data consistent by causing accesses to data written by another processor to miss in the cache. Such misses are called true sharing misses.

• Limited capacity: Numeric applications tend to have large working sets, which typically exceed cache capacity. These applications often stream through large amounts of data before reusing any of it, resulting in poor temporal locality and numerous capacity misses.

• Limited associativity: Caches typically have a small set associativity; that is, each memory location can map to only one or just a few locations in the cache. Conflict misses (when an item is discarded and later retrieved) can occur even when the application's working set is smaller than the cache, if the data are mapped to the same cache locations.

• Large line size: Data in a cache are transferred in fixed-size units called cache lines. Applications that do not use all the data in a cache line incur more misses and are said to have poor spatial locality. On a multiprocessor, large cache lines can also lead to cache misses when different processors use different parts of the same cache line. Such misses are called false sharing misses.

The compiler tries to eliminate as many cache misses as possible, then minimize the impact of any that remain by

• ensuring that processors reuse the same data as many times as possible and

• making the data accessed by each processor contiguous in the shared address space.

Techniques for addressing each of these subproblems are discussed below. Finally, to tolerate the latency of remaining cache misses, the compiler uses compiler-inserted prefetching to move data into the cache before it is needed.

Improving Processor Data Reuse

The compiler reorganizes the computation so that each processor reuses data to the greatest possible extent. This reduces the working set on each processor, thereby minimizing capacity misses. It also reduces interprocessor communication and thus minimizes true sharing misses. To achieve optimal reuse, the compiler uses affine partitioning. This technique analyzes reference patterns in the program to derive an affine mapping (linear transformation plus an offset) of the computation and of the data to the processors. The affine mappings are chosen to maximize a processor's reuse of data while maintaining sufficient parallelism to keep all processors busy. The compiler also uses loop blocking to reorder the computation executed on a single processor so that data is reused in the cache.
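Loop blocking itself is easy to picture with a small kernel. The code below is an illustrative example written for this article (not SUIF output); it restructures a matrix transpose so that each tile's reads and writes stay within a cache-sized working set:

```python
# Illustrative loop blocking (tiling) on a matrix transpose. An untiled
# transpose strides through whole rows and columns; the tiled version works on
# tile x tile sub-blocks so each block's cache lines are reused before eviction.

def transpose_blocked(a, n, tile=4):
    out = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):                   # tile origin along i
        for jj in range(0, n, tile):               # tile origin along j
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    out[j][i] = a[i][j]            # accesses stay inside one tile
    return out
```

The result is identical to an untiled transpose; only the visit order of the iteration space changes.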

Making Processor Data Contiguous

The compiler tries to arrange the data to make a processor's accesses contiguous in the shared address space. This improves spatial locality while reducing conflict misses and false sharing. SUIF can manage data placement within a single array and across multiple arrays. The data-to-processor mappings computed by the affine partitioning analysis are used to determine the data being accessed by each processor.

Figure 2 shows how the compiler's use of data permutation and data strip-mining1 can make contiguous the data within a single array that is accessed by one processor. Data permutation interchanges the dimensions of the array (for example, transposing a two-dimensional array). Data strip-mining changes an array's dimensionality so that all data accessed by the same processor are in the same plane of the array.

To make data across multiple arrays accessed by the same processor contiguous, we use a technique called compiler-directed page coloring.5 The compiler uses its knowledge of the access patterns to direct the operating system's page allocation policy to make each processor's data contiguous in the physical address space. The operating system uses these hints to determine the virtual-to-physical page mapping at page allocation time.

Experimental Results

We conducted a series of performance evaluations to demonstrate the impact of SUIF's analyses and optimizations. We obtained measurements on a Digital AlphaServer 8400 with eight 21164 processors, each with two levels of on-chip cache and a 4-Mbyte external cache. Because speedups are harder to obtain on machines with fast processors, our use of a state-of-the-art machine makes the results more meaningful and applicable to future systems.

We used two complete standard benchmark suites to evaluate our compiler.

Figure 2
Data transformations can make the data accessed by each processor contiguous in the shared address space. In the two examples above, the original arrays are two-dimensional; the axes are identified to show that elements along the first axis are contiguous. First, the affine partitioning analysis determines which data elements are accessed by the same processor (the shaded elements are accessed by the first processor). Second, data strip-mining turns the 2D array into a 3D array, with the shaded elements in the same plane. Finally, applying data permutation rotates the array, making data accessed by each processor contiguous.
We present results for the 10 programs in the SPECfp95 benchmark suite, which is commonly used for benchmarking uniprocessors. We also used the eight official benchmark programs from the NAS parallel-system benchmark suite, except for embar; here we used a slightly modified version from Applied Parallel Research.

Figure 3 shows the SPECfp95 and NAS speedups, measured on up to eight processors on a 300-MHz AlphaServer. We calculated the speedups over the best sequential execution time from either officially reported results or our own measurements. Note that mgrid and applu appear in both benchmark suites (the program source and data set sizes differ slightly).

To measure the effects of the different compiler techniques, we broke down the performance obtained on eight processors into three components. In Figure 4, baseline shows the speedup obtained with parallelization using only intraprocedural data dependence analysis, scalar privatization, and scalar reduction transformations. Coarse grain includes the baseline techniques as well as techniques for locating coarse-grain parallel loops (for example, array privatization and reduction transformations) and full interprocedural analysis of both scalar and array variables. Memory includes the coarse-grain techniques as well as the multiprocessor memory optimizations we described earlier.

Figure 3 shows that of the 18 programs, 13 show good parallel speedup and can thus take advantage of additional processors. SUIF's coarse-grain techniques and memory optimizations significantly affect the performance of half the programs. The swim and tomcatv programs show superlinear speedups because the compiler eliminates almost all cache misses and their 14-Mbyte working sets fit into the multiprocessor's aggregate cache.

For most of the programs that did not speed up, the compiler found much of their computation to be parallelizable, but the granularity is too fine to yield good multiprocessor performance on machines with fast processors. Only two applications, fpppp and buk, have no statically analyzable loop-level parallelism, so they are not amenable to our techniques.
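One of the reduction transformations mentioned above can be pictured with a small example. This is illustrative code written for this article, not compiler output: a sequential accumulation becomes independent per-worker partial sums plus a combining step, which is legal because addition is associative and commutative.

```python
# Illustrative parallel reduction: the sequential accumulation sum(data) is
# rewritten as independent partial sums over chunks (one per worker) plus a
# final combining step.

from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, nworkers=4):
    chunk = (len(data) + nworkers - 1) // nworkers
    pieces = [data[k:k + chunk] for k in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        partials = list(pool.map(sum, pieces))   # independent partial reductions
    return sum(partials)                         # combine the partial results
```

A compiler applies the same idea to loops over scalars and, with the array analyses described earlier, to array and indirectly accessed locations.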

(a) SPECfp95 (b) NAS Parallel Benchmarks

Figure 3 SUIF compiler speedups over the best sequential time achieved on the (a) SPECfp95 and (b) NAS parallel benchmarks.

Figure 4
The speedup achieved on eight processors is broken down into three components to show how SUIF's memory optimization and discovery of coarse-grain parallelism affected performance.
Table 1 shows the times and SPEC ratios obtained on an eight-processor, 440-MHz Digital AlphaServer 8400, testifying to our compiler's high absolute performance. The SPEC ratios compare machine performance with that of a reference machine. (These are not official SPEC ratings, which among other things require that the software be generally available. The ratios we obtained are nevertheless valid in assessing our compiler's performance.) The geometric mean of the SPEC ratios improves over the uniprocessor execution by a factor of 3 with four processors and by a factor of 4.3 with eight processors. Our eight-processor ratio of 63.9 represents a 50 percent improvement over the highest number reported to date.12

Table 1
Absolute Performance for the SPECfp95 Benchmarks Measured on a 440-MHz Digital AlphaServer Using One Processor, Four Processors, and Eight Processors

                    Execution Time (secs)        SPEC Ratio
Benchmark           1P       4P       8P         1P      4P      8P
tomcatv             219.1    30.3     18.5       16.9    122.1   200.0
swim                297.9    33.5     17.2       28.9    256.7   500.0
su2cor              155.0    44.9     31.0       9.0     31.2    45.2
hydro2d             249.4    61.1     40.7       9.6     39.3    59.0
mgrid               185.3    42.0     27.0       13.5    59.5    92.6
applu               296.1    85.5     39.5       7.4     25.7    55.7
turb3d              267.7    73.6     43.5       15.3    55.7    94.3
apsi                137.5    141.2    143.2      15.3    14.9    14.7
fpppp               331.6    331.6    331.6      29.0    29.0    29.0
wave5               151.8    141.9    147.4      19.8    21.1    20.4
Geometric Mean                                   15.0    44.4    63.9

Acknowledgments

This research was supported in part by the Air Force Materiel Command and ARPA contracts F30602-95-C-0098, DABT63-95-C-0118, and DABT63-94-C-0054; a Digital Equipment Corporation grant; an NSF Young Investigator Award; an NSF CISE postdoctoral fellowship; and fellowships from AT&T Bell Laboratories, DEC Western Research Laboratory, Intel Corp., and the National Science Foundation.

References

1. J.M. Anderson, S.P. Amarasinghe, and M.S. Lam, "Data and Computation Transformations for Multiprocessors," Proc. Fifth ACM SIGPlan Symp. Principles and Practice of Parallel Programming, ACM Press, New York, 1995, pp. 166-178.

2. J.M. Anderson and M.S. Lam, "Global Optimizations for Parallelism and Locality on Scalable Parallel Machines," Proc. SIGPlan '93 Conf. Programming Language Design and Implementation, ACM Press, New York, 1993, pp. 112-125.

3. P. Banerjee et al., "The Paradigm Compiler for Distributed-Memory Multicomputers," Computer, Oct. 1995, pp. 37-47.

4. W. Blume et al., "Effective Automatic Parallelization with Polaris," Int'l J. Parallel Programming, May 1995.

5. E. Bugnion et al., "Compiler-Directed Page Coloring for Multiprocessors," Proc. Seventh Int'l Conf. Architectural Support for Programming Languages and Operating Systems, ACM Press, New York, 1996, pp. 244-257.

6. K. Cooper et al., "The ParaScope Parallel Programming Environment," Proc. IEEE, Feb. 1993, pp. 244-263.

7. Standard Performance Evaluation Corp., "Digital Equipment Corporation AlphaServer 8400 5/440 SPEC CFP95 Results," SPEC Newsletter, Oct. 1996.

8. M. Haghighat and C. Polychronopolous, "Symbolic Analysis for Parallelizing Compilers," ACM Trans. Programming Languages and Systems, July 1996, pp. 477-518.

9. M.W. Hall et al., "Detecting Coarse-Grain Parallelism Using an Interprocedural Parallelizing Compiler," Proc. Supercomputing '95, IEEE CS Press, Los Alamitos, Calif., 1995 (CD-ROM only).

10. P. Havlak, Interprocedural Symbolic Analysis, PhD thesis, Dept. of Computer Science, Rice Univ., May 1994.

11. F. Irigoin, P. Jouvelot, and R. Triolet, "Semantical Interprocedural Parallelization: An Overview of the PIPS Project," Proc. 1991 ACM Int'l Conf. Supercomputing, ACM Press, New York, 1991, pp. 244-251.

12. K. Kennedy and U. Kremer, "Automatic Data Layout for High Performance Fortran," Proc. Supercomputing '95, IEEE CS Press, Los Alamitos, Calif., 1995 (CD-ROM only).

Editors' Note: With the following section, the authors provide an update on the status of the SUIF compiler since the publication of their paper in Computer in December 1996.

Addendum: The Status and Future of SUIF

Public Availability of SUIF-parallelized Benchmarks

The SUIF-parallelized versions of the SPECfp95 benchmarks used for the experiments described in this paper have been released to the SPEC committee and are available to any license holders of SPEC (see http://www.specbench.org/osg/cpu95/par-research). This benchmark distribution contains the SUIF output (C and FORTRAN code), along with the source code for the accompanying run-time libraries. We expect these benchmarks will be useful for two purposes: (1) for technology transfer, providing insight into how the compiler transforms the applications to yield the reported results; and (2) for further experimentation, such as in architecture-simulation studies.

The SUIF compiler system itself is available from the SUIF web site at http://www-suif.stanford.edu. This system includes only the standard parallelization analyses that were used to obtain our baseline results.

New Parallelization Analyses in SUIF

Overall, the results of automatic parallelization reported in this paper are impressive; however, a few applications either do not speed up at all or achieve limited speedup at best. The question arises as to whether SUIF is exploiting all the available parallelism in these applications. Recently, an experiment to answer this question was performed in which loops left unparallelized by SUIF were instrumented with run-time tests to determine whether opportunities for increasing the effectiveness of automatic parallelization remained in these programs.1 Run-time testing determined that eight of the programs from the NAS and SPEC95fp benchmarks had additional parallel loops, for a total of 69 additional parallelizable loops, which is less than 5% of the total number of loops in these programs. Of these 69 loops, the remaining parallelism had a significant effect on coverage (the percentage of the program that is parallelizable) or granularity (the size of the parallel regions) in only four of the programs: apsi, su2cor, wave5, and fftpde.

We found that almost all the significant loops in these four programs could potentially be parallelized using a new approach that associates predicates with array data-flow values.2 Instead of producing conservative results that hold for all control-flow paths and all possible program inputs, predicated array data-flow analysis can derive optimistic results guarded by predicates. Predicated array data-flow analysis can lead to more effective automatic parallelization in three ways: (1) It improves compile-time analysis by ruling out infeasible control-flow paths. (2) It provides a framework for the compiler to introduce predicates that, if proven true, would guarantee safety for desirable data-flow values. (3) It enables the compiler to derive low-cost run-time parallelization tests based on the predicates associated with desirable data-flow values.

SUIF and Compaq's GEM Compiler

The GEM compiler system is the technology Compaq has been using to build compiler products for a variety of languages and hardware/software platforms.3 Within Compaq, work has been done to connect SUIF with the GEM compiler. SUIF's intermediate representation was converted into GEM's intermediate representation, so that SUIF code can be passed directly to GEM's optimizing back end. This eliminates the loss of information suffered when SUIF code is translated to C/FORTRAN source before it is passed to GEM. It also enables us to generate more efficient code for Alpha-microprocessor systems.

SUIF and the National Compiler Infrastructure

The SUIF compiler system was recently chosen to be part of the National Compiler Infrastructure (NCI) project funded by the Defense Advanced Research Projects Agency (DARPA) and the National Science Foundation (NSF). The goal of the project is to develop a common compiler platform for researchers and to facilitate technology transfer to industry. The SUIF component of the NCI project is the result of the collaboration among researchers in five universities (Harvard University, Massachusetts Institute of Technology, Rice University, Stanford University, University of California at Santa Barbara) and one industrial partner, Portland Group Inc. Compaq is a corporate sponsor of the project and is providing the FORTRAN front end.

A revised version of the SUIF infrastructure (SUIF 2.0) is being released as part of the SUIF NCI project (a preliminary version of SUIF 2.0 is available at the SUIF web site). The completed system will be enhanced to support parallelization, interprocedural analysis, memory hierarchy optimizations, object-oriented programming, scalar optimizations, and machine-dependent optimizations. An overview of the SUIF NCI system is shown in Figure A1. See www-suif.stanford.edu/suif/NCI/suif.html for more information about SUIF and the NCI project, including a complete list of optimizations and a schedule.

References

1. B. So, S. Moon, and M. Hall, "Measuring the Effectiveness of Automatic Parallelization in SUIF," Proceedings of the International Conference on Supercomputing '98, July 1998.

2. S. Moon, M. Hall, and B. Murphy, "Predicated Array Data-Flow Analysis for Run-Time Parallelization," Proceedings of the International Conference on Supercomputing '98, July 1998.

3. D. Blickstein et al., "The GEM Optimizing Compiler System," Digital Technical Journal, vol. 4, no. 4 (Special Issue, 1992): 121-136.
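The low-cost run-time tests described in the addendum can be pictured with a toy C example of our own (the loop, the predicate, and the names are invented; real predicated array data-flow analysis derives such guards automatically). The loop a[i] += a[i+m] has independent iterations only when the predicate m >= n rules out overlap between the elements read and the elements written, so the generated code can guard an order-independent version with that test and otherwise fall back to the serial loop.

```c
#include <assert.h>

/* Serial version: always correct regardless of m. */
static void update_serial(double *a, int n, int m) {
    for (int i = 0; i < n; i++)
        a[i] += a[i + m];
}

/* Order-independent version: legal only when the read range
   a[m..m+n-1] cannot overlap the written range a[0..n-1], i.e. m >= n.
   Independence is modeled here by running the iterations in reverse. */
static void update_independent(double *a, int n, int m) {
    for (int i = n - 1; i >= 0; i--)
        a[i] += a[i + m];
}

/* Dispatch guarded by the run-time predicate the analysis derived. */
static void update(double *a, int n, int m) {
    if (m >= n)
        update_independent(a, n, m);   /* predicate holds: parallelizable */
    else
        update_serial(a, n, m);        /* predicate fails: fall back */
}
```

The predicate costs one comparison at run time, which is the point of deriving it: the expensive dependence reasoning happens once, at compile time.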

[Figure A1 shows the SUIF NCI compiler pipeline: C/C++ (IBM) and FORTRAN front ends; interprocedural analysis, parallelization, locality optimizations, object-oriented optimizations, and scalar optimizations; scheduling and register allocation; and Alpha, x86, and C/FORTRAN target languages.]

Figure A1
The SUIF Compiler Infrastructure

Biographies

Mary W. Hall
Mary Hall is jointly a research assistant professor and project leader at the University of Southern California, Department of Computer Science, and at USC's Information Sciences Institute, where she has been since 1996. Her research interests focus on compiler support for high-performance computing, particularly interprocedural analysis and automatic parallelization. She graduated magna cum laude with a B.A. in computer science and mathematical sciences in 1985 and received an M.S. and a Ph.D. in computer science in 1989 and 1991, respectively, all from Rice University. Prior to joining USC/ISI, she was a visiting assistant professor and senior research fellow in the Department of Computer Science at Caltech. In earlier positions, she was a research scientist at Stanford University, working with the SUIF Compiler group, and in the Center for Research on Parallel Computation at Rice University.

Brian R. Murphy
A doctoral candidate in computer science at Stanford University, Brian Murphy is currently working on advanced program analysis under SUIF as part of the National Compiler Infrastructure Project. He received a B.S. in computer science and engineering and an M.S. in electrical engineering and computer science from the Massachusetts Institute of Technology. His master's thesis work on program analysis was carried out with the Functional Languages group at the IBM Almaden Research Center. Brian was elected to the Tau Beta Pi and Eta Kappa Nu honor societies.

Shih-Wei Liao
Shih-Wei Liao is a doctoral candidate at the Stanford University Computer Systems Laboratory. His research interests include compiler algorithms and design, programming environments, and computer architectures. He received a B.S. in computer science from National Taiwan University in 1991 and an M.S. in electrical engineering from Stanford University in 1994.

Jennifer M. Anderson
Jennifer Anderson is a research staff member at Compaq's Western Research Laboratory, where she has worked on the Digital Continuous Profiling Infrastructure (DCPI) project. Her research interests include compiler algorithms, programming languages and environments, profiling systems, and parallel and distributed systems software. She earned a B.S. in information and computer science from the University of California at Irvine and received M.S. and Ph.D. degrees in computer science from Stanford University.

Edouard Bugnion
Ed Bugnion holds a Diplom in engineering from the Swiss Federal Institute of Technology (ETH), Zurich (1994) and an M.S. from Stanford University (1996), where he is a doctoral candidate in computer science. His research interests include operating systems, computer architecture, and machine simulation. From 1996 to 1997, Ed was also a research consultant to Compaq's Western Research Laboratory. He is the recipient of a National Science Foundation Graduate Research Fellowship.

Saman P. Amarasinghe
Saman Amarasinghe is an assistant professor of computer science and engineering at the Massachusetts Institute of Technology and a member of the Laboratory for Computer Science. His research interests include compilers and computer architecture. He received a B.S. in electrical engineering and computer science from Cornell University and M.S. and Ph.D. degrees in electrical engineering from Stanford University.

Monica S. Lam
Monica Lam is an associate professor in the Computer Science Department at Stanford University. She leads the SUIF project, which is aimed at developing a common infrastructure to support research in compilers for advanced languages and architectures. Her research interests are compilers and computer architecture. Monica earned a B.S. from the University of British Columbia in 1980 and a Ph.D. in computer science from Carnegie Mellon University in 1987. She received the National Science Foundation Young Investigator award in 1992.


Ronald F. Brender
Jeffrey E. Nelson
Mark E. Arsenault

Debugging Optimized Code: Concepts and Implementation on DIGITAL Alpha Systems

Effective user debugging of optimized code has been a topic of theoretical and practical interest in the software development community for almost two decades, yet today the state of the art is still highly uneven. We present a brief survey of the literature and current practice that leads to the identification of three aspects of debugging optimized code that seem to be critical as well as tractable without extraordinary efforts. These aspects are (1) split lifetime support for variables whose allocation varies within a program combined with definition point reporting for currency determination, (2) stepping and setting breakpoints based on a semantic event characterization of program behavior, and (3) treatment of inlined routine calls in a manner that makes inlining largely transparent. We describe the realization of these capabilities as part of Compaq's GEM back-end compiler technology and the debugging component of the OpenVMS Alpha operating system.

Introduction

In software development, it is common practice to debug a program that has been compiled with little or no optimization applied. The generated code closely corresponds to the source and is readily described by a simple and straightforward debugging symbol table. A debugger can interpret and control execution of the code in a fashion close to the user's source-level view of the program.

Sometimes, however, developers find it necessary or desirable to debug an optimized version of the program. For instance, a bug (whether a compiler bug or incorrect source code) may only reveal itself when optimization is applied. In other cases, the resource constraints may not allow the unoptimized form to be used because the code is too big and/or too slow. Or, the developer may need to start analysis using the remains, such as a core file, of the failed program, whether or not this code has been optimized. Whatever the reason, debugging optimized code is harder, much harder, than debugging unoptimized code, because optimization can greatly complicate the relationship between the source program and the generated code.

Zellweger1 introduced the terms expected behavior and truthful behavior when referring to debugging optimized code. A debugger provides expected behavior if it provides the behavior a user would experience when debugging an unoptimized version of a program. Since achieving that behavior is often not possible, a secondary goal is to provide at least truthful behavior, that is, to never lie to or mislead a user. In our experience, even truthful behavior can be challenging to achieve, but it can be closely approached.

This paper describes three improvements made to Compaq's GEM back-end compiler system and to OpenVMS DEBUG, the debugging component of the OpenVMS Alpha operating system. These improvements address

1. Split lifetime variables and currency determination

2. Semantic events

3. Inlining

Before presenting the details of this work, we discuss the alternative approaches to debugging optimized code that we considered, the state of the art, and the operating strategies we adopted.

Alternative Approaches

Various approaches have been explored to improve the ability to debug optimized code. They include the following:

• Enhance debugger analysis
• Limit optimization
• Limit debugging to preplanned locations
• Dynamically deoptimize as needed
• Exploit an associated program database

We touch on these approaches in turn.

In probably the oldest theoretical analysis that supports debugging optimized code, Hennessy2 studies whether the value displayed for a variable is current, that is, the expected value for that variable at a given point in the program. The value displayed might not be current because, for example, assignment of a later value has been moved forward or the relevant assignment has been delayed or omitted. Hennessy postulates that a flow graph description of a program is communicated to the debugger, which then solves certain flow analysis equations in response to debug commands to determine currency as needed. Copperman3 takes a similar though much more general approach. Conversely, commercial implementations have favored more complete preprocessing of information in the compiler to enable simpler debugger mechanisms.4,5

If optimization is the "problem," then one approach to solving the problem is to limit optimization to only those kinds that are actually supported in an available debugger. Zurawski7 develops the notion of a recovery function that matches each kind of optimization. As an optimization is applied during compilation, the compensating recovery function is also created and made available for later use by a debugger. If such a recovery function cannot be created, then the optimization is omitted. Unfortunately, code-motion-related optimizations generally lack recovery functions and so must be foregone. Taking this approach to the extreme converges with traditional practice, which is simply to disable all optimization and debug a completely unoptimized program.

If full debugger functionality need only be provided at some locations, then some debugger capabilities can be provided more easily. Zurawski7 also employed this idea to make it easier to construct appropriate recovery functions. This approach builds on a language-dependent concept of inspection points, which generally must include all call sites and may correspond to most statement boundaries. His experience suggests, however, that even limiting inspection points to statement boundaries severely limits almost all kinds of optimization.

Holzle et al.8 describe techniques to dynamically deoptimize part of a program (replace optimized code with its unoptimized equivalent) during debugging to enable a debugger to perform requested actions. They make the technique more tractable, in part by delaying asynchronous events to well-defined interruption points, generally backward branches and calls. Optimization between interruption points is unrestricted. However, even this choice of interruption points severely limits most code motion and many other global optimizations.

Pollock and others9,10 use a different kind of deoptimization, which might be called preplanned, incremental deoptimization. During a debugging session, any debugging requests that cannot be honored because of optimization effects are remembered so that a subsequent compilation can create an executable that can honor these requests. This scheme is supported by an incremental optimizer that uses a program database to provide rapid and smooth forward information flow to subsequent debugging sessions.

Feiler11 uses a program database to achieve the benefits of interactive debugging while applying as much static compilation technology as possible. He describes techniques for maintaining consistency between the primary tree-based representation and a derivative compiled form of the program in the face of both debugging actions and program modifications on-the-fly. While he appears to demonstrate that more is possible than might be expected, substantial limitations still exist on debugging capability, optimization, or both.

A comprehensive introduction and overview to these and other approaches can be found in Copperman3 and Adl-Tabatabai.12 In addition, "An Annotated Bibliography on Debugging Optimized Code" is available separately on the Digital Technical Journal web site at http://www.digital.com/info/DTJ. This bibliography cites and summarizes the entire literature on debugging optimized code as best we know it.

State of the Art

When we began our work in early 1994, we assessed the level of support for debugging optimized code that was available with competitive compilers. Because we have not updated this assessment, it is not appropriate for us to report the results here in detail. We do, however, summarize the methodology used and the main results, which we believe remain generally valid.

We created a series of example programs that provide opportunities for optimization of a particular kind

or of related kinds, and which could lead a traditional debugger to deviate from expected behavior. We compiled and executed these programs under the control of each system's debugger and recorded how the system handled the various kinds of optimization. The range of observed behaviors was diverse.

At one extreme were compilers that automatically disable all optimization if a debugging symbol table is requested (or, equivalently for our purposes, give an error if both optimization and a debugging symbol table are requested). For these compilers, the whole exercise becomes moot; that is, attempting to debug optimized code is not allowed.

Some compiler/debugger combinations appeared to usefully support some of our test cases, although none handled all of them correctly. In particular, none seemed able to show a traceback of subroutine calls that compensated for inlining of routine calls, and all seemed to produce a lot of jitter when stepping by line on systems where code is highly scheduled.

The worst example that we found allowed compilation using optimization but produced a debugging symbol table that did not reflect the results of that optimization. For example, local variables were described as allocated on the stack even though the generated code clearly used registers for these variables and never accessed any stack locations. At debug time, a request to examine such a variable resulted in the display of the irrelevant and never-accessed stack locations.

The bottom line from this analysis was very clear: the state of the art for support of debugging optimized code was generally quite poor. DIGITAL's debuggers, including OpenVMS DEBUG, were not unusual in this regard. The analysis did indicate some good examples, though. Both the CONVEX CXdb4,5 and the HP 9000 DOC6 systems provide many valuable capabilities.

Biases and Goals

Early in our work, we adopted the following strategies:

• Do not limit or compromise optimization in any way.
• Stay within the framework of the traditional edit-compile-link-debug cycle.
• Keep the burden of analysis within the compiler.

The prime directive for Compaq's GEM-based compilers is to achieve the highest possible performance from the Alpha architecture and chip technology. Any improvements in debugging such optimized code should be useful in the face of the best that a compiler has to offer. Conversely, if a programmer has the luxury of preparing a less optimized version for debugging purposes, there is little or no reason for that version to be anything other than completely unoptimized. There seems to be no particular benefit to creating a special intermediate level of combined debugger/optimization support.

Pragmatically, we did not have the time or staffing to develop a new optimization framework, for example, based on some kind of program database. Nor were we interested in intruding into those parts of the GEM compiler that performed optimization to create more complicated options and variations, which might be needed for dynamic deoptimization or recovery function creation.

Finally, it seemed sensible to perform most analysis activities within the compiler, where the most complete information about the program is already available. It is conceivable that passing additional information from the compiler to the debugger using the object file debugging symbol table might eventually tip the balance toward performing more analysis in the debugger proper. The available size data (presented later in this paper in Table 3) do not indicate this.

We identified three areas in which we felt enhanced capabilities would significantly improve support for debugging optimized code. These areas are

1. The handling of split lifetime variables and currency determination
2. The process of stepping through the program
3. The handling of procedure inlining

In the following sections we present the capabilities we developed in each of these areas together with insight into the implementation techniques employed. First, we review the GEM and OpenVMS DEBUG framework in which we worked. The next three sections address the new capabilities in turn. The last major section explores the resource costs (compile-time size and performance, and object and image sizes) needed to realize these capabilities.

Starting Framework

Compaq's GEM compiler system and the OpenVMS DEBUG component of the OpenVMS operating system provide the framework for our work. A brief description of each follows.

GEM

The GEM compiler system13 is the technology Compaq is using to build state-of-the-art compiler products for a variety of languages and hardware and software platforms. The GEM system supports a range of languages (C, C++, FORTRAN including HPF, Pascal, Ada, COBOL, BLISS, and others) and has been successfully retargeted and rehosted for the Alpha, MIPS, and Intel IA-32 architectures and for the

OpenVMS, DIGITAL UNIX, and Windows NT operating systems.

The major components of a GEM compiler are the front end, the optimizer, the code generator, the final code stream optimizer, and the compiler shell.

• The front end performs lexical analysis and parsing of the source program. The primary outputs are intermediate language (IL) graphs and symbol tables. Front ends for all source languages translate to the same common representation.
• The optimizer transforms the IL generated by the front end into a semantically equivalent form that will execute faster on the target machine. A significant technical achievement is that a single optimizer is used for all languages and target platforms.
• The code generator translates the IL into a list of code cells, each of which represents one machine instruction for the target hardware. Virtually all the target machine instruction-specific code is encapsulated in the code generator.
• The final phase performs pattern-based peephole optimizations followed by instruction scheduling.
• The shell is a portable interface to the external environment in which the compiler is used. It provides common compiler functions such as listing generators, object file emitters, and command line processors in a form that allows the other components to remain independent of the operating system.

The bulk of the GEM implementation work described in this paper occurs at the boundary between the final phase and the object file output portion of the shell. A new debugging optimized code analysis phase examines the generated code stream representation of the program, together with the compiler symbol table, to extract the information necessary to pass on to a debugger through the debugging symbol table. Most of the implementation is readily adapted to different target architectures by means of the same instruction property tables that are used in the code generator and final optimizer.

OpenVMS DEBUG

The OpenVMS Alpha debugger, originally developed for the OpenVMS VAX system,14 is a full-function, source-level, symbolic debugger. It supports symbolic debugging of programs written in BLISS, MACRO-32, MACRO-64, FORTRAN, Ada, C, C++, Pascal, PL/I, BASIC, and COBOL. The debugger allows the user to control the execution and to examine the state of a program. Users can

• Set breakpoints to stop at certain points in the program
• Step through the execution of the program a line at a time
• Display the source-level view of the program's execution using either a graphical user interface or a character-based user interface
• Examine user variables and hardware registers
• Display a stack traceback showing the current call stack
• Set watch points
• Perform many other functions15

Split Lifetime Variables and Currency Determination

Displaying (printing) the value of a program variable is one of the most basic services that a debugger can provide. For unoptimized code and traditional debuggers, the mechanisms for doing this are generally based on several assumptions.

1. A variable has a single allocation that remains fixed throughout its lifetime. For a local or a stack-allocated variable, that means throughout the lifetime of the scope in which the variable is declared.
2. Definitions and uses of the values of user variables occur in the same order in the generated code as they do in the original program source.
3. The set of instructions that belong to a given scope (which may be a routine body) can be described by a single contiguous range of addresses.

The first and second assumptions are of interest in this discussion because many GEM optimizations make them inappropriate. Split lifetime optimization (discussed later in this section) leads to violation of the first assumption. Code motion optimization leads to violation of the second assumption and thereby creates the so-called currency problem. We treat both of these problems together, and we refer to them collectively as split lifetime support. Statement and instruction scheduling optimization leads to violation of the third assumption. This topic is addressed later, in the section Inlining.

Split Lifetime Variable Definition

A variable is said to have split lifetimes if the set of fetches and stores of the variable can be partitioned such that none of the values stored in one subset are ever fetched in another subset. When such a partition exists, the variable can be "split" into several independent "child" variables, each corresponding to a partition. As independent variables, the child variables can be allocated independently. The effect is that the original variable can be thought to reside in different locations at different points in time: sometimes in a register, sometimes in memory, and sometimes nowhere at all. Indeed, it is even possible for the different child variables to be active simultaneously.
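Supporting split lifetime variables therefore means dropping assumption 1: instead of one fixed address, the debugging symbol table must describe, per range of code addresses, where each child lifetime resides. The following C sketch is our own minimal model of that idea and not the actual GEM or OpenVMS DEBUG symbol table format:

```c
#include <assert.h>
#include <stddef.h>

enum loc_kind { LOC_REGISTER, LOC_MEMORY };

/* One child lifetime: over code addresses [lo, hi) the variable's value
   lives in the given register or at the given stack offset. */
struct lifetime {
    unsigned lo, hi;        /* half-open range of code addresses */
    enum loc_kind kind;
    int where;              /* register number or stack offset */
};

/* Find where (if anywhere) the variable resides at the given PC. */
static const struct lifetime *
locate(const struct lifetime *lt, size_t n, unsigned pc) {
    for (size_t i = 0; i < n; i++)
        if (pc >= lt[i].lo && pc < lt[i].hi)
            return &lt[i];
    return NULL;            /* value exists nowhere at this PC */
}
```

A variable split as in the text might carry two entries, say one lifetime in register 3 and a later one at stack offset -16; between lifetimes, locate returns NULL, which is the truthful answer that no value is available there.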

84  Digital Technical Journal  Vol. 10 No. 1  1998

Split Lifetime Example
A simple example of a split lifetime variable can be seen in the following straight-line code fragment:

    A = ...;        ! Define (assign value to) A
    B = ...A...;    ! Use definition (value of) A
    A = ...;        ! Define A again
    C = ...A...;    ! Use later definition of A

In this example, the first value assigned to variable A is used later in the assignment to variable B and then never used again. A new value is assigned to A and used in the assignment to variable C. Without changing the meaning of this fragment, we can rewrite the code as

    A1 = ...;       ! Define A1
    B = ...A1...;   ! Use A1
    A2 = ...;       ! Define A2
    C = ...A2...;   ! Use A2

where variables A1 and A2 are split child variables of A. Because A1 and A2 are independent, the following is also an equivalent fragment:

    A1 = ...;       ! Define A1
    A2 = ...;       ! Define A2
    B = ...A1...;   ! Use A1
    C = ...A2...;   ! Use A2

Here, we see that the value of A2 is assigned while the value of A1 is still alive. That is, the split children of a single variable have overlapping lifetimes.

This example illustrates that split lifetime optimization is possible even in simple straight-line code. Moreover, other optimizations can create opportunities for split lifetime optimization that may not be apparent from casual examination of the original source. In particular, loop unrolling (in which the body of a loop is replicated several times in a row) can create loop bodies for which split lifetime optimization is feasible and desirable.

Variables of Interest
Our implementation deals only with scalar variables and parameters. This includes Alpha's extended precision floating-point (128-bit X_Floating) variables as well as variables of any of the complex types (see Sites). These latter variables are referred to as two-part variables because each requires two registers to hold its value.

Currency Definition
The value of a variable in an optimized program is current with respect to a given position in the source program if the variable holds the value that would be expected in an unoptimized version of the program. Several kinds of optimization can lead to noncurrent variables. Consider the currency example in Figure 1.

As shown in Figure 1, the optimizing compiler has chosen to change the order of operations so that line 4 is executed prior to line 3. Now suppose that execution has stopped at the instruction in line 3 of the unoptimized code, the line that assigns a value to variable C. Given a request to display (print) the value of A, a traditional debugger will display whatever value happens to be contained in the location of A, which here, in the optimized code, happens to be the result of the second assignment to A. This displayed value of A is a correct value, but it is not the expected value that should be displayed at line 3. This scenario might easily mislead a user into a frustrating and fruitless attempt to determine how the assignment in line 1 is computing and assigning the wrong value. The problem occurs because the compiler has moved the second assignment so that it is early relative to line 3.

Another currency example can be seen in the fragment (taken from Copperman) that appears in Figure 2. In this case, the optimizing compiler has chosen to omit the second assignment to variable A and to assign that value directly into the actual parameter location used for the call of routine FOO. Suppose that the debugger is stopped at the call of routine FOO. Given a request to display A, a traditional debugger is likely to display the result of the first assignment to A. Again, this value is an actual value of A, but it is not the expected value.

Alternatively, it is possible that, prior to reaching the call, the optimizing compiler has decided to reuse the

Line  Unoptimized                                Optimized
 1    A = ...;      ! Define A                    A = ...;
 2    B = ...A...;  ! Use A                       B = ...A...;
 3    C = ...;      ! C does not depend on A      A = ...;
 4    A = ...;      ! Define A again              C = ...;
 5    D = ...A...;  ! Use second definition of A  D = ...A...;

Figure 1 Currency Example 1

Line  Unoptimized                                Optimized
 1    A = expression1;                            A = expression1;
 2    B = ...A...;  ! Use 1st def. of A           B = ...A...;
 3    A = expression2;
 4    FOO(A);       ! Use 2nd def. of A           FOO(expression2);

Figure 2 Currency Example 2

location that originally held the first value of A for another purpose. In this case, no value of A is available to display at the call of routine FOO.

Finally, consider the example shown in Figure 3, which illustrates that the currency of a variable is not a property that is invariant over time. Suppose that execution is stopped at line 5, inside the loop. In this case, A is not current during the first time through the loop body because the actual value comes from line 3 (moved from inside the loop); it should come from line 1. On subsequent times through the loop, the value from line 3 is the expected value, and the value of A is current.

As discussed earlier, most approaches to currency determination involve making certain kinds of flow graph and compiler optimization information available to the debugger so that it can report when a displayed value is not current. However, we wanted to avoid adding major new kinds of analysis capability to DIGITAL's debuggers.

More fundamentally, as the degree of optimization increases, the notion of current position in the program itself becomes increasingly ambiguous. Even when the particular instruction at which execution is pending can be clearly and unequivocally related to a particular source location, this location is not automatically the best one to use for currency determination. Nevertheless, the source location (or set of locations) where a displayed value was assigned can be reliably reported without needing to establish the current position.

Accordingly, we use an approach different from those considered in the literature. We use a straightforward flow analysis formulation to determine what locations hold values of user variables at any given point in the program and combine this with the set of definition locations that provide those values. Because there may be more than one source location, the user is given the basic information to determine where in the source the value of a variable may have originated. Consequently, the user can determine whether the value displayed is appropriate for his or her purpose.

Compiler Processing
A compiler performs most split lifetime analysis on a routine-by-routine basis. A preliminary walk over the entire symbol table identifies the variable symbols that are of interest for further analysis. Then, for each routine, the compiler performs the following steps:

• Code cell prepass
• Flow graph construction
• Basic block processing
• Parameter processing
• Backward propagation
• Forward propagation
• Information promotion and cleanup

After the compiler completes this processing for all routines, a symbol table postwalk performs final cleanup tasks. The following contains a brief discussion of these steps.

In this summary, we highlight only the main characteristics of general interest. In particular, we assume that each location, such as a register, is independent of all other locations. This assumption is not appropriate for locations on the stack because variables of different sizes

Line  Unoptimized                                Optimized
 1    A = ...;                                    A = ...;
 2    ...A...;                                    ...A...;
 3                                                A = ...;
 4    while (...) {                               while (...) {
 5        ...                                         ...
 6        A = ...;  ! A is loop invariant
 7    }                                           }

Figure 3 Currency Example 3

may overlay each other. The complexity of dealing with overlapping allocations is beyond the scope of this paper.

Of special importance in this processing is the fact that each operand of every instruction includes a base symbol field that refers to the compiler's symbol table entry for the entity that is involved.

Symbol Table Prewalk
The symbol table prewalk identifies the variables of interest for analysis. As discussed, we are interested in scalars corresponding to user variables (not compiler-created temporaries), including Alpha's extended precision floating-point (128-bit X_Floating) and complex values.

DIGITAL's FORTRAN implementations pass parameters using a by-reference mechanism with bind (rather than copy-in/copy-out) semantics. GEM treats the hidden reference value as a variable that is subject to split lifetime optimization. Since the reference variable must be available to effect operations on the logical parameter variable, it follows that both the abstract parameter and its reference value must be treated as interesting variables.

Code Cell Prepass
The code cell prepass performs a single walk over all code cells to determine

• The maximum and minimum offsets in the stack frame that hold any interesting variables
• The highest numbered register that is actually referenced by the code
• Whether the stack frame uses a frame pointer that is separate from the stack pointer

The compiler uses these characteristics to preallocate various working storage areas.

Flow Graph Construction
A flow graph is built, in which each basic block is a node of the graph.

Basic Block Processing
Basic block processing performs a kind of symbolic execution of the instructions of each block, keeping track of the effect on machine state as execution progresses.

When an instruction operand writes to a location with a base symbol that indicates an interesting variable, the compiler updates the location description to indicate that the variable is now known to reside in that location; this begins a lifetime segment. The instruction that assigned the value is also recorded with the lifetime segment. If there was previously a known variable in that location, that lifetime segment is ended (even if it was for the same variable). The beginning and ending instructions for that segment are then recorded with the variable in the symbol table.

When an instruction reads an operand with a base symbol that indicates an interesting variable, some more unusual processing applies.

If the variable being read is already known to occupy that location, then no further processing is required. This is the most common case.

If the location already contains some other known variable, then the variable being read is added to the set of variables for that location. This situation can arise when there is an assignment of one variable to another and the register allocator arranges to allocate them both to the same location. As a result, the assignment happens implicitly.

If the location does not contain a known variable but there is a write operation to that location earlier in the same block (a fact that is available from the location description), the prior write is retroactively treated as though it did write that variable at the earlier instruction. This situation can arise when the result of a function call is assigned to a variable and the register allocator arranges to allocate that variable in the register where the call returns its value. The code cell representation for the call contains nothing that indicates a write to the variable; all that is known is that the return value location is written as a result of the call. Only when a later code cell indicates that it is using the value of a known variable from that location can we infer more of what actually happened.

If the location does not contain a known variable and there is no write to that same location earlier in this same basic block, then the defining instruction cannot be immediately determined. A location description is created for the beginning of the basic block indicating that the given variable or set of variables must have been defined in some predecessor block. Of course, the contents known as a result of the read operation can also propagate forward toward the end of the block, just as for any other read or write operation.

Special care is needed to deal with a two-part variable. Such a variable does not become defined until both instructions that assign the value have been encountered. Similarly, any reuse of either of the two locations ends the lifetime segment of the variable as a whole.

At the end of basic block processing, location descriptions specify what is known about the contents of each location as a result of read and write operations that occurred in the block. This description indicates the set of variables that occupy the location, or that the location was last written by some value that is not the value of a user variable, or that the location does not change during execution of the block.

Parameter Processing
The compiler models parameters as locations that are defined with the contents of a known variable at the entry point of a routine.

Backward Propagation
Backward propagation iterates over the flow graph and uses the locations with known contents at the beginning of a block to work backward to predecessor blocks, looking for instructions that write to that location. For each variable in each input location, any such prior write instruction is retroactively made to look like a definition of the variable. Note that this propagation is not a flow algorithm because no convergence criterion is involved; it is simply a kind of spanning walk.

Debugger Processing
Name resolution, that is, binding a textual name to the appropriate entry in the debug symbol table, is in no way affected by whether or not a variable has split lifetime segments. After the symbol table entry is found, any sequence of lifetime segments is searched for one that includes the current point of execution indicated by the program counter (PC). If found, the location of the value is taken from that segment. Otherwise, the value of the variable is not available.

Forward Propagation
Forward propagation iterates over the flow graph and uses the locations with known contents at the end of each block to work forward to successor blocks to provide known contents at the beginning of other blocks. This is a classic "reaching definitions" flow algorithm, in which the input state of a location for a block is the intersection of the known contents from the predecessors.

In our case, the compiler also propagates definition points, which are the addresses of the instructions that begin the lifetime segments. For those variables that are known to occupy a location, the set of definitions is the union of all the definitions that flow into that location.

Information Promotion and Cleanup
The final step of compiler processing is to combine information for adjacent blocks where possible. This action saves space in the debugging symbol table but does not affect the accuracy of the description. Descriptions for by-reference bind parameters are next merged with the descriptions for the associated reference variables. Finally, lifetime segment information not already associated with symbol table entries is copied back.

Usage Example
To illustrate how a user sees the results of this processing, consider the small C program in Figure 4. Note that the numbers in the left column are listing line numbers. When DOCT8 is compiled, linked, and executed under debugger control, the dialogue shown in Figure 5 appears. The figure also includes interpretive comments.

Known Limitations
The following limitations apply to the existing split lifetime support.

Multiple Active Split Children
While the compiler analysis correctly determines multiple active split child variables and the debug symbol table correctly describes them, OpenVMS DEBUG does not currently support multiple active child variables. When searching a symbol's lifetime segments for one that includes the current PC, the first match is taken as the only match.

Two-part Variables
Support for two-part variables (those occupying two registers) assumes that a complete definition will occur within a single basic block.

    385  doct8 () {
    386
    387      int i, j, k;
    388
    389      i = 1;
    390      j = 2;
    391      k = 3;
    392
    393      if (foo(i)) {
    394          j = 17;
    395      } else {
    396
    397          k = 18;
    398      }
    399
    400      printf("%d, %d, %d\n", i, j, k);
    401
    402  }

Object File Representation
The object file debugging symbol table representation for split lifetime variables is actually quite simple. Instead of a single address for a variable, there is a sequence of lifetime segment descriptions. Each lifetime segment consists of

• The range of addresses over which the child location applies
• The location (in a register, at a certain offset in the current stack frame, indirect through a register or stack location, etc.)
• The set of addresses that provide definitions for this lifetime segment

By convention, the last segment in the sequence can have the address range 0 to FFFFFFFF (hex). This

address range is used for a static variable, for example in a FORTRAN COMMON block, that has a default allocation that applies whenever no active children exist.

Figure 4
C Example Routine DOCT8 (Source with Listing Line Numbers)

$ run doct8
         OpenVMS Alpha Debug64 Version T7.2-001
%I, language is C, module set to DOCT8
DBG> step/into
stepped to DOCT8\doct8\%LINE 391
   391:     k = 3;
DBG> examine i, j, k

%W, entity 'i' was not allocated in memory (was optimized away)
%W, entity 'j' does not have a value at the current PC
%W, entity 'k' does not have a value at the current PC

Note the difference in the message for variable i compared to the messages for variables j and k. We see that variable i was not allocated in memory (registers or otherwise), so there is no point in ever trying to examine its value again. Variables j and k, however, do not have a value "at the current PC." Somewhere later in the program they will have a value, but not here. The dialogue continues as follows:

DBG> step
stepped to DOCT8\doct8\%LINE 391
   391:     k = 3;
DBG> step
stepped to DOCT8\doct8\%LINE 393
   393:     if (foo(i)) {
DBG> examine j, k
%W, entity 'j' does not have a value at the current PC
DOCT8\doct8\k:  3
    value defined at DOCT8\doct8\%LINE 391

Here we see that j is still undefined but k now has a value, namely 3, which was assigned at line 391. The source indicates that j was assigned a value at line 390, before the assignment to k, but j's assignment has yet to occur. Skipping ahead in the dialogue to the print statement at line 400, we see the following:

DBG> set break %line 400
DBG> go
break at DOCT8\doct8\%LINE 400
   400:     printf("%d, %d, %d\n", i, j, k);
DBG> examine j
DOCT8\doct8\j:  2
    value defined at DOCT8\doct8\%LINE 390
    value defined at DOCT8\doct8\%LINE 394
DBG> examine k
DOCT8\doct8\k:  18
    value defined at DOCT8\doct8\%LINE 397+4
    value defined at DOCT8\doct8\%LINE 391

This portion of the dialogue shows that more than one definition location is given for both j and k. Which of each pair applies depends on which path was taken in the if statement. If a variable has an apparently inappropriate value, this mechanism provides a means to take a closer look at those places, and only those places, from which that value might have come.

Figure 5 Dialogue Resulting from Running DOCT8

That is, at the end of a basic block, if the second part of a definition is missing, then the initial part is discarded and forgotten.

Consider the following FORTRAN fragment:

    COMPLEX X, Y
    X = ...
    Y = X + (1.0, 0.0)

Suppose that the last use of variable X occurs in the assignment to variable Y, so that X and Y can be allocated in the same location, in particular, the same register pair. In this case, the definition of Y requires only one instruction, which adds 1.0 to the real part of the location shared by X and Y. Because there is no second instruction to indicate completion of the definition, the definition will be lost by our implementation.

Semantic Stepping

A major problem with stepping by line through optimized code is that the apparent source program location "bounces" back and forth, with the same line often appearing again and again. In large part, this bouncing is due to a compiler optimization called code scheduling, in which instructions that arise from the same source line are scheduled, that is, reordered and intermixed with other instructions, for better execution performance.

OpenVMS DEBUG, like most debuggers, interprets the STEP/LINE (step by line) command to mean that the program should execute until the line number changes. Line numbers change more frequently in scheduled code than in unoptimized code. For example, in sample programs from the SPEC95 Benchmark Suite, the average number of instructions in sequence that share the same line number is typically between 2 and 3, and typically 50 to 70 percent of those sequences consist of just 1 instruction! In contrast, if only instruction-level scheduling is disabled, then the average number of instructions is between 4 and 6, with 20 to 30 percent consisting of one instruction. In a compilation with no optimization, there are 8 to 12 instructions in a sequence, with roughly 5 percent consisting of a single instruction.

A second problem with stepping by line through an optimized program is that, because of the behavior of revisiting the same line again and again, the user is never quite sure when the line has finished executing. It is unclear when an assignment actually occurs or a control flow decision is about to be made.

In unoptimized code, when a user requests a breakpoint on a certain line, the user expects execution to stop just before that line, hence before the line is carried out. In optimized code, however, there is no well-defined location that is "before the line is carried out," because the code for that line is typically scattered about, intermixed, and even combined with the code for various other lines. It is usually possible, however, to identify the instruction that actually carries out the effect of the line.

Semantic Event Concept
We introduce a new kind of stepping mode called semantic stepping to address these problems. Semantic stepping allows the program to execute up to, but not including, an instruction that causes a semantic effect. Instructions that cause semantic effects are instructions that

• Assign a value to a user variable
• Make a control flow decision
• Make a routine call

Not all such instructions are appropriate, however. We start with an initial set of candidate instructions and refine it. The following sections describe the heuristics that are currently in use.

Assignment
The candidates for assignment events are the instructions that assign a value to a variable (or to one of its split children). The second instruction in an assignment to a two-part variable is excluded. Stopping between the two assignments is inadvisable because at that point the variable no longer has the complete old state and does not yet have the complete new state.

Branches
There are two kinds of branch: unconditional and conditional. An unconditional branch may have a known destination or an unknown destination.

Unconditional branches with known destinations most often arise as part of some larger semantic construct such as an if-then-else or a loop. For example, code for an if-then-else construct generally has an implicit join that occurs at the end of the statement. The join takes the form of a jump from the end of one alternative to the location just past the last instruction of the other (which has no explicit jump and falls through into the next statement). This jump turns the inherently symmetric join at the source level into an asymmetric construction at the code stream level. Unconditional jumps almost never define interesting semantic events; some related instruction usually provides a more useful event point, such as the termination test in the case of a loop. One exception is a simple goto statement, but these are very often optimized away in any case. Consequently, unconditional branches with known destinations are not treated as semantic events.

Unconditional branches with unknown destinations are really conditional branches: they arise from constructs such as a C switch statement implemented as a table dispatch or a FORTRAN assigned GO TO statement. These branches definitely are interesting points at which to allow user interaction before the new direction is taken. Thus, the compiler retains unconditional branches as semantic events.

Similarly, in general, conditional branches to known destinations are important semantic event points. Often more than one branch instruction is generated for a single high-level source construct, for example, a decision tree of tests and branches used to implement a small C switch statement. In this case, only the first in the execution sequence is used as the semantic event point.

Calls
Most calls are visible to a user and constitute semantically interesting events. However, calls to some run-time library routines are usually not interesting because these calls are perceived to be merely software implementations of primitive operations, such as integer division in the case of the Alpha architecture. GEM internally marks calls to all its own run-time support routines as not semantically interesting. Compiler front ends accomplish this where appropriate for their own set of run-time support routines by setting a flag on the associated entry symbol node.

Compiler Processing
In most cases, the compiler can identify semantic event locations by simple predicates on each instruction. The exceptions are

• The second of the two instructions that assign values to a two-part variable, which is identified during split lifetime analysis.
• Conditional branches that are part of a larger construct, which are identified during a simple pass over the flow graph.

Object Module Representation
The object module debugging semantic event representation contains a sequence of address and event kind pairs, in ascending address order.

Debugger Processing
Semantic stepping in the debugger involves a new algorithm for determining the range of instructions to execute. This algorithm is built on a debugger primitive mechanism that supports full-speed execution of user instructions within a given range of addresses but traps any transfer out of that range, whether by reaching the end or by executing any kind of branch or call instruction.

Semantic stepping works as follows. Starting with the current program counter address, OpenVMS DEBUG finds the next higher address that is a semantic event point; this is the target event point. OpenVMS DEBUG executes instructions in the address range that starts at the address of the current instruction and ends at the instruction that precedes the target event point. The range execution terminates in the following two cases:

1. If the next instruction to execute is the target event point, then execution reached the end of the target range and the step operation is complete.

2. If the next instruction to execute is not the target event point, then the next address becomes the current address and the process repeats (silently).

Note that, unlike the algorithm that determines the range for stepping by line, the new algorithm does not require an explicit test for the kind of instruction, in particular, to test whether it is a kind of branch. The compiler already marks branches with the semantic event attribute, if appropriate. Also unlike the traditional stepping-by-line algorithm, the new algorithm does not consider the source line number.

Visible Effect
With semantic stepping, a user's perception of forward progress through the code is no longer dominated by the side effects of code scheduling, that is, stopping every few instructions regardless of what is happening. Rather, this perception is much more closely related to the actual semantic behavior, that is, stopping every statement or so, independent of how many instructions from disparate statements may have executed.

Note that jumping forward and backward in the source may still occur, for example, when code motions have changed the order in which semantic actions are performed. Nothing about semantic event handling attempts to hide such reordering.

Inlining

Procedure call inlining can be confusing when using a traditional debugger. For example, if routine INNER is inlined into routine CALLER and the current point of execution is within INNER, should the debugger report the current source location as at a location in the caller routine or in the called routine? Neither is completely satisfactory by itself. If the current line is reported as at the location within INNER, then that information will appear to conflict with information from a call stack traceback, which would not show routine INNER. If the current line is reported as though in CALLER, then relevant location information from the callee will be obscured or suppressed. Worse yet, in the case of nested inlining, potentially crucial information about the intermediate call path may not be available in any form.

The problem of dealing with inlining was solved long ago by Zellweger; at least the topic has not been treated again since. Zellweger's approach adds additional information to an otherwise traditional table that maps from instruction addresses to the corresponding source line numbers. Our approach is different: it includes additional information in the scope description of the debugging symbol table.

A key underpinning for inline support is the ability to accurately describe scopes that consist of multiple discontiguous ranges of instruction addresses, rather than the traditional single range. This capability is quite independent of inlining as such. However, because code from an inlined routine is freely scheduled with other code from the calling context, dealing accurately with the resulting disjoint scopes is an essential building block for effective support.

Goals for Debugger Support
Our overall goal is to support debugging of inlined code with expected behavior, that is, as though the inlining has not occurred. More specifically, we seek to provide the ability to

• Report the source location corresponding to the current position in the code
• Display parameters and local variables of an inlined routine
• Show a traceback that includes call frames corresponding to inlined routines
• Set a breakpoint at a given routine entry
• Set a breakpoint at a given line number (from within an inlined routine)
• Call an inlined routine

We have achieved these goals to a substantial extent.

GEM Locators
Before describing the mechanisms for inlining, we introduce the GEM notion of a locator. A locator describes a place in the source text. The simplest kinds of locator describe a point in the source, including the name of the file, the line within that file, and the column within that line; they even describe the point at which that file was included by another file (as for a C or C++ #include directive), if applicable.

A crucial characteristic of locators is that they are all of a uniform fixed size that is no larger than an integer or pointer. (How this is achieved is beyond the scope of this paper.) In particular, locators are small enough that every tuple node in the intermediate language (IL) and every code cell in the generated code stream contains one. Moreover, GEM as a whole is quite meticulous about maintaining and propagating high-quality locator information throughout its optimization and code generation.

An additional kind of locator was introduced for inlining support. This inline locator encodes a pair that consists of a locator (which may also be an inline locator) and the address of an associated scope node in the GEM symbol table.

Compiler Processing
Debugging optimized code support for inlining generally builds on and is a minor enhancement of the GEM inlining mechanism. Inlining occurs during an early part of the GEM optimizer phase.

Inlining is implemented in GEM as follows:

• Within the scope that contains the call site, an inline scope block is introduced. This scope represents the result of the inlining operation. It is populated with local variable declarations that correspond one-to-one with the formal parameters of the inlined routine.

• The actual arguments of the call are transformed into assignments that initialize the values of the surrogate parameter variables.

• The inline scope is also made to contain a body scope, which is a copy of the body of the inlined routine, including a copy of its local variables.

• The original call is replaced with a jump to a copy of the IL for the body of the routine, in which references to declarations or parameters of the routine are replaced with references to their corresponding copied declarations. In addition, returns from the routine are replaced with jumps back to the tuple following the original call.

• Similar "boundary adjustments" are made to deal with function results, output parameters, choice of entry point (when there is more than one, as might occur for FORTRAN alternate entry statements), etc. (The bookkeeping is a bit intricate, but it is conceptually straightforward.)

The calling routine, which now incorporates a copy of the inlined routine, is then further processed as a normal (though larger) routine.

Inlining Annotations for Debugging
The main changes introduced for debugging optimized code support are as follows.

• The newly created inline scope block is annotated with additional information, namely,
  - A pointer to the routine declaration being inlined.
  - The locator from the call that is replaced. In a simple call with no arguments, there may be nothing left in the IL from the original call after inlining is completed; this locator captures the original call location for possible later use, for example, as a supplement to the information that maps instruction addresses to source line numbers.

• As the code list of the original inlined routine is copied, each locator from the original is replaced by a new inline locator that records
  - The original locator.
  - The newly created inline scope into which it is being copied.

As a result of these steps, every inlined instruction can be related back to the scope into which it was inlined and hence to the routine from which it was inlined, regardless of how it may be modified or moved as a result of subsequent optimization.

Note that these additional steps are an exception to the general assertion that debugging optimized code support occurs after code generation and just prior to object code emission. These steps in no way influence the generated code; they affect only the debugging symbol table that is output.

Prologue and Epilogue Sets
The prologue of a routine generally consists of those instructions at the beginning of the routine that establish the routine stack frame (for example, allocate stack and save the return address and other preserved registers) and that must be executed before a debugger can usefully interpret the state of the routine. For this reason, setting a breakpoint at the beginning of a routine is usually (transparently) implemented by setting a breakpoint after the prologue of that routine is completed. Conversely, the epilogue of a routine consists of those instructions at the end of a routine that tear down the stack frame, reestablish the caller's context, and make the return value, if any, available to the caller. For this reason, stopping at the end of a routine is usually (transparently) implemented by setting a breakpoint before the epilogue of that routine begins.

One benefit of inlining is that most prologue and epilogue code is avoided; however, there may still be some scope management associated with scope entry and exit. Also, some programming language-related environment management associated with the scope may exist and should be treated in a manner analogous to traditional prologue and epilogue code. The problem is how to identify it, because most of the traditional compiler code generation hooks do not apply.

The model we chose takes advantage of the semantic event information that we describe in the section Semantic Stepping. In particular, we define the first semantic event that can be executed within the inlined routine to be the end of the prologue. For reasons discussed later, we define the last instruction (not the last semantic event) of the inlined code as the beginning of the epilogue. As a result of unrelated optimization effects, each of these may turn out to be a set of instructions. Determination of inline prologue and epilogue sets occurs after split lifetime and semantic event determination is completed so that the results of those analyses can be used.

To determine the set of prologue instructions, for each inline instance, GEM starts with every possible entry block and scans forward through the flow graph looking for the first semantic event instruction that can be reached from that entry. The set of such instructions constitutes the prologue set for that instance of the inlined routine. This is a spanning walk forward from the routine entry (or entries) that stops either when a block is found to contain an instruction from the given inline instance or when the block has already been encountered (each block is considered at most once). Note that there may be execution paths that include one or more instructions from an inlining, none of which is a semantic event instruction.

The set of epilogue instructions is determined using an inverse of the prologue algorithm. The process starts with each possible exit block and scans backward through the flow graph looking for the last instruction (that is, the instruction closest to the routine exit) of an inline instance that can reach an exit.

Note that prologue and epilogue sets are not strictly symmetric: prologue sets consist of only instructions that are also semantic events, whereas epilogue sets include instructions that may or may not be semantic events.

Object Module Representation
To describe any inlining that may have occurred during compilation, we include three new kinds of information in the debugging symbol table.

If the instructions contained in a scope do not form a single contiguous range, then the description of the scope is augmented with a discontiguous range description. This description consists of a sequence of ranges. (The scope itself indicates the traditional approximate range description to provide backward compatibility with older versions of OpenVMS DEBUG.) This augmented description applies to all scopes, whether or not they are the result of inlining.

For a scope that results from inlining a call, the description of the scope is augmented with a record that refers to the routine that was inlined as well as the line number of the call. Each scope also contains two entries that consist of the sequence of prologue and epilogue addresses, respectively.

Backward compatibility is fully maintained. An older version of OpenVMS DEBUG that does not recognize the new kinds of information will simply ignore it.

Debugger Processing
As the debugger reads the debugging symbol table of a module, it constructs a list of the inlined instances for each routine. This process makes it possible to find all instances of a given routine. Note, however, that if every call of the routine is expanded inline and the routine cannot otherwise be called from outside that module, then GEM does not create a noninlined (closed-form) version of the routine.

Report Source Location
It is a simple process to report the source location that corresponds to the current code address. When stopped inside the code resulting from an inlined routine, the program counter maps directly to a source line within the inlined routine.

Display Parameters and Local Variables
As is the case for a noninlined routine, the scope description for an inlined routine contains copies of the parameters and the local variables. No special processing is required to perform name binding for such entities.

Include Inlined Calls in Traceback
The debugger presents inlined routines as if they are real routine calls. A stack frame whose current code address corresponds to an inlined routine instance is described with two or more virtual stack frames: one or more for the inlined instance(s) and one for the ultimate caller. (An example is shown later in Figure 7.)

Set Breakpoints at Inlined Routine Instances
The strategy for setting breakpoints at inlined routines is based on a generalization of processing that previously existed for C++ member functions. Compilation of C++ modules can result in code for a given member function being compiled every time the class or template definition that contains the member function is compiled. We refer to all these compilations as clones. (It is not necessary to distinguish which of them is the original.) In our generalization, an inlined routine call instance is treated like a clone. To set a breakpoint at a routine, the debugger sets breakpoints at all the end-of-prologue addresses of every clone of the given routine in all the currently active modules.

Set Breakpoints at Inlined Line Number Instances
The strategy for setting breakpoints on line numbers shares some features of setting breakpoints on routines, with additional complications. Compiler-reported line numbers on OpenVMS systems are unique across all the files included in a compilation. It follows that the same file included in more than one compilation may have different associated line numbers.

To set a breakpoint at a particular line number, that line number needs to be first normalized relative to the containing file. This normalized line number value is then compared to normalized line numbers for that same file that are included in other compilations. (If different versions of the same named file occur in different compilations, the versions are treated as unrelated.) The original line number is converted into the set of address ranges that correspond to it in all modules, taking into account inlining and cloning.

  +++++ File DOCFJ-INLINE-2.FOR
   1  C Main routine
   2  C
   3        INTEGER A, C
   4        TYPE *, A(3, C(0))
   5        END
   6  C
   7        FUNCTION A(I, L)
   8        INTEGER B
   9        A = B(5, I) + 2*L
  10        RETURN
  11        END
  12  C
  13        FUNCTION B(J, K)
  14        INTEGER B, C
  15        B = C(9) + J + K
  16        END

  +++++ File DOCFJ-INLINE-2A.FOR
   1  C
   2        FUNCTION C(J)
   3        INTEGER C
   4        C = 2*J
   5        RETURN
   6        END

Figure 6
Program to Illustrate Inlining Support

If we compile, link, and run this program using the OpenVMS DEBUG option, we can step to a place in routine B that is just before the call to routine C and then request a traceback of the call stack. This dialogue is shown in Figure 7.

Figure 7 shows that pseudo stack frames are reported for routines A and B, even though the call of routine B has been inlined into routine A and the call of routine A has been inlined into the main program. The main difference from a real stack frame is the extra line that reports that the "above routine is inlined."

Call a Routine That Is Inlined
If the compiler creates a closed-form version of a routine, then the debugger can call that routine independent of whether there may also be inlined instances of the routine. If no such version of the routine exists, then the debugger cannot call the routine.

Limitations
In a real stack frame, it is possible to examine (and even deposit into) the real machine registers, rather than examine the variables that happen to be allocated in machine registers. In an inlined stack frame, this operation is not well defined and consequently not supported. In a noninlined stack frame, these operations are still allowed.

An attractive feature that would round out the expected behavior of inlined routine calls would be to support stepping into or over the inlined call in the same way that is possible for noninlined calls. This feature is not currently supported; execution always steps into the call.

Usage Example
Inlining support has many aspects, but we will illustrate only one: a call traceback that includes inlined calls. Consider the sample program shown in Figure 6. This program has four routines: three are combined in a single file (enabling the GEM FORTRAN compiler to perform inline optimizations), and the last is in a separate file. To help correlate the lines of code in these two files with those in Figure 7, we added line numbers to the left of the code. Note that these numbers are not part of the program.

GEMEVN$ run DOCFJ-INLINE-2
         OpenVMS Alpha Debug64 Version T7.2-001
         Language: FORTRAN, Module: DOCFJ-INLINE-2$MAIN

DBG> step/semantic
stepped to DOCFJ-INLINE-2$MAIN\B\%LINE 15
    15:         B = C(9) + J + K
DBG> show calls
 module name           routine name   line       rel PC             abs PC
*DOCFJ-INLINE-2$MAIN   B                15   000000000000001C   000000000002006C
     ----- above routine is inlined
*DOCFJ-INLINE-2$MAIN   A                 9   0000000000000040   0000000000020054
     ----- above routine is inlined
*DOCFJ-INLINE-2$MAIN   DOCFJ-INLINE-2$MAIN
                                         4   0000000000000038   0000000000020038
                                             0000000000000000   FFFFFFFF8590716C

Figure 7
OpenVMS DEBUG Dialogue to Illustrate Inlining Support

Performance and Resource Usage

We gathered a number of statistics to determine typical resource requirements for using the enhanced debugging optimized code capability compared to the traditional practice of debugging unoptimized code. A short summary of the findings follows.

• All metrics tend to show wide variance from program to program, especially small ones.
• Generating traditional debugging symbol information increases the size of object modules typically by 50 to 100 percent on the OpenVMS system. Executable image sizes show similar but smaller size increases.
• Generating enhanced symbol table information adds about 2 to 5 percent to the typical compilation time, although higher percentages have been seen for unusually large programs.
• Generating enhanced symbol table information uses significant memory during compilation but does not affect the peak memory requirement of a compilation.
• Generating enhanced symbol table information further increases the size of the symbol table information compared to that for an unoptimized compilation. On the OpenVMS system, this adds 100 to 200 percent to the debugging symbol table of object modules and perhaps 50 to 100 percent for executable images.
• Compiling with full optimization reduces the resulting image size. Total net image size increases typically by 50 to 80 percent.

A more detailed presentation of findings follows. Tables 1 through 3 present data collected using production OpenVMS Alpha native compilers built in December 1996. In developing these results, we used five combinations of compilation options as follows:

S1: no optimization (noopt), no debugging information (nodebug, nodbgopt)
S2: no optimization (noopt), normal debugging information (debug, nodbgopt)
S4: full (default) optimization (opt), no debugging information (nodebug, nodbgopt)
S5: full optimization (opt), normal debugging information only (debug, nodbgopt)
S8: full optimization (opt), enhanced debugging information (debug, dbgopt)

Note that the option combination numbering system is historical; we retained the system to help keep data logs consistent over time.

Compile-time Speed
The incremental compile-time cost of creating enhanced symbol table information is presented in Table 1 for a sampling of BLISS, C, and FORTRAN modules. The data in this table can be summarized as follows:

• Traditional debugging (column 1) increases the total compilation time by about 1 percent.
• Enhanced debugging (column 2) increases the compilation time by about 4 percent. The largest component of that time, approximately 3 percent, is attributed to the flow analysis involved in handling split lifetime variables (column 3).
• Debugging tends to increase as a percentage of time in larger modules, which suggests that processing time is slightly nonlinear in program size; however, this increase does not seem to be excessive even in very large modules.

Compile-time Space
The compile-time memory usage during the creation of enhanced symbol information is presented in Table 2.

Table 1
Percent of Compilation Time Used to Create/Output Debugging Information

Module          S2 (noopt, debug,   S8 (opt, debug,   S8 (Split Lifetime
                nodbgopt)           dbgopt)           Analysis Only)

BLISS CODE
GEM_AN          0.3%                1.1%              0.7%
GEM_DB          0.9                 1.8               1.3
GEM_DF          0.8                 5.2               4.4
GEM_FB          0.7                 3.5               2.7
GEM_IL_PEEP     0.6                 14.4              13.9

C CODE
C_METRIC        1.5                 5.2               4.1
GRAM            0.5                 2.9               2.2
INTERP          1.2                 4.5               3.2

FORTRAN CODE
MATRIX300X      nm                  nm                nm
NAGL            1.4                 13.0              11.9
SPICE_V07       3.0                 6.4               4.7
WAVEX           2.5                 6.3               4.8

Average         1.2%                4.3%              3.2%
Typical range   (0.5%-1.5%)         (3.0%-7.0%)       (2.0%-5.0%)

Note: "nm" represents "not meaningful," that is, too small to be accurately measured.

Table 2
Key Dynamic Memory Zone Sizes during BLISS GEM Compilations

File            Peak      SYMBOL    EIL       CODE      OM        %      %      %
                Total     ZONE      ZONE      ZONE      ZONE      Peak   Larg   EIL

BLISS CODE
GEM_AN           2,507      130        85       184        15     0.6%    8%    18%
GEM_DF          11,305      836     1,672     2,056     1,180    10      57     71
GEM_FB           4,694      316       522       457       304     6      58     58
GEM_IL_PEEP     40,419    1,606    17,666     4,411    14,143    34      80     80

C CODE
C_METRIC         7,381    1,115       494     2,563       167     2       6     34
GRAM             3,031       82       815       211       267     9      33     33
INTERP           3,563      354       308       688       131     4      20     43

FORTRAN CODE
MATRIX300X         934      143       227       101        58     6      26     26
NAGL             6,267    1,520     1,791     1,742       681    11      38     38
SPICE_V07        6,234    1,051     3,256       885       459     7      14     14
WAVEX           12,812    4,676     3,119     3,482       684     5      14     22

Average                                                           9%     32%    40%

Note: All numbers to the left of the percentages are thousands of bytes, not multiples of 1,024.

Column Key:

Peak Total     The peak dynamic memory allocated in all zones during the compilation
SYMBOL ZONE    The zone that holds the GEM symbol table
EIL ZONE       The largest EIL zone (used for the expanded intermediate representation)
CODE ZONE      The zone that holds the GEM generated code list
OM ZONE        The zone that holds split lifetime and other working data
%Peak          The OM ZONE size as a percentage of the Peak Total size
%Larg          The OM ZONE size as a percentage of the largest single zone in the compilation
%EIL           The OM ZONE size as a percentage of the EIL ZONE size

The following is a summary of the data, where OM ZONE refers to the temporary working virtual memory zone used for split lifetime analysis:

• The OM ZONE size averages about 10 percent of the peak compilation size.
• The OM ZONE size is one-quarter to one-half of the EIL ZONE size. (The latter is well known for typically being the largest zone in a GEM compilation.)
• Since the OM ZONE is created and destroyed after all EIL ZONEs are destroyed, the OM ZONE does not contribute to establishing the peak total size.

Object Module Size
The increased size of enhanced symbol table information for both object files and executable image files is shown in Table 3.

In Table 3, the application or group of modules is identified in the first column. The columns labeled S1, S2, etc. give the resulting size for the combination of compilation options described earlier. Object module and executable image data is presented in successive rows.

Three ratios of particular interest are computed.

S2/S1: This ratio shows the object or image size with traditional debugging information compared to a base compilation without any debugging information. This ratio indicates the additional cost, in terms of increased object and image file size, associated with doing traditional symbolic debugging.

(S8-S5)/(S2-S1): This ratio shows the increase in debugging symbol table size (exclusive of base object, image text, etc.) due to the inclusion of enhanced information compared to the traditional symbol table size.

S8/S2: This ratio shows the object or image size with enhanced debugging information with optimization compared to the traditional debugging size without optimization.

The last ratio, S8/S2, is especially interesting because it combines two effects: (1) the reduction in size as a result of compiler optimization, and (2) the increase in size because of the larger debugging symbol table needed to describe the result of the optimization. The resulting net increase is reasonably modest.

Summary and Conclusions

There exists a small but significant literature regarding the debugging of optimized code, yet very few debuggers take advantage of what is known. In this paper we describe the new capabilities for debugging optimized code that are now supported in the GEM compiler system and the OpenVMS DEBUG component of the OpenVMS Alpha operating system. These capabilities deal with split lifetime variables and currency determination, semantic stepping, and procedure inlining. For each case, we describe the problem addressed and then present an overview of GEM compiler and OpenVMS DEBUG processing and the object module representation that mediates between them. All but the inlining support are included in OpenVMS DEBUG V7.0 and in GEM-based compilers for Alpha systems that have been shipping since 1996. The inlining support is

Table 3
Object/Executable (.OBJ/.EXE) File Sizes (in Number of Blocks) for Various OpenVMS Components

                  S1          S2                      S4          S5          S8
                  (noopt,     (noopt,                 (opt,       (opt,       (opt,       (S8-S5)/
                  nodebug,    debug,      S2/S1       nodebug,    debug,      debug,      (S2-S1)   S8/S2
File              nodbgopt)   nodbgopt)   Ratio       nodbgopt)   nodbgopt)   dbgopt)     Ratio     Ratio

BLISS CODE
GEM_*.OBJ         31,477      51,069      1.62        27,483      47,031      68,728      1.11      1.35
GEM_*.EXE         12,160      29,543      2.43        10,373      27,755      32,288      0.26      1.09

C CODE
C_METRIC.OBJ         436         653      1.50           478         733       1,680      4.36      2.57
C_METRIC.EXE         250         348      1.39           250         385         581      2.00      1.67
GRAM.OBJ             102         120      1.19           100         117         224      5.94      1.87
GRAM.EXE              60          70      1.17            58          69          91      2.20      1.30
INTERP.OBJ           140         207      1.48           134         205         450      3.66      2.17
INTERP.EXE            80         113      1.41            75         113         167      1.64      1.47

FORTRAN CODE
MATRIX300X.OBJ        20          34      1.70            16          29          71      3.00      2.08
MATRIX300X.EXE        19          29      1.53            15          25          34      0.90      1.17
NAGL.OBJ              42          63      1.51           288         509       1,178      3.11      1.84
NAGL.EXE             289         388      1.34           187         333         469      1.37      1.21
SPICE.OBJ          1,652       3,117      1.89         1,073       2,571       4,916      1.60      1.58
SPICE.EXE          1,031       1,660      1.61           549       1,318       1,803      0.77      1.09
WAVEX.OBJ            555       1,639      2.95           393       1,556       2,949      1.29      1.80
WAVEX.EXE            634       1,190      1.88           490       1,167       1,437      0.49      1.21

currently in field test. Work is under way to provide similar capabilities in the Ladebug debugger component of the DIGITAL UNIX operating system.

There are and will always be more opportunities and new challenges to improve the ability to debug optimized code. Perhaps the biggest problem of all is to figure out where best to focus future attention. It is easy to see how the capabilities described in this paper provide major benefits. We find it much harder to see what capability could provide the next major increment in debugging effectiveness when working with optimized code.

References

1. P. Zellweger, "Interactive Source-Level Debugging of Optimized Programs," Ph.D. Dissertation, University of California, Xerox PARC CSL-84-5 (May 1984).
2. J. Hennessy, "Symbolic Debugging of Optimized Code," ACM Transactions on Programming Languages and Systems, vol. 4, no. 3 (July 1982): 323-344.
3. M. Copperman, "Debugging Optimized Code Without Being Misled," Ph.D. Dissertation, University of California at Santa Cruz, UCSC Technical Report UCSC-CRL-93-21 (June 11, 1993).
4. G. Brooks, G. Hansen, and S. Simmons, "A New Approach to Debugging Optimized Code," ACM SIGPLAN '92 Conference on Programming Language Design and Implementation, SIGPLAN Notices, vol. 27, no. 7 (July 1992): 1-11.
5. Convex Computer Corporation, CONVEX CXdb Concepts (Richardson, Tex.: Convex Press, Order No. DSW-471, May 1991).
6. D. Coutant, S. Meloy, and M. Ruscetta, "DOC: A Practical Approach to Source-Level Debugging of Globally Optimized Code," Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, Atlanta, Ga. (June 22-24, 1988): 125-134.
7. L. Zurawski, "Source-Level Debugging of Globally Optimized Code with Expected Behavior," Ph.D. Dissertation, University of Illinois at Urbana-Champaign (1989).
8. U. Hölzle, C. Chambers, and D. Ungar, "Debugging Optimized Code with Dynamic Deoptimization," ACM SIGPLAN '92 Conference on Programming Language Design and Implementation, San Francisco, Calif. (June 17-19, 1992) and SIGPLAN Notices, vol. 27, no. 7 (July 1992): 32-43.
9. L. Pollock and M. Soffa, "High-Level Debugging with the Aid of an Incremental Optimizer," Proceedings of the 21st Hawaii International Conference on System Sciences (January 1988): 524-532.
10. L. Pollock, M. Bivens, and M. Soffa, "Debugging Optimized Code via Tailoring," International Symposium on Software Testing and Analysis (August 1994).
12. A. Adl-Tabatabai, "Source-Level Debugging of Globally Optimized Code," Ph.D. Dissertation, Carnegie Mellon University, CMU-CS-96-133 (June 1996).
13. D. Blickstein et al., "The GEM Optimizing Compiler System," Digital Technical Journal, vol. 4, no. 4 (Special Issue 1992): 121-136.
14. B. Beander, "VAX DEBUG: An Interactive, Symbolic, Multilingual Debugger," ACM SIGSOFT/SIGPLAN Software Engineering Symposium on High-Level Debugging, ACM SIGPLAN Notices, vol. 18, no. 8 (August 1983): 173-179.
15. OpenVMS Debugger Manual, Order No. AA-QSBJB-TE (Maynard, Mass.: Digital Equipment Corporation, November 1996).
16. R. Sites, ed., Alpha Architecture Reference Manual, 3d ed. (Woburn, Mass.: Digital Press, 1998).
17. T. Bingham, N. Hobbs, and D. Husson, "Experiences Developing and Using an Object-Oriented Library for Program Manipulation," OOPSLA Conference Proceedings, ACM SIGPLAN Notices, vol. 28, no. 10 (October 1993): 83-89.
18. Digital UNIX Ladebug Debugger Manual, Order No. AA-PZ7EE-TE (Maynard, Mass.: Digital Equipment Corporation, March 1996).

Biographies

Ronald F. Brender
Ronald F. Brender is a senior consultant software engineer in Compaq's Core Technology Group, where he is working on both the GEM compiler and the UNIX Ladebug projects. During his career, Ron has worked in advanced development and product development roles for BLISS, FORTRAN, Ada, and multilanguage debugging on DIGITAL's DECsystem-10, PDP-11, VAX, and Alpha computer systems. He served as a representative on the ANSI and ISO standards committees for FORTRAN 77 and later for Ada 83, also serving as a U.S. Department of Defense invited Distinguished Reviewer and a member of the Ada Board and the Ada Language Maintenance Committee for more than eight years. Ron joined Digital Equipment Corporation in 1970, after earning the degrees of B.S.E. (engineering sciences), M.S. (applied mathematics), and Ph.D. (computer and communication sciences) in 1965, 1968, and 1969, respectively, all from the University of Michigan.

Jeffrey E. Nelson
Jeffrey E. Nelson is a senior software developer at Candle Corporation in Minneapolis, Minnesota. He currently develops message broker software for Roma BSP, Candle's middleware framework for integrating business applications. Previously at DIGITAL, Jeff was a principal software engineer on the OpenVMS and Ladebug debugger projects. He specialized in debug symbol table formats, run-time language support, and computer architecture support. He contributed to porting the OpenVMS debugger from the VAX to the Alpha platform. He represented DIGITAL on the industry-wide PLSIG committee that developed the DWARF debugging symbol table format. Jeff holds an M.S. degree in computer science and applications from Virginia Polytechnic Institute and State University and a B.S. degree in computer science from the University of Wisconsin-LaCrosse. Jeff is an alumnus of the Graduate Engineering Education Program (GEEP), has been awarded one patent, and has previously published and presented work in the area of real-time, object-oriented garbage collection.

Mark E. Arsenault
Mark E. Arsenault is a principal software engineer in Compaq's OpenVMS Engineering Group working on the OpenVMS debugger. Mark has implemented support in the debugger for 64-bit addressing, C++, and inlining. He joined DIGITAL in 1981 and has worked on several software development tools, including the BLISS compiler and the Source Code Analyzer. Mark holds two patents, one each for the Heap Analyzer and for the Correlation Facility. He received a B.A. in physics from Boston University in 1981.

William M. McKeeman

Differential Testing for Software

Differential testing, a form of random testing, is a component of a mature testing technology for large software systems. It complements regression testing based on commercial test suites and tests locally developed during product development and deployment. Differential testing requires that two or more comparable systems be available to the tester. These systems are presented with an exhaustive series of mechanically generated test cases. If (we might say when) the results differ or one of the systems loops indefinitely or crashes, the tester has a candidate for a bug-exposing test. Implementing differential testing is an interesting technical problem. Getting it into use is an even more interesting social challenge. This paper is derived from experience in differential testing of compilers and run-time systems at DIGITAL over the last few years and recently at Compaq. A working prototype for testing C compilers is available on the web.

The Testing Problem

Successful commercial computer systems contain tens of millions of lines of handwritten software, all of which is subject to change as competitive pressures motivate the addition of new features in each release. As a practical matter, quality is not a question of correctness, but rather of how many bugs are fixed and how few are introduced in the ongoing development process. If the bug count is increasing, the software is deteriorating.

Quality

Testing is a major contributor to quality; it is the last chance for the development organization to reduce the number of bugs delivered to customers. Typically, developers build a suite of tests that the software must pass to advance to a new release. Three major sources of such tests are the development engineers, who know where to probe the weak points; commercial test suites, which are the arbiters of conformance; and customer complaints, which developers must address to win customer loyalty. All three types of test cases are relevant to customer satisfaction and therefore have value to the developers. The resultant test suite for the software under test becomes intellectual property, encapsulates the accumulated experience of problem fixes, and can contain more lines of code than the software itself.

Testing is always incomplete. The simplest measure of completeness is statement coverage. Instrumentation can be added to the software before it is tested. When a test is run, the instrumentation generates a report detailing which statements are actually executed. Obviously, code that is not executed was not tested.

Random testing is a way to make testing more complete. One value of random testing is introducing the unexpected test: 1,000 monkeys on the keyboard can produce some surprising and even amusing input! The traditional approach to acquiring such input is to let university students use the software.

Testing software is an active field of endeavor. Interesting starting points for gathering background

information and references are the web site maintained by Software Research, Inc.¹ and the book Software Testing and Quality Assurance.²

Developer Distaste

A development team with a substantial bug backlog does not find it helpful to have an automatic bug finder continually increasing the backlog. The team priority is to address customer complaints before dealing with bugs detected by a robot. Engineers argue that the randomly produced tests do not uncover errors that are likely to bother customers. "Nobody would do that," "That error is not important," and "Don't waste our time; we have plenty of real errors to fix" are typical developer retorts.

The complaints have a substantial basis. During a visit to our development group, Professor C. A. R. Hoare of Oxford University succinctly summarized one class of complaints: "You cannot fix an infinite number of bugs one at a time."

Seeking an Oracle

The ugliest problem in testing is evaluating the result of a test. A regression harness can automatically check that a result has not changed, but this information serves no purpose unless the result is known to be correct. The very complexity of modern software that drives us to construct tests makes it impractical to provide a priori knowledge of the expected results. The problem is worse for randomly generated tests. There is not likely to be a higher level of reasoning that can be applied, which forces the tester to instead follow the tedious steps that the computer will carry out during the test run. An oracle is needed.

One class of results is easy to evaluate: program crashes. A crash is never the right answer. In the triage that drives a maintenance effort, crashes are assigned to the top priority category. Although this paper does not contain an in-depth discussion of crashes, all crashes caused by differential testing are reported and constitute a substantial portion of the discovered bugs.

Differential testing, which is covered in the following section, provides part of the solution to the problem of needing an oracle. The remainder of the solution is discussed in the section entitled Test Reduction.
Some software needs a stronger remedy thana stream of bug reports. Moreover, a stream of bug DifferentialTes ting reports may consume the energy that could be applied in more general and productive ways. Diffe rential testing addresses a specific problem-the The developer push back just described indicates that cost of evaluating test results. Every test yields some a differential testing effort must be based on a per­ result. If a single test is te d to several comparable pro­ ceived need tor better testing fr om within the product grams (for example, several C compilers), and one pro­ development team. Performing the testing is poindess gram gives a different result, a bug may have been if the developers cannot or will not use the results. exposed . For usable sofhvare,very few generated tests Difkrential testing is most easily applicable to soft­ will result in differences. Because it is fe asible to gener­ ware whose quality is already under control, that is, ate millions of tests, even a fe w diffe rences can result in software fo r which there arc fe w known outstanding a substantial stream of detected bugs. The trade-off is errors. Ru nning a very large number of tests and to use many computer cycles instead of human effort to expending team eftort only when an error is fo und design and evaluate tests. Particle physicists use the becomes an attractive alternative. Team members' same paradigm: they examine millions ofmosdy boring morale increases when the software passes millions of events to find a tew high-interest particle interactions. hard tests and test coverage of their code expands. Several issues must be addressed to make diffe ren­ The technology should be important to r applica­ ti al testing effective. The first issue concerns the qual­ tions fo r which there is a high premium on correct­ ity of the test. Any random string fe d to a C compiler ness. In particular, product differentiation can be yields some result-most likely a diagnostic. 
Feeding achieved tor software that has few fa ilures in compari­ random strings to the compiler soon becomes unpro­ son to the competition. Diffe rential testing is designed ductive, however, because these tests provide only to provide such comparisons. shallow coverage of the compiler logic. Developers The technology should also be important fo r appli­ must devise tests that drive deep into the tested com­ cations t(x which there is a high premium on indepen­ piler. The second issue relates to fa lse positives. The dently duplicating the behavior of some existing results of two tested programs may diffe r and yet application. Identical behavior is important when old still be correct, depending on the requirements. For sofhvare is being retired in tavor of a new implementa­ example, a C compiler may freely choose among alter­ ti on, or when the new software is challenging a domi­ natives fo r unspecified, undefined, or implementation­ nant competitor. defined constructs as detailed in the C Standard .' Similarly, even to r required diagnostics, the fo rm of Seeking an Oracle the diagnostic is unspecifiedand therefore difficultto The ugliest problem in testing is evaluating the result compare across systems. The third issue deals with the of a test. A regression harness can automatically check amount of noise in the generated test case. Given a that a resu lt has not changed , but this intormation successful random test, there is likely to be a much serves no purpose unless the result is known to be cor- shorter test that exposes the same bug. The developer

Digital Te chnical journal Vo l. 10 No. I 1998 101 who is seeking ro fix the bug strongly prefers to use the level. One compiler could not handle OxOOOOOl if shorter test. The fo urth issue concerns comparingpro­ there were too many leading zeros in the hexadecimal grams that must run on diHerent platforms. Diffe rential number. Another compiler crashed when faced with testing is easily adapted to distributed testing. the tloating-point constant l E l 000. lvlany compilers fa iled to properly process digraphs and trigraphs. Te st Case Quality Stochastic Grammar ·writing good tests requires a deep knowledge of the A vocabulary is a set of two kinds of symbols: terminal system under test. Writing a good test generator and nonterminal. The terminal symbols are what one requires embedding that same knowledge in the gen­ can write down. The nonterminal symbols are names erator. This section presents the testing of C compilers fo r higher level language structures. For example, the as an example. symbol "+" is a terminal symbol, and the symbol "additive-expression" is a non terminal symbol of the Te sting C Compilers C programming language. A grammar is a set of rules For a C compiler, we constructed sample C source files fo r describing a language. A rule has a left side and a at several levels of increasing quality. right side. The leftside is always a nonterminal sym­ bol. The right side is a sequence of symbols. The rule 1. Sequence of ASCII characters gives one definition for the structure named by the left 2. Sequence of words, separators, and white space side. For example, the rule shown in Figure l defines 3. Syntactically correct C program theuse of"+" fo r addition in C. This rule is recursive,

4. Type-correct C program defining additive-expression in terms of itself. There is one special nonterminal symbol called the 5. Statically conforming C program start symbol. At any time, a non terminal symbol can be 6. Dynamically conforming C program replaced by the right side of a rule fo r which it is the left 7. Model-conforming C program side. Beginning with the start symbol, nonterminals can be replaced until there are no more nontenninal Given a test case selected from any level, we con­ symbols. The result of many replacements is a sequence structed additional nearby test cases by randomly of terminal symbols. If the grammar describes C, the adding or deleting some character or word from the sequence of terminal symbols will fo rm a syntactically given test case. An altered test case is more likely to correct C program. Randomly generated white-space cause the compilers to issue a diagnostic or to crash. can be inserted during or aftergeneration. Both the selected and the altered test cases are valuable. A stochastic grammar associates a probability with One of the more entertaining testing papers reports each grammar rule. the results of fe eding random noise to the C run-time For level 2, we wrote a stochastic grammar fo r lex­ library • A typical library function crashed or hung on 30 emes and a Tel script to interpret the grammar,;" per­ percent of the test cases. C compilers should do better, forming the replacements just described. Whenever a but this hypothesis is worth checking. Only rarely nonterminal is to be expanded, a new random number would a tested compiler fa ced with Ievei l input execute is compared with the fixed rule probabilities to direct any code deeper than the lexer and its diagnostics. One the choice of right side. test at this level caused the compiler to crash because an In either case, at this level and at levels 3 through 7, input line was too long fo r the compiler's buffer. 
setting the many fixed choice probabilities permits At level 2, given lexically correct text, parser error some control of the distribution of output values. detection and diagnostics are tested, and at the same Not all assignments of probabilities make sense. The time the Jexer is more thoroughly covered . The C probabilities fo r the right sides that define a specific Standard describes the fo rm ofC tokens and C "white­ nonterminal must add up to 1.0. The probability of space" (blanks and comments). It is relatively easy ro expanding recursive rules must be weighted toward a write a lexeme generator that will eventually produce nonrecursive alternative to :�void a recursion loop in every correct token and white-space. What surprised us the generator. A system of linear equations can be was the kind of bugs that the testing revealed at this solved fo r the expected lengths of strings generated by

additive-expression : additive-expression + multiplicative-expression

Figure 1 Rule That Defines the Use of "+" for Addition in C
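The replacement process just described can be sketched in ordinary C. The paper's generator was written in Tcl; the two-rule expression grammar below, the atom list, and the names P_RECURSE, generate_expression, and demo_expression are illustrative choices, not the paper's code. The recursive alternatives are deliberately given less than a 50 percent probability so that the expected output length stays finite.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Percent chance of taking a recursive right side; kept well under 50
   so that the generator does not run away (expected length is finite). */
#define P_RECURSE 40

static void emit(char *buf, size_t cap, const char *s) {
    strncat(buf, s, cap - strlen(buf) - 1);  /* append, never overflow */
}

static void gen_mul(char *buf, size_t cap);

/* additive-expression : additive-expression + multiplicative-expression
   additive-expression : multiplicative-expression                       */
static void gen_add(char *buf, size_t cap) {
    if (rand() % 100 < P_RECURSE) {
        gen_add(buf, cap);       /* left operand */
        emit(buf, cap, " + ");
    }
    gen_mul(buf, cap);           /* right operand, or the whole expression */
}

/* multiplicative-expression, bottoming out in illustrative terminal atoms */
static void gen_mul(char *buf, size_t cap) {
    static const char *atoms[] = { "a", "b", "argc", "1", "42" };
    if (rand() % 100 < P_RECURSE) {
        gen_mul(buf, cap);
        emit(buf, cap, " * ");
    }
    emit(buf, cap, atoms[rand() % 5]);
}

/* Expand the start symbol into a string of terminal symbols. */
char *generate_expression(char *buf, size_t cap, unsigned seed) {
    srand(seed);
    buf[0] = '\0';
    gen_add(buf, cap);
    return buf;
}

/* Convenience wrapper with a fixed buffer and seed, for demonstration. */
const char *demo_expression(void) {
    static char buf[4096];
    return generate_expression(buf, sizeof buf, 1998);
}
```

Each run with a fresh seed yields another syntactically correct expression; a driver would wrap the result in a complete translation unit before feeding it to the compilers under test.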

If, for some set of probabilities, all the expected lengths are finite and nonnegative, this set of probabilities ensures that the generator does not often run away.

Increasing Test Quality

At level 3, given syntactically correct text, one would expect to see declaration diagnostics while more thoroughly covering the code in the parser. At this level, the generator is unlikely to produce a test program that will compile. Nevertheless, compiler errors were detected. For example, one parser refused the expression 1==1=1.

The syntax of C is given in the C Standard. Using the concept of stochastic grammar, it is easy to write a generator that will eventually produce every syntactically correct C translation-unit. In fact, we extended our Tcl lexer grammar to all of C.

At level 4, given a syntactically correct generated program in which every identifier is declared and all expressions are type correct, the lexer, the parser, and a good deal of the semantic logic of the compiler are covered. Some generated test programs compile and execute, giving the first interesting differential testing results. Achieving level 4 is not easy but is relatively straightforward for an experienced compiler writer. A symbol table must be built and the identifier use limited to those identifiers that are already declared. The requirements for combining arithmetic types in C (int, short, char, float, double with long and/or unsigned) were expressed grammatically. Grammar rules defining, for example, int-additive-expression replaced the rules defining additive-expression. The replacements were done systematically for all combinations of arithmetic types and operators. To avoid introducing typographical errors in the defining grammar, much of the grammar itself was generated by auxiliary Tcl programs. The Tcl grammar interpreter did not need to be changed to accommodate this more accurate and voluminous grammatical data. We extended the generator to implement declare-before-use and to provide the derived types of C (struct, union, pointer). These necessary improvements led to thousands of lines of tricky implementation detail in Tcl. At this point, Tcl, a nearly structureless language, was reaching its limits as an implementation language.

At level 5, where the static semantics of the C Standard have been factored into the generator, most generated programs compile and run. Figure 2 contains a fragment of a generated C test program from level 5.

A large percentage of level 5 programs terminate abnormally, typically on a divide-by-zero operation. A peculiarity of C is that many operators produce a Boolean value of 0 or 1. Consequently, a lot of expression results are 0, so it is likely for a division operation to have a zero denominator. Such tests are wasted. The number of wasted tests can be reduced somewhat by setting low probabilities for using divide, for creating Boolean values, or for using Boolean values as divisors.

Regarding level 6, dynamic standards violations cannot be avoided at generation time without a priori choosing not to generate some valid C, so instead we implement post-run analysis. For every discovered difference (potential bug), we regenerate the same test case, replacing each arithmetic operator with a function call, inside which there is a check for standards violations. The following is a function that checks for "integer shift out of range." (If we were testing C++, we could have used overloading to avoid having to include the type signature in the name of the checking function.)

    int int_shl_int_int(int val, int amt)
    {
        assert(amt >= 0 && amt < sizeof(int) * 8);
        return val << amt;
    }

For example, the generated text

    a << b

is replaced upon regeneration by the text

    int_shl_int_int(a, b)
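The shift checker generalizes directly to the divide-by-zero case that dominates level 5 failures. The paper shows only the shift wrapper; the companion checker below, including its name int_div_int_int, is a hypothetical reconstruction of the same pattern rather than code from the paper:

```c
#include <assert.h>
#include <limits.h>

/* Wrapper substituted for "/" on regeneration: a dynamic standards
   violation becomes an assertion failure instead of a wild result. */
int int_div_int_int(int num, int den) {
    assert(den != 0);                        /* division by zero is undefined */
    assert(!(num == INT_MIN && den == -1));  /* quotient overflow is undefined */
    return num / den;
}
```

Under this scheme the generated text a / b would be regenerated as int_div_int_int(a, b).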

Figure 2 Generated C Expression (a fragment of a machine-generated level 5 C test program)

If, on being rerun, the regenerated test case asserts a standards violation (for example, a shift of more than the word length), the test is discarded and testing continues with the next case.

Two problems with the generator remain: (1) obtaining enough output from the generated programs so that differences are visible and (2) ensuring that the generated programs resemble real-world programs so that the developers are interested in the test results. Solving these two problems brings the quality of test input to level 7. The trick here is to begin generating the program not from the C grammar nonterminal symbol translation-unit but rather from a model program described by a more elaborate string in which some of the program is already fully generated. As a simple example, suppose you want to generate a number of print statements at the end of the test program. The starting string of the generating grammar might be

    #define P(v) printf("v=%x\n", v)
    main() {
        declaration-list
        statement-list
        print-list
        exit(0);
    }

where the grammatical definition of print-list is given by

    print-list : P ( identifier ) ;
    print-list : print-list P ( identifier ) ;

In the starting string above there are three nonterminals for the three lists instead of just one for the standard C start symbol translation-unit. Programs generated from this starting string will cause output just before exit. Because differences caused by rounding error were uninteresting to us, we modified this print macro for types float and double to print only a few significant digits. With a little more effort, the expansion of print-list can be forced to print each variable exactly once.

Alternatively, suppose a test designer receives a bug report from the field, analyzes the report, and fixes the bug. Instead of simply putting the bug-causing case in the regression suite, the test designer can generalize it in the manner just presented so that many similar test cases can be used to explore for other nearby bugs.

The effect of level 7 is to augment the probabilities in the stochastic grammar with more precise and direct means of control.

Forgotten Inputs

The elaborate command-line flags, config files, and environment variables that condition the behavior of programs are also input. Such input can also be generated using the same toolset that is used to generate the test programs. The very first test on the very first run with generated compiler directive flags revealed a bug in a compiler under test: it could not even compile its own header files.

Results

Table 1 indicates the kinds of bugs we discovered during the testing. Only those results that are exhibited by very short text are shown. Some of the results derive from hand generalization of a problem that originally surfaced through random testing.

There was a reason for each result. For example, the server crash occurred when the tested compiler got a stack overflow on a heavily loaded machine with a very large memory. The operating system attempted to dump a gigabyte of compiler stack, which caused all the other active users to thrash, and many of them also dumped for lack of memory. The many disk drives on the server began a dance of the lights that sopped up the remaining free resources, causing the operators to boot the server to recover. Excellent testing can make you unpopular with almost everyone.

Test Distribution

Each tested or comparison program must be executed where it is supported. This may mean different hardware, operating system, and even physical location. There are numerous ways to utilize a network to distribute tests and then gather the results. One particularly simple way is to use continuously running watcher programs. Each watcher program periodically examines a common file system for the existence of some particular files upon which the program can act. If no files exist, the watcher program sleeps for a while and tries again. On most operating systems, watcher programs can be implemented as command scripts.

There is a test master and a number of test beds. The test master generates the test cases, assigns them to the test beds, and later analyzes the results. Each test bed runs its assigned tests. The test master and test beds share a file space, perhaps via a network. For each test bed there is a test input directory and a test output directory.

A watcher program called the test driver waits until all the (possibly remote) test input directories are empty. The test driver then writes its latest generated test case into each of the test input directories and returns to its watch-sleep cycle. For each test bed there is a test watcher program that waits until there is a file in its test input directory. When a test watcher finds a file to test, the test watcher runs the new test, puts the results in its test output directory, and returns to the watch-sleep cycle. Another watcher program called the test analyzer waits until all the test output directories contain results.


Table 1

Source Code                       Resulting Problem
if (1.1)                          Constant float expression evaluated false
1 ? 1 : 1/0                       Several compiler crashes
0.0F/0.0F                         Compiler crash
x != 0 ? x/x : 1                  Incorrect answer
1 == 1 == 1                       Spurious syntax error
-!0                               Spurious type error
0x000000000000000                 Spurious constant out of range message
0x80000000                        Incorrect constant conversion
1E1000                            Compiler crash
1 >> INT_MAX                      Twenty-minute compile time
'ab'                              Inconsistent byte order
int i=sizeof(i=1);                Compiler crash
LDBL_MAX                          Incorrect value
(++n,0) ? --n : 1                 Operator ++ ignored
if (sizeof(char)+d) f(d)          Illegal instruction in code generator
i=(unsigned)-1.0F;                Random value
int f(register());                Compiler crash or spurious diagnostic
int (...(x)...);                  Enough nested parentheses to kill the compiler:
                                  spurious diagnostic (10 parentheses),
                                  compiler crash (100 parentheses),
                                  server crash (10,000 parentheses)
digraphs (<: <% etc.)             Spurious error messages
a/b                               The famous Pentium divide bug (we did not
                                  catch it, but we could have)

Then the results, both input and output, are collected for analysis, and all the files are deleted from every test input and output directory, thus enabling another cycle to begin.

Using the file system for synchronization is adequate for computations on the scale of a compile-and-execute sequence. Because of the many sleep periods, this distribution system runs efficiently but not fast. If throughput becomes a problem, the test system designer can provide more sophisticated remote execution. The distribution solution as described is neither robust against crashes and loops nor easy to start. It is possible to elaborate the watcher programs to respond to a reasonable number of additional requirements.

Test Analysis

The test analyzer can compare the output in various ways. The goal is to discover likely bugs in the compiler under test. The initial step is to distinguish the test results by failure category, using corresponding directories to hold the results. If the compiler under test crashes, the test analyzer writes the test data to the crash directory. If the compiler under test enters an endless loop, the test analyzer writes the test data to the loop directory. If one of the comparison compilers crashes or enters an endless loop, the test analyzer discards the test, since reporting the bugs of a comparison compiler is not a testing objective. If some, but not all, of the test case executions terminate abnormally, the test case is written to the abend directory. If all the test cases run to completion but the output differs, the case is written to the diff directory. Otherwise, the test case is discarded.

Test Reduction

A tester must examine each filed test case to determine if it exposes a fault in the compiler under test. The first step is to reduce the test to the shortest version that qualifies for examination.

A watcher called the crash analyzer examines the crash directory for files and moves found files to a working directory. The crash analyzer then applies a shortening transformation to the source of the test case and reruns the test.
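One shortening pass of this kind can be sketched as follows. The predicate still_fails stands in for "rerun the test and see whether the compiler under test still fails"; here it merely looks for a trigger token, so the function names and the built-in example are illustrative only:

```c
#include <stdio.h>
#include <string.h>

/* Stand-in for rerunning the compiler: "the bug still reproduces"
   is modeled as "the text still contains the trigger token 1/0". */
static int still_fails(const char *src) {
    return strstr(src, "1/0") != NULL;
}

/* One reduction cycle: try deleting each line in turn, keeping a
   deletion only if the failure is preserved. Returns lines deleted.
   (Sketch: assumes the test source fits in 4 KB.) */
int reduce_once(char *src, int (*fails)(const char *)) {
    int deleted = 0;
    size_t pos = 0;
    while (src[pos]) {
        char *nl = strchr(src + pos, '\n');
        size_t end = nl ? (size_t)(nl - src) + 1 : strlen(src);
        char candidate[4096];
        /* candidate = src with the line at [pos, end) removed */
        snprintf(candidate, sizeof candidate, "%.*s%s",
                 (int)pos, src, src + end);
        if (fails(candidate)) {
            strcpy(src, candidate);  /* keep the shorter, still-failing test */
            deleted++;               /* stay at pos: a new line sits here now */
        } else {
            pos = end;               /* back the change out; try the next line */
        }
    }
    return deleted;
}

/* Demonstration: reduce a three-line test down to its failing core. */
const char *demo_reduce(void) {
    static char buf[256];
    strcpy(buf, "int x = 1;\nint y = 1/0;\nint z = 2;\n");
    reduce_once(buf, still_fails);
    return buf;
}
```

Repeating such passes until a whole cycle leaves the source unchanged is exactly the fixed-point loop the crash analyzer runs, with the compile-and-run step in place of the token check.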

If the compiler under test still crashes, the original test case is replaced by the shortened test case. Otherwise, the change is backed out and a new transformation is tried. We used 23 heuristic transformations, including

• Remove a statement
• Remove a declaration
• Change a constant to 1
• Change an identifier to 1
• Delete a pair of matching braces
• Delete an if clause

When all the transformations have been systematically tried once, the process is started over again. The process is repeated until a whole cycle leaves the source of the test unchanged. A similar process is used for the loop, abend, and diff directories.

The typical result of the test reduction process is to reduce generated C test programs of 500 to 600 lines to equally useful C programs of only a few lines. It is not unusual to use 10,000 or more compile operations during test reduction. The trade-off is using many computer cycles instead of human effort to analyze the ugly generated test case.

Test Presentation

After the shortest form of the test case is ready, the test analyzer wraps it in a command script that

1. Reports environmental information (compiler version, compiler flags, name of the test platform, time of test, etc.)
2. Reports the test output or crash information
3. Reruns the test (the test input is embedded in the script)

The test analyzer writes the command scripts to a results directory.

Test Evaluation and Report

The person who is managing the differential testing setup periodically runs scripts that have accumulated in the results directory to determine which ones expose a problem of interest to the development team. One problem peculiar to random testing is that once a bug is found, it will be found again and again until it is fixed. This argues the case for giving high priority to the bugs exposed by differential testing. Uninteresting and duplicate tests are manually discarded, and the rest are entered into the development team bug queue.

Summary and Directions

Differential testing, suitably tuned to the tested program, complements traditional software testing processes. It finds faults that would otherwise remain undetected. It is cost-effective. It is applicable to a wide range of large software. It has proven unpopular with the developers of the tested software.

This technology exposed new bugs in C compilers each day during its use at DIGITAL. Most of the bugs were in the comparison compilers, but a significant number of bugs in DIGITAL code were found and corrected.

Numerous special-purpose differential testing harnesses were put into use at DIGITAL, each testing some small part of a large program. For example, multidimensional Fortran arrays, optimizer constant folding, and a new printf function each were tested by ad hoc differential testers.

The Java API (run-time library) is a large body of relatively new code that runs on a wide variety of platforms. Since "Write once, run anywhere" is the Java motto, the standard for conformance is high; however, experience has shown that the standard is difficult to achieve. Differential testing should help. What needs to be done is to generate a sequence of calls into the API on various Java platforms, comparing the results and reporting differences. Technically, this procedure is much simpler than testing C compilers. Chris Rohrs, an MIT intern at DIGITAL, wrote a system entirely in Java, gathering method signature information directly out of the binary class files. This API tester may be used when the quality of the Java API reaches the point where the implementors are not buried in bug reports and when there are more independent implementations of the Java run time.

Differential testing can be used to increase test coverage. Using the coverage data taken from running the standard regression suite as a baseline, the developers can run random tests to see if coverage can be increased. Developers can freely add coverage-increasing tests to the test suite using the test output as an initial oracle. No harm is done because even if the recorded result is wrong, the compiler is no worse off for it. If at a later time a regression is observed on the generated test, either the new or the old version was wrong. The developers are alerted and can react. John Parks and John Hale applied this technology to DIGITAL's C compilers.

The problem of retiring an old compiler in favor of a new one requires the new one to duplicate old behavior so as not to upset the installed base. Differential testing can compare the old and the new, flagging all new results (correct or not) that disagree with the old results.

Differential testing can be used to measure quality. Supposing that the majority rules, a million tests can be run on a set of competing compilers. The metric is failed tests per million runs. The authors of the failed compilers can either fix the bugs or prove the majority wrong. In any case, quality improves.

At Compaq, differential testing opportunities arise regularly and are often satisfied by testing systems that are less elaborate than the original C testing system, which has been retired.
Acknowledgments

This work was begun in the Digital Cambridge Research Laboratory by Andy Payne based on his earlier experience in testing DIGITAL Alpha hardware. The author and August Reinig continued the development as an advanced development project in the compiler product group in Nashua, New Hampshire. Steve Rogers and Christine Gregowske contributed to the work, and Steve eventually placed a free working prototype on the web.7 Bruce Foster managed and encouraged the project, giving the implementors ideas faster than they could be used.

References and Notes

1. Information on testing is available at http://www.testworks.com/Institute/HotList/.
2. B. Beizer, Software Testing and Quality Assurance (New York: Van Nostrand Reinhold, 1984).
3. ISO/IEC 9899:1990, Programming Languages - C, 1st ed. (Geneva, Switzerland: International Organization for Standardization, 1990).
4. B. Miller et al., "An Empirical Study of the Reliability of UNIX Utilities," CACM, vol. 33, no. 12 (December 1990): 32-44.
5. Information on Tcl/Tk is available at http://sunscript.sun.com/.
6. J. Ousterhout, Tcl and the Tk Toolkit (Reading, Mass.: Addison-Wesley, 1994).
7. Information on DDT distribution is available at http://steve-rogers.com/projects/ddt/.

General Reference

W. McKeeman, A. Reinig, and A. Payne, "Method and Apparatus for Software Testing Using a Differential Testing Technique to Test Compilers," U.S. Patent 5,754,860 (May 1998).

Biography

William M. McKeeman

William McKeeman develops system software for Compaq Computer Corporation. He is a senior consulting engineer in the Core Technology Group. His work encompasses fast-turnaround compilers, unit testing, differential testing, physics simulation, and the Java compiler. Bill came to DIGITAL in 1988 after more than 20 years in academia and research. Most recently, he was a research professor at the Aiken Computation Laboratory of Harvard University, visiting from the Wang Institute Masters in Software Engineering program, where he served as Professor and Chair of the Faculty. He has served on the faculties of the University of California at Santa Cruz and Stanford University and on various state and university computer advisory committees. In addition, he has been an ACM and IEEE National Lecturer and chairman of the 4th Annual Workshop in Microprogramming and is a member of the IFIP Working Group 2.3 on Programming Methodology. Bill founded the Summer Institute in Computer Science programs at Santa Cruz and Stanford and was technical advisor to Boston University for the Wang Institute 1988 Summer Institute. He received a Ph.D. in computer science from Stanford University, an M.A. in mathematics from The George Washington University, a B.A. in mathematics from the University of California at Berkeley, and pilot wings from the U.S. Navy. Bill has coauthored 16 patents, 3 books, and numerous published papers in the areas of compilers, programming language design, and programming methodology.

ISSN 0898-901X

Printed in U.S.A. EC-P9706-18/98 12 19 1.0 Copyright © Compaq Computer Corporation