Instruction Scheduling and Executable Editing

by Eric Schnarr and James R. Larus
Computer Sciences Department, University of Wisconsin-Madison
1210 West Dayton Street, Madison, WI 53706 USA
(608) 262-9519
{schnarr,larus}@cs.wisc.edu

Abstract

Modern microprocessors offer more instruction-level parallelism than most programs and compilers can currently exploit. The resulting disparity between a machine's peak and actual performance, while frustrating for computer architects and chip manufacturers, opens the exciting possibility of low-cost instrumentation for measurement, simulation, or emulation. Instrumentation code that executes in previously unused cycles is effectively hidden. On two superscalar SPARC processors, a simple, local scheduler hid an average of 13% of the overhead cost of profiling instrumentation in the SPECINT benchmarks and an average of 33% of the profiling cost in the SPECFP benchmarks.

1. Introduction

Modern microprocessors offer more instruction-level parallelism than most programs and compilers can currently exploit. For example, the most recent generation of RISC chips (the Digital Alpha 21164, IBM/Motorola PowerPC 620, SGI R10000, and Sun UltraSPARC) are 4-way superscalar processors that execute up to four independent instructions in a single cycle. Even recent high-end x86 processors (the Intel Pentium Pro and AMD K5) are 3-way superscalars. Unfortunately, exploited instruction-level parallelism lags far behind the hardware capacity. Cvetanovic and Bhandarkar found that programs running on a 2-way superscalar Digital Alpha 21064 could dual issue only 26-50% of their instructions, which means that in 67-90% of cycles, only one instruction executed [3]. Similarly, Diep et al. found that on a 4-way superscalar PowerPC 620, four integer SPEC benchmarks completed an average of 1.05-1.25 instructions per cycle and three floating-point SPEC benchmarks completed an average of 1.0-1.9 instructions per cycle [4]. This large disparity between a machine's peak and actual performance, while frustrating for computer architects and chip manufacturers, opens the exciting possibility of low-cost or even no-cost instrumentation for measurement, simulation, or emulation. Scheduling reduces instrumentation's perceived cost by putting instrumentation instructions in previously unused processor cycles. Instrumentation code that executes in unused cycles is effectively hidden.

Program instrumentation has been used for many purposes, including performance measurement, computer architecture simulations, and software fault isolation and error detection [1]. Although direct instrumentation typically incurs lower cost than alternative approaches, the increased program running time caused by instrumentation is occasionally a limiting factor and is always an annoyance. At one extreme, in parallel or real-time systems, instrumentation overhead can perturb system behavior by introducing time dilation. Even for less demanding applications, the cost of program profiling or error detection is currently too high for production codes. Previous instrumentation systems expended considerable effort to formulate efficient instrumentation code sequences [11], place them parsimoniously [2], and insert instrumentation without affecting a program's behavior [9]. However, few systems attempted to exploit instruction-level parallelism by scheduling the instrumentation code into a program.

This paper describes a simple instruction scheduler that we added to the EEL executable editing library [10]. We applied the scheduler to a common program instrumentation, which profiles basic block execution frequencies using a four-instruction sequence inserted into almost every basic block.

On the 3-way superscalar SuperSPARC processor, scheduling hides an average of 11% of the SPECINT overhead and an average of 44% of the SPECFP overhead. On the 4-way UltraSPARC, scheduling hides an average of 15% of overhead for the SPECINT benchmarks, but only 17% of overhead for the SPECFP benchmarks. This is because EEL produces a worse schedule than was originally found in these highly optimized programs. Factoring out the differences in scheduling algorithms, we find that the integer benchmarks remain about the same (13%), while scheduling profile instrumentation in the SPECFP benchmarks hides 27%. In the future, these results may improve, and scheduling may become even more attractive, with a more accurate and aggressive instrumentation scheduler and wider microarchitectures that offer further opportunities to hide instrumentation.

To implement instruction scheduling, we extended EEL with the specific details of a processor's microarchitecture. EEL records this and other architectural information using a concise, high-level specification describing the machine's instruction set architecture. A tool called Spawn [10] translates these specifications into executable C++ code, which becomes the part of EEL that decodes and interprets machine instructions. As part of this work, we extended the architecture specification language to capture salient features of a machine's microarchitecture and used this information to drive an instruction scheduler in EEL. This paper first describes this language extension and how EEL uses the information to schedule instructions. Then some timing measurements are given for scheduling instrumentation into the SPEC95 benchmarks.

The paper is organized as follows. Section 2 describes related work. Section 3 presents the extensions to Spawn and shows how they represent the information necessary for instruction scheduling. Section 4 describes our experiments on scheduling program instrumentation and presents the results. Finally, Section 5 concludes.

2. Related Work

This paper extends several areas of previous work. Patil and Fischer used a different form of parallelism to hide instrumentation overhead. On a multiprocessor, they ran the code to check pointer and array accesses on a second processor [12]. The additional processor reduced the perceived overhead of error checking by 2-55%. Our approach is more widely applicable, since instruction-level parallel processors are, or soon will be, ubiquitous and because instruction-level parallelism permits a tighter coupling between program and instrumentation.

Proebsting and Fraser described an efficient algorithm for detecting structural hazards in single pipeline processors and a language for concisely describing these machines [13]. Our description language is more verbose, but it also captures the syntax and semantics of machine instructions and scheduling constraints for superscalar machines. Our algorithm is also more general, if less efficient, since it works for superscalar processors as well as single pipeline machines.

Gyllenhaal described the machine description language (HMDES) used in the Illinois IMPACT compiler [6]. This language, like Spawn's description language, describes instruction encoding and scheduling constraints. HMDES describes instructions from several perspectives, which together provide the basic information needed by instruction schedulers: instruction latencies and resource usage. Spawn represents this information more concisely as a single semantic expression for a group of instructions, which describes when these instructions acquire and release microarchitectural resources. Spawn descriptions also capture instruction semantics and binary instruction encodings.

Schuette detected processor errors using unused instruction slots on a VLIW processor [15]. His control-flow monitoring technique inserts check operations into unfilled VLIW instructions. This is similar to EEL's rescheduling of instrumentation, except that EEL also reschedules the original instructions and can handle many different forms of instrumentation. Because a VLIW has fixed size instructions, Schuette's instrumentation did not increase code size, whereas on a superscalar machine, instrumentation code has a secondary effect of reducing performance by increasing instruction cache misses.

3. Spawn Extensions

Any executable editing tool depends on an accurate description of a machine's instruction set architecture. While most parts of EEL are machine independent, it still must disassemble, analyze, and modify binary machine instructions. Past experiences with executable editing tools demonstrated that the code that manipulates binary instructions is simple, but tedious and difficult to debug. For example, in the profiling and tracing tool qpt [9], over 2,000 lines of hand-written C code exist to manipulate binary instructions. Subtle errors in this code were difficult to detect by inspection and often lay undiscovered for months before a new input executable exercised them.

To remedy this problem and make EEL more easily portable across different processors, it uses a concise description of a machine's instruction set architecture written in a high-level architecture description language. These descriptions are short (the SPARC is 333 lines) and similar in form to the descriptions often found in architecture manuals, which makes them easier to validate.

Errors overlooked in a code review are also more likely to arise during testing, since a single description of an instruction's encoding and semantics underlies many different EEL instruction manipulation functions and different instructions often share common code in the description file.

Spawn [10] is the tool that converts an architecture description into the C++ code used by EEL. Spawn analyzes descriptions written in SADL (Spawn Architecture Description Language) to detect errors and extract the syntactic and semantic information needed by EEL. Spawn then reads an annotated C/C++ file and replaces its annotations by appropriate code produced using information extracted from the description. The resulting C/C++ file is compiled and linked into EEL to provide efficient functions for manipulating binary instructions.

Instruction scheduling requires more detailed information than the architectural descriptions that sufficed previously for EEL. In particular, scheduling requires a model of a machine's microarchitecture, especially its execution pipeline. Since a microarchitecture is specific to each particular processor implementation, this level of detail entails writing many more descriptions, so each description should be concise and easy to modify. To support instruction scheduling, we extended SADL to include pipeline timing and resource utilization information. This new information can be combined with the instruction semantic description to provide a complete map of an instruction's actions as it moves through a processor's execution pipeline.

Ideally, one would like to separate a microarchitecture-specific description from a general architecture description. SADL only partially accomplishes this goal by supporting functions that can encapsulate some of the timing and resource usage characteristics. The coupling of architectural and resource allocation information is necessary to permit Spawn to determine not only which registers are read from and written to, but also when values are read and when result values become available. On the other hand, our experience modeling the hyperSPARC, SuperSPARC, and UltraSPARC processors showed that the architecture semantics are easily carried over into a description of different microarchitectures, despite large changes in the pipeline description.

A (micro)architecture description, written in SADL, describes several aspects of a machine's architecture: instruction encodings, instruction semantics, architectural registers, and pipeline resources. The first three aspects (encodings, semantics, and registers) are described elsewhere [10]. Section 3.1 describes how pipeline resources and timing operators are combined with instruction semantics to encode details of an execution pipeline. Section 3.2 describes how an instruction scheduler uses this information to predict pipeline behavior.

3.1. Describing an Execution Pipeline

The example in figure 2 shows how the architectural semantics and microarchitectural timing and resource usage might be described for three ALU instructions on a ROSS hyperSPARC processor [14]. It starts by defining all the pipeline resources needed to model the timing behavior of the instructions. Second, the architectural registers are defined, along with aliases used to restrict the register types and incorporate resource usage information. Finally, the instruction semantics are defined along with their resource usage and timing characteristics.

[Figure 1 diagram, showing EEL, libbfd, the Architecture Description, and the Machine-Dependent .C.spawn file.]
Figure 1: Spawn translates an architecture description written in SADL and an annotated C/C++ file into executable C++ code, which becomes the part of EEL that decodes and interprets machine instructions.

// *** Define processor resources ***

// 2-way superscalar
unit Group 2

// flags for dual/single issue
val multi  is AR Group, ()
val single is AR Group 2, ()

// ALU and LSU capacities
unit ALU 1, ALUr 2, ALUw 1
unit LSU 1, LSUr 2, LSUw 1

// *** Define registers ***

// general purpose registers (GPR)
register untyped(32) R[32]

// Aliases for ALU read/write from GPR
alias signed(32) R4r[i] is AR ALUr, R[i]
alias signed(32) R4w[i] is AR ALUw, R[i]

// *** Define instructions ***

// Defining operators
val [ + - & | ^ ]
  is (\op.\a.\b.
      A ALU, x:=op a b, D 1, R ALU, x)
  @ [ add32 sub32 and32 or32 xor32 ]

// Defining shift operators
val [ << >>> >> ]
  is (\op.\a.\b.
      A ALU, is_shift, x:=op a b,
      D 1, R ALU, x)
  @ [ sll32 srl32 sra32 ]

// Get the second source operand:
// a register or immediate value
val src2
  is iflag=1 ? #simm13 : R4r[rs2]

// Semantic description of the
// instructions add, sub, and sra
sem [ add sub sra ]
  is (\op. multi,
      D 1, s1:=R4r[rs1], s2:=src2,
      R4w[rd] := op s1 s2)
  @ [ + - >> ]

Figure 2: Semantic description of the SPARC instructions add, sub, and sra. The resource usage and timing information are for the ROSS hyperSPARC microarchitecture.

Microarchitecture resources, such as register ports and ALUs, are described by a unit name and a value representing the number of copies of this resource in the processor. The ROSS hyperSPARC [14] described by this example is a dual-issue superscalar processor, with an ALU for arithmetic operations and an ALU for memory address calculations (LSU). The ALUr and ALUw units represent read and write ports to the register file for the arithmetic ALU. Similarly, the LSUr and LSUw units are ports used by the LSU. The values multi and single are used later to indicate if an instruction can be dual issued, or if it must execute by itself.

Architectural registers are described using the register declaration, which specifies the type and number of registers in the register file. This example declares a register file for the SPARC integer registers, where there are 32 registers, each 32 bits wide. Aliases provide alternative ways to access the register file. They allow the registers to be accessed as an alternative type, and allow arbitrary SADL expressions to be combined with the register access. In this example, aliases are used to access the registers as 32-bit signed integers, and to specify the use of either an ALU read port or an ALU write port. The types are necessary later on in the description to disambiguate overloaded values and operators.

Description of the pipeline behavior of an instruction is combined with the instruction's semantic description. The SADL commands A, R, AR, and D are used to describe when units are acquired and released and when the pipeline advances. The command A <unit> [<num>] acquires <num> copies of the unit, or stalls the pipeline if not enough copies of the unit are available. If <num> is omitted, it is assumed to be 1. R <unit> [<num>] releases <num> copies of a unit. The command AR <unit> [<num>] [<cycles>] acts like the A command, but it also releases the <num> copies of the same unit after the instruction executes for <cycles> cycles. AR is handy for acquiring a resource for a fixed amount of time, without having to do explicit R's later in the semantic expression. The command D [<num>] advances the pipeline by <num> cycles. In each cycle of an instruction's execution, Spawn applies all unit release events for the cycle first, followed by all the unit acquire events. Extra delays are inserted if there are not enough free resources to accommodate all the unit acquire events.

An instruction scheduling algorithm must also know when registers are read from and written to. When an instruction reads a register, Spawn records in which cycle the read occurs. Writes are more difficult, since most pipelined implementations forward values between instructions [7].

When a value is computed, Spawn remembers the cycle in which the computation finished. When this value is written back into a register, Spawn records the cycle in which the value was computed, not when the register assignment took place. Therefore, in SADL, an instruction's result value must be computed in the cycle immediately preceding the cycle in which the value becomes available to the next instruction.

The example in figure 2 continues by describing the semantics and pipeline behavior of the instructions add, sub, and sra. This is accomplished using val declarations, which act like macros, and sem declarations, which bind semantic expressions to instruction mnemonics. The effect of the sem statement in this example is to bind the instructions add, sub, and sra to semantic expressions which: (1) restrict the pipeline to at most 2 simultaneous instructions and advance the pipeline by one cycle; (2) acquire one or two ALU read ports and read the register values; (3) acquire the ALU functional unit and compute the instruction's result value; (4) advance the pipeline by one cycle and release the ALU read ports and the ALU functional unit; (5) acquire an ALU write port and update the destination register; and (6) advance the pipeline by one cycle and release the ALU write port. From this description, Spawn infers that these instructions can be dual issued, execute in 3 cycles, read their operands in cycle 1, produce a value at the end of cycle 1 that subsequent instructions can use, and update the register file in cycle 2. Note that an extra cycle was put before reading the registers, since the sethi instruction produces a value which is available at the end of cycle 0 and can be used by another instruction issued in the same cycle.

Given a machine description written in SADL, Spawn analyzes the instruction encodings, semantics, and timing. Instructions with identical timing and resource allocation patterns are grouped together to save space in the generated C++ code. Each group records the number of cycles it takes for a member instruction to pass completely through the pipeline, the resources acquired in each cycle, and the resources released in each cycle. Spawn also associates a cycle number with every access to the integer or floating point register file, for both reads and writes. For reads, this number indicates the pipeline execution cycle when the read occurs. For writes, it indicates when the written value actually becomes available to subsequent instructions, not when the value is written to the register file.

3.2. Predicting Pipeline Behavior

The key metric used by EEL's instruction scheduler is the number of cycles that the next instruction must wait before entering the execution pipeline. SADL describes the resource usage and timing for individual instructions. This information can describe the execution behavior of many superscalar processors executing a straight-line sequence of instructions.

Spawn passes this information to an instruction scheduler by filling in annotations in the C++ function pipeline_stalls, which, given a sequence of instructions, computes when the next instruction can start execution (see Appendix A for an overview of this function). This function starts with a representation of the microarchitecture pipeline state generated by previous instructions in the instruction sequence.
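To make this concrete, here is a minimal, self-contained C++ sketch of the kind of per-group timing table and inter-instruction pipeline state that the generated code could supply to pipeline_stalls. The type and field names (TimingGroup, PipelineState, and their members) are hypothetical, not EEL's or Spawn's actual data structures; the numbers follow the hyperSPARC inferences of Section 3.1 for the add/sub/sra group: dual issue, 3 cycles in the pipeline, operands read in cycle 1, a forwardable result at the end of cycle 1, and a register file update in cycle 2.

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical layout of one Spawn timing group: how long a member
// instruction stays in the pipeline, which units it acquires and releases
// in each cycle, and the cycle numbers associated with register accesses.
struct TimingGroup {
    unsigned cycles;                                        // cycles to traverse the pipeline
    std::vector<std::map<std::string, unsigned>> acquire;   // units acquired, per cycle
    std::vector<std::map<std::string, unsigned>> release;   // units released, per cycle
    unsigned operand_read_cycle;                            // when source registers are read
    unsigned result_ready_cycle;                            // when the result can be forwarded
    unsigned regfile_write_cycle;                           // when the register file is updated
};

// Hypothetical pipeline state kept between instructions: the last cycle in
// which each register was read, when each register's value becomes
// available, and the free copies of each unit in upcoming cycles.
struct PipelineState {
    std::map<unsigned, unsigned long> last_read_cycle;
    std::map<unsigned, unsigned long> value_ready_cycle;
    std::vector<std::map<std::string, unsigned>> free_units;
};

int main() {
    TimingGroup alu_group;
    alu_group.cycles = 3;
    alu_group.acquire = {
        { {"Group", 1} },                 // cycle 0: one of the two issue slots
        { {"ALUr", 2}, {"ALU", 1} },      // cycle 1: register read ports and the ALU
        { {"ALUw", 1} },                  // cycle 2: register write port
        { }                               // cycle 3: nothing further acquired
    };
    alu_group.release = {
        { },                              // cycle 0
        { },                              // cycle 1 (issue-slot release omitted in this sketch)
        { {"ALUr", 2}, {"ALU", 1} },      // cycle 2: read ports and ALU freed
        { {"ALUw", 1} }                   // cycle 3: write port freed
    };
    alu_group.operand_read_cycle = 1;
    alu_group.result_ready_cycle = 1;     // forwardable at the end of cycle 1
    alu_group.regfile_write_cycle = 2;

    std::cout << "add/sub/sra occupy the pipeline for "
              << alu_group.cycles << " cycles\n";
    return 0;
}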

[Figure 3 diagram, showing the original Executable and the Edited Executable.]

Figure 3: EEL schedules instrumentation by first analyzing the executable and allowing the application specific code (e.g., the program profiling tool) to select and place instrumentation. A new executable is then generated, which includes the instrumentation instructions. Scheduling is performed on each basic block as it is laid out in the new executable, causing the original and new instructions to be scheduled together.

The pipeline state includes history information, such as the last cycle in which each register was read and written and which units are currently acquired by previous instructions. pipeline_stalls uses this state information as it simulates the pipeline execution of the new instruction and computes how many stall cycles are needed to satisfy RAW, WAR, and WAW dependencies and structural hazards.

The Spawn microarchitecture models can describe only a subset of all the factors affecting the integer and floating point pipelines. Our goal has been to describe actual machines rather than hypothetical systems, so for simplicity, the Spawn descriptions only model the execution pipelines themselves. The descriptions contain no information about a processor's memory interface, instruction prefetching, write buffering, or instruction and data cache behavior. pipeline_stalls does not compute stalls due to these mechanisms. On the other hand, few scheduling algorithms take these features into account, since their behavior is data-dependent (cf. [8]). In addition, SADL does not yet describe out-of-order execution, since it was not needed for the descriptions produced so far.

4. Scheduling Instrumentation

EEL schedules instructions in a basic block (local scheduling). The instructions in the block come either from the original program (for rescheduling) or a combination of the program and instrumentation code. If instrumentation contains branches, the scheduler only processes the regions of straight-line code. The scheduler uses a common two-pass list scheduling algorithm [7]. The first pass starts at the end of the block and works backwards to compute the length (in cycles) of the dependence chain between every instruction and the end of the block. This computation only considers the stalls required between data dependent instructions.

The second pass starts at the beginning of the block and works forward, to order instructions with list scheduling. The instruction with the highest priority of any instruction that can be legally scheduled at this point is put next in the schedule. An instruction's priority is determined primarily by how few stalls it requires before it can start execution (as computed by pipeline_stalls). If two instructions require the same number of stalls, the instruction farthest from the end of the block, using the metric computed in the first pass, is scheduled first. If two instructions still have the same priority, the instruction listed earlier in the original code sequence is chosen, under the assumption that the instructions were previously scheduled. (A simplified C++ sketch of this two-pass loop appears below.)

When computing data dependencies in both passes, the scheduler conservatively assumes that loads and stores from the original code access the same address. We also assume that loads and stores in instrumentation code access the same address, which differs from the address accessed by original instructions. This permits instrumentation loads and stores, which typically do not conflict with the original loads and stores, more freedom of movement. Since some instrumentation's memory references are more constrained, there are options to limit the movement of instrumentation code.

4.1. Limitations on Hiding Instrumentation

On aggressive superscalar machines, one could hope that all instrumentation code could be hidden in unused pipeline stall cycles. Unfortunately, processor limitations, such as memory latency (a load on the hyperSPARC has a one cycle latency) and resource usage (stores on the hyperSPARC use the LSU for 2 cycles and loads use it for 1 cycle), limit the cycles in which to hide instrumentation.

A further problem is that in many programs, most basic blocks are short and so present few opportunities to hide instrumentation. Even when aggressively optimized, the SPEC95 integer benchmarks have an average dynamic block size of 2.9 instructions.

Finally, scheduling instrumentation does not reduce instruction (or data) cache misses caused by instrumentation, since the additional instructions increase the code size regardless of how few stalls the program incurs. Lebeck and Wood proposed a model for the instruction cache effects of program instrumentation, which reasonably accurately predicted that instrumentation that increases a program's size by a factor of E will increase cache misses by E x 4% [11]. Profiling increases a program's text size by a factor of 2-3. Fortunately, many programs have low instruction cache miss rates, so the increase is not significant.

4.2. Scheduling Profiling Instrumentation

We scheduled QPT2's slow profiling instrumentation [2], which adds 4 instructions (set immediate, load, add, and store) into most basic blocks in a program. This code can execute in 4 cycles on both SuperSPARC and UltraSPARC processors. Blocks with a single instrumented, single-exit predecessor or a single instrumented, single-entry successor are not instrumented.

The SuperSPARC experiments ran on a dual-processor Sun SPARCstation 20 equipped with 50MHz Sun SuperSPARC [17] processors, running Solaris 2.4. The UltraSPARC experiments ran on a 12-slot Ultra Enterprise 4000/5000 with 167MHz Sun UltraSPARC processors [16] running Solaris 2.5. The test programs were compiled with the Sun C and Fortran compilers (version 4.0), using the options "-fast -xO4 -xarch=v8 -xchip=super -xdepend -dn" and "-fast -xO4 -xarch=v8 -xchip=ultra -xdepend -dn" for the SuperSPARC and UltraSPARC, respectively.
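To make the two-pass algorithm described at the start of Section 4 concrete, the following self-contained C++ sketch performs the backward critical-path pass and the forward list-scheduling pass on a toy basic block. The Inst type, the fixed per-instruction latencies, and the one-instruction-per-cycle accounting are simplified stand-ins; EEL's scheduler instead consults pipeline_stalls (Appendix A) and full register and memory dependence information when computing an instruction's priority.

#include <algorithm>
#include <iostream>
#include <vector>

struct Inst {
    std::vector<int> preds;   // indices of earlier instructions this one depends on
    int latency;              // cycles before a dependent instruction may issue
};

// First pass: length (in cycles) of the dependence chain between every
// instruction and the end of the block.
std::vector<int> path_to_end(const std::vector<Inst>& block) {
    std::vector<int> dist(block.size(), 0);
    for (int i = (int)block.size() - 1; i >= 0; --i)
        for (int p : block[i].preds)
            dist[p] = std::max(dist[p], dist[i] + block[p].latency);
    return dist;
}

// Second pass: repeatedly pick, among instructions whose predecessors have
// been scheduled, the one needing the fewest stalls; break ties by the
// first-pass distance, then (implicitly) by original order.
std::vector<int> list_schedule(const std::vector<Inst>& block) {
    std::vector<int> dist = path_to_end(block);
    std::vector<char> done(block.size(), 0);
    std::vector<int> ready_cycle(block.size(), 0);   // stand-in for pipeline_stalls
    std::vector<int> order;
    int cycle = 0;
    while (order.size() < block.size()) {
        int best = -1;
        for (int i = 0; i < (int)block.size(); ++i) {
            if (done[i]) continue;
            bool ready = true;
            for (int p : block[i].preds) if (!done[p]) { ready = false; break; }
            if (!ready) continue;
            if (best < 0) { best = i; continue; }
            int si = std::max(0, ready_cycle[i] - cycle);
            int sb = std::max(0, ready_cycle[best] - cycle);
            // Replace only on a strict improvement, so full ties keep the
            // instruction that appeared earlier in the original sequence.
            if (si < sb || (si == sb && dist[i] > dist[best])) best = i;
        }
        int issue = std::max(cycle, ready_cycle[best]);   // stalls = issue - cycle
        done[best] = 1;
        order.push_back(best);
        for (int j = 0; j < (int)block.size(); ++j)
            for (int p : block[j].preds)
                if (p == best)
                    ready_cycle[j] = std::max(ready_cycle[j], issue + block[best].latency);
        cycle = issue + 1;                                // one instruction per cycle, for simplicity
    }
    return order;
}

int main() {
    // A tiny block: a load feeding an add feeding a store, plus one
    // independent instruction (e.g., instrumentation) that the scheduler
    // can slot into the load's stall cycle.
    std::vector<Inst> block = {
        { {},  2 },   // 0: load (hypothetical 2-cycle latency)
        { {0}, 1 },   // 1: add, depends on the load
        { {},  1 },   // 2: independent instruction
        { {1}, 1 },   // 3: store, depends on the add
    };
    for (int i : list_schedule(block)) std::cout << i << ' ';
    std::cout << '\n';                                    // prints: 0 2 1 3
    return 0;
}

In the toy block, the independent instruction is moved into the load's stall cycle, which is exactly the effect that hides instrumentation overhead.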

Table 1: Slow profiling instrumentation on the UltraSPARC. Avg. BB Size is the (dynamic) average basic block size (instructions). Uninst. Time is a program's un-instrumented execution time (seconds). Inst. Time is a program's instrumented, but unscheduled, execution time. The number in parentheses is the ratio to the un-instrumented time. Sched. Time is the instrumented time after scheduling. Finally, % Hidden is the fraction of instrumentation overhead hidden by scheduling.

Table 2: Slow profiling instrumentation on the UltraSPARC, with original instructions first rescheduled by EEL.

Table 3: Slow profiling instrumentation on the SuperSPARC.

We did not use the compiler options that generate UltraSPARC-specific code, since our instruction scheduler is currently configured for the SPARC version 8 instruction set. In all cases, we ran the programs using the scripts that come with the SPEC95 benchmarks, specifying the longest running ("ref") inputs.

Table 1 contains measurements for the UltraSPARC. Scheduling hides an average of 15% of slow profiling overhead for integer benchmarks, and 17% for floating point benchmarks. The integer programs execute many small basic blocks (an average of 2.9 instructions per block), so there is little opportunity to schedule added instrumentation among the original program instructions. This is compounded by the fact that for purely integer codes, the UltraSPARC can launch at most two instructions in parallel, instead of its maximum of four instructions per cycle.

Neither of these problems should affect floating point programs, so it is at first surprising that they performed no better than the integer benchmarks. In fact, scheduling can hide a greater percentage of the instrumentation overhead for floating point programs, but performance is lost due to other factors. EEL's instruction scheduler is quite simple, as it only schedules within basic blocks using a simple heuristic algorithm. It does not perform as well as the optimizers in the Sun C and Fortran compilers that compiled the benchmarks. Since EEL schedules original program instructions as well as added instrumentation, it can produce a worse schedule for the original instructions while trying to hide the added overhead. Hence, the benefit of hiding overhead is lost to de-scheduling in the floating point benchmarks, with their highly optimized, long basic blocks.

To factor out the effect of EEL's scheduler, we performed a variation of the previous experiment on the UltraSPARC. First, EEL's scheduler rescheduled the benchmarks, without adding any instrumentation. Then slow profiling instrumentation was added, but not scheduled among the original instructions. Finally, both the added instrumentation and original instructions were scheduled together. Table 2 contains the results of this experiment. The results for integer benchmarks are similar to before (an average of 13% hidden), but the results for floating point benchmarks are better. Scheduling now hides 27% of instrumentation overhead for the floating point benchmarks, with no significant outliers.

Table 3 contains measurements for benchmarks compiled for and run on the SuperSPARC, and the results are similar to those obtained for the UltraSPARC. Scheduling hides an average of 11% of profiling overhead for the integer benchmarks, and 44% of the overhead for floating point benchmarks.

5. Conclusion

This paper investigated the benefits of combining instruction scheduling with executable editing. Measurements on the SPEC95 benchmarks show that on a 3-way superscalar machine, scheduling can hide an average of 11-44% of the overhead introduced by program profiling instrumentation. On a more modern, 4-way superscalar, the scheduler can hide an average of 16-17% of the profiling overhead. A sub-optimal scheduling algorithm limits the visible improvement from scheduling instrumentation, since it may reschedule original program instructions poorly.

Already, the benefits of scheduling program instrumentation are clear enough that existing and future instrumentation systems should adopt this simple technique to reduce instrumentation overhead. In addition, this approach promises to help reduce the cost of error checking, such as array bounds or null pointer tests, to a level at which it may routinely be included in production code.

Acknowledgments

Many thanks to Gurindar Sohi for the use of the UltraSPARC machine used to collect numbers for an early version of this paper, and to Bob Roessler for arranging the loan of the machine used in our initial UltraSPARC experiments. Mark Hill suggested that the scheduler model the L2 cache in the UltraSPARC. Vinod Grover and Kurt Goebel provided helpful comments on a draft of this paper.

This work is supported in part by Wright Laboratory Avionics Directorate, Air Force Material Command, USAF, under grant #F33615-94-1-1525 and ARPA order no. B550, an NSF NYI Award CCR-9357779, NSF Grants CCR-9101035 and MIP-9225097, DOE Grant DE-FG02-93ER25176, and donations from Digital Equipment Corporation and . The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Wright Laboratory Avionics Directorate or the U.S. Government.

References

[1] Ali-Reza Adl-Tabatabai, Geoff Langdale, Steven Lucco, and Robert Wahbe. Efficient and Language-Independent Mobile Programs. In Proceedings of the SIGPLAN '96 Conference on Programming Language Design and Implementation (PLDI), pages 127-136, May 1996.

[2] Thomas Ball and James R. Larus. Optimally Profiling and Tracing Programs. ACM Transactions on Programming Languages and Systems (TOPLAS), 16(4):1319-1360, July 1994.

[3] Zarka Cvetanovic and Dileep Bhandarkar. Characterization of the Alpha AXP Performance Using TP and SPEC Workloads. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 60-70, April 1994.

[4] Trung A. Diep, Christopher Nelson, and John Paul Shen. Performance Evaluation of the PowerPC 620 Microarchitecture. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 163-174, June 1995.

[5] Linley Gwennap. Intel's P6 Uses Decoupled Superscalar Design. Microprocessor Report, 9(2):9-15, February 16, 1995.

[6] John C. Gyllenhaal. A Machine Description Language for Compilation. Master's thesis, Department of Electrical Engineering, University of Illinois, Urbana, IL, September 1994.

[7] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 1990.

[8] Daniel R. Kerns and Susan J. Eggers. Balanced Scheduling: Instruction Scheduling When Memory Latency is Uncertain. In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation (PLDI), pages 278-289, June 1993.

[9] James R. Larus. Efficient Program Tracing. IEEE Computer, 26(5):52-61, May 1993.

[10] James R. Larus and Eric Schnarr. EEL: Machine-Independent Executable Editing. In Proceedings of the SIGPLAN '95 Conference on Programming Language Design and Implementation (PLDI), pages 291-300, June 1995.

[11] Alvin R. Lebeck and David A. Wood. Active Memory: A New Abstraction for Memory-System Simulation. In Proceedings of the 1995 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 220-230, May 1995.

[12] Harish Patil and Charles Fischer. Efficient Run-time Monitoring Using Shadow Processing. In 2nd International Workshop on Automated and Algorithmic Debugging (AADEBUG '95), St. Malo, France, May 1995.

[13] Todd A. Proebsting and Christopher W. Fraser. Detecting Pipeline Structural Hazards Quickly. In Conference Record of the Twenty-First Annual ACM Symposium on Principles of Programming Languages, pages 280-286, Portland, Oregon, January 1994.

[14] ROSS Technology, Inc. SPARC RISC User's Guide: hyperSPARC Edition, September 1993.

[15] Michael A. Schuette. Exploitation of Instruction-Level Parallelism for Detection of Processor Execution Errors. Ph.D. thesis, Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, January 1991.

[16] Sun Microsystems, Inc. UltraSPARC-I User's Manual, August 1995.

[17] Texas Instruments. SuperSPARC User's Guide, October 1993.

Appendix A: Function pipeline_stalls

unsigned long pipeline_stalls(unsigned long cycle,   // cycle when mi starts executing
                              UnitValues &state,     // current pipeline state
                              const mach_inst* mi)   // next instruction
{
    unsigned long stalls = 0;
    {{INST mi CATEGORY any::                 // All Spawn annotations now refer to instruction mi.
    unsigned long gid = {{GROUP}};           // mi's timing group
    long ii;

    // trace[] records the resources used by this instruction in the current cycle.
    unsigned long trace[{{UNITS COUNT}}];
    for(ii=0; ii<{{UNITS COUNT}}; ++ii) trace[ii] = 0;

    // Search for stalls
    unsigned long mi_cycle = 0;              // current cycle in mi's pipeline
    while(mi_cycle <= {{GRP gid CYCLES}}) {
        // units[] records the number of unused resources in this cycle
        // after allocating resources for all previous instructions.
        unsigned long* units = state[cycle];
        bool advance = true;

        // Test for structural hazards.
        if(advance)
            for(ii=0; ii<{{GRP gid ACQUIRE mi_cycle COUNT}}; ++ii) {
                unsigned long unit_val =
                    units[{{GRP gid ACQUIRE mi_cycle UNIT ii}}]
                    - trace[{{GRP gid ACQUIRE mi_cycle UNIT ii}}];
                if(unit_val < {{GRP gid ACQUIRE mi_cycle NUM ii}}) {
                    advance = false;
                    break;
                }
            }

        // Test for RAW hazards.
        if(advance)
            for(ii=0; ii<{{R READ COUNT}}; ++ii)
                if({{R READ ii TIME}} == mi_cycle
                   && cycle < state.write_cy[0][{{R READ ii}}]) {
                    advance = false;
                    break;
                }

        // ... Similar tests for WAR and WAW hazards.

        ++cycle;    // Advance the execution pipeline for previously scheduled instructions.

        // Advance instruction pipeline or record the stall
        if(advance) {
            for(ii=0; ii<{{GRP gid ACQUIRE mi_cycle COUNT}}; ++ii)
                trace[{{GRP gid ACQUIRE mi_cycle UNIT ii}}] +=
                    {{GRP gid ACQUIRE mi_cycle NUM ii}};
            ++mi_cycle;
            for(ii=0; ii<{{GRP gid RELEASE mi_cycle COUNT}}; ++ii)
                trace[{{GRP gid RELEASE mi_cycle UNIT ii}}] -=
                    {{GRP gid RELEASE mi_cycle NUM ii}};
        } else
            ++stalls;
    }
    ;;}}
    return stalls;
}
