Instruction Scheduling and Executable Editing

by Eric Schnarr and James R. Larus
Computer Sciences Department, University of Wisconsin-Madison
1210 West Dayton Street, Madison, WI 53706 USA
(608) 262-9519
{schnarr,larus}@cs.wisc.edu

Abstract

Modern microprocessors offer more instruction-level parallelism than most programs and compilers can currently exploit. The resulting disparity between a machine's peak and actual performance, while frustrating for computer architects and chip manufacturers, opens the exciting possibility of low-cost instrumentation for measurement, simulation, or emulation. Instrumentation code that executes in previously unused cycles is effectively hidden. On two superscalar SPARC processors, a simple, local scheduler hid an average of 13% of the overhead cost of profiling instrumentation in the SPECINT benchmarks and an average of 33% of the profiling cost in the SPECFP benchmarks.

1. Introduction

Modern microprocessors offer more instruction-level parallelism than most programs and compilers can currently exploit. For example, the most recent generation of RISC chips (the Digital Alpha 21164, IBM/Motorola PowerPC 620, SGI R10000, and Sun UltraSPARC) are 4-way superscalar processors that execute up to four independent instructions in a single cycle. Even recent high-end x86 processors (the Intel Pentium Pro and AMD K5) are 3-way superscalars. Unfortunately, exploited instruction-level parallelism lags far behind the hardware capacity. Cvetanovic and Bhandarkar found that programs running on a 2-way superscalar Digital Alpha 21064 could dual issue only 26-50% of their instructions, which means that in 67-90% of cycles, only one instruction executed [3]. Similarly, Diep et al. found that on a 4-way superscalar PowerPC 620, four integer SPEC benchmarks completed an average of 1.05-1.25 instructions per cycle and three floating-point SPEC benchmarks completed an average of 1.0-1.9 instructions per cycle [4]. This large disparity between a machine's peak and actual performance, while frustrating for computer architects and chip manufacturers, opens the exciting possibility of low-cost or even no-cost instrumentation for measurement, simulation, or emulation. Scheduling reduces instrumentation's perceived cost by putting instrumentation instructions in previously unused processor cycles. Instrumentation code that executes in unused cycles is effectively hidden.

Program instrumentation has been used for many purposes, including performance measurement, computer architecture simulations, and software fault isolation and error detection [1]. Although direct instrumentation typically incurs lower cost than alternative approaches, the increased program running time caused by instrumentation is occasionally a limiting factor and is always an annoyance. At one extreme, in parallel or real-time systems, instrumentation overhead can perturb system behavior by introducing time dilation. Even for less demanding applications, the cost of program profiling or error detection is currently too high for production codes. Previous instrumentation systems expended considerable effort to formulate efficient instrumentation code sequences [11], place them parsimoniously [2], and insert instrumentation without affecting a program's behavior [9]. However, few systems attempted to exploit instruction-level parallelism by scheduling the instrumentation code into a program.

This paper describes a simple instruction scheduler that we added to the EEL executable editing library [10]. We applied the scheduler to a common program instrumentation, which profiles basic block execution frequencies using a four-instruction sequence inserted into almost every basic block.

On the 3-way superscalar SuperSPARC processor, scheduling hides an average of 11% of the SPECINT overhead and an average of 44% of the SPECFP overhead. On the 4-way UltraSPARC, scheduling hides an average of 15% of overhead for the SPECINT benchmarks, but only 17% of overhead for the SPECFP benchmarks. This is because EEL produces a worse schedule than was originally found in these highly optimized programs. Factoring out the differences in scheduling algorithms, we find that the integer benchmarks remain about the same (13%), while scheduling profile instrumentation in the SPECFP benchmarks hides 27%. In the future, these results may improve, and scheduling may become even more attractive, with a more accurate and aggressive instrumentation scheduler and wider microarchitectures that offer further opportunities to hide instrumentation.

To implement instruction scheduling, we extended EEL with the specific details of a processor's microarchitecture. EEL records this and other architectural information using a concise, high-level specification describing the machine's instruction set architecture. A tool called Spawn [10] translates these specifications into executable C++ code, which becomes the part of EEL that decodes and interprets machine instructions. As part of this work, we extended the architecture specification language to capture salient features of a machine's microarchitecture and used this information to drive an instruction scheduler in EEL. This paper first describes this language extension and how EEL uses the information to schedule instructions. Then some timing measurements are given for scheduling instrumentation into the SPEC95 benchmarks.

The paper is organized as follows. Section 2 describes related work. Section 3 presents the extensions to Spawn and shows how they represent the information necessary for instruction scheduling. Section 4 describes our experiments on scheduling program instrumentation and presents the results. Finally, Section 5 concludes.

2. Related Work

This paper extends several areas of previous work. Patil and Fischer used a different form of parallelism to hide instrumentation overhead. On a multiprocessor, they ran the code to check pointer and array accesses on a second processor [12]. The additional processor reduced the perceived overhead of error checking by 2-55%. Our approach is more widely applicable, since instruction-level parallel processors are, or soon will be, ubiquitous and because instruction-level parallelism permits a tighter coupling between program and instrumentation.

Proebsting and Fraser described an efficient algorithm for detecting structural hazards in single pipeline processors and a language for concisely describing these machines [13]. Our description language is more verbose, but it also captures the syntax and semantics of machine instructions and scheduling constraints for superscalar machines. Our algorithm is also more general, if less efficient, since it works for superscalar processors as well as single pipeline machines.

Gyllenhaal described the machine description language (HMDES) used in the Illinois IMPACT compiler [6]. This language, like Spawn's description language, describes instruction encoding and scheduling constraints. HMDES describes instructions from several perspectives, which together provide the basic information needed by instruction schedulers: instruction latencies and resource usage. Spawn represents this information more concisely as a single semantic expression for a group of instructions, which describes when these instructions acquire and release microarchitectural resources. Spawn descriptions also capture instruction semantics and binary instruction encodings.

Schuette detected processor errors using unused instruction slots on a VLIW processor [15]. His control-flow monitoring technique inserts check operations into unfilled VLIW instructions. This is similar to EEL's rescheduling of instrumentation, except that EEL also reschedules the original instructions and can handle many different forms of instrumentation. Because a VLIW has fixed size instructions, Schuette's instrumentation did not increase code size, whereas on a superscalar machine, instrumentation code has a secondary effect of reducing performance by increasing instruction cache misses.

3. Spawn Extensions

Any executable editing tool depends on an accurate description of a machine's instruction set architecture. While most parts of EEL are machine independent, it still must disassemble, analyze, and modify binary machine instructions. Past experiences with executable editing tools demonstrated that the code that manipulates binary instructions is simple, but tedious and difficult to debug. For example, in the profiling and tracing tool qpt [9], over 2,000 lines of hand-written C code exist to manipulate binary instructions. Subtle errors in this code were difficult to detect by inspection and often lay undiscovered for months before a new input executable exercised them.

To remedy this problem and make EEL more easily portable across different processors, it uses a concise description of a machine's instruction set architecture written in a high-level architecture description language. These descriptions are short (the SPARC is 333 lines) and similar in form to the descriptions often found in architecture manuals, which makes them easier to validate.

Errors overlooked in a code review are also more likely to arise during testing, since a single description of an instruction's encoding and semantics underlies many different EEL instruction manipulation functions and different instructions often share common code in the description file.

Spawn [10] is the tool that converts an architecture description into the C++ code used by EEL. Spawn analyzes descriptions written in SADL (Spawn Architecture Description Language) to detect errors and extract the syntactic and semantic information needed by EEL. Spawn then reads an annotated C/C++ file and replaces its annotations by appropriate code produced using information extracted from the description. The resulting C/C++ file is compiled and linked into EEL to provide efficient functions for manipulating binary instructions.

Instruction scheduling requires more detailed information than the architectural descriptions that sufficed previously for EEL. In particular, scheduling requires a model of a machine's microarchitecture, especially its execution pipeline. Since a microarchitecture is specific to each particular processor implementation, this level of detail entails writing many more descriptions, so each description should be concise and easy to modify. To support instruction scheduling, we extended SADL to include pipeline timing and resource utilization information. This new information can be combined with the instruction semantic description to provide a complete map of an instruction's actions as it moves through a processor's execution pipeline.

Ideally, one would like to separate a microarchitecture-specific description from a general architecture description. SADL only partially accomplishes this goal by supporting functions that can encapsulate some of the timing and resource usage characteristics. The coupling of architectural and resource allocation information is necessary to permit Spawn to determine not only which registers are read from and written to, but also when values are read and when result values become available. On the other hand, our experience modeling the hyperSPARC, SuperSPARC, and UltraSPARC processors showed that the architecture semantics are easily carried over into a description of different microarchitectures, despite large changes in the pipeline description.

A (micro)architecture description, written in SADL, describes several aspects of a machine's architecture: instruction encodings, instruction semantics, architectural registers, and pipeline resources. The first three aspects (encodings, semantics, and registers) are described elsewhere [10]. Section 3.1 describes how pipeline resources and timing operators are combined with instruction semantics to encode details of an execution pipeline. Section 3.2 describes how an instruction scheduler uses this information to predict pipeline behavior.

3.1. Describing an Execution Pipeline

The example in figure 2 shows how the architectural semantics and microarchitectural timing and resource usage might be described for three ALU instructions on a ROSS hyperSPARC processor [14]. It starts by defining all the pipeline resources needed to model the timing behavior of the instructions. Second, the architectural registers are defined, along with aliases used to restrict the register types and incorporate resource usage information. Finally, the instruction semantics are defined along with their resource usage and timing characteristics.

[Figure 1 diagram, showing EEL, libbfd, the Architecture Description, and the Machine-Dependent .C.spawn file.]
Figure 1: Spawn translates an architecture description written in SADL and an annotated C/C++ file into executable C++ code, which becomes the part of EEL that decodes and interprets machine instructions.

// *** Define processor resources ***

// 2-way superscalar
unit Group 2

// flags for dual/single issue
val multi  is AR Group, ()
val single is AR Group 2, ()

// ALU and LSU capacities
unit ALU 1, ALUr 2, ALUw 1
unit LSU 1, LSUr 2, LSUw 1

// *** Define registers ***

// general purpose registers (GPR)
register untyped(32) R[32]

// Aliases for ALU read/write from GPR
alias signed(32) R4r[i] is AR ALUr, R[i]
alias signed(32) R4w[i] is AR ALUw, R[i]

// *** Define instructions ***

// Defining operators
val [ + - & | ^ ]
  is (\op.\a.\b.
      A ALU, x:=op a b, D 1, R ALU, x)
  @ [ add32 sub32 and32 or32 xor32 ]

// Defining shift operators
val [ << >>> >> ]
  is (\op.\a.\b.
      A ALU, is_shift, x:=op a b,
      D 1, R ALU, x)
  @ [ sll32 srl32 sra32 ]

// Get the second source operand:
// a register or immediate value
val src2
  is iflag=1 ? #simm13 : R4r[rs2]

// Semantic description of the
// instructions add, sub, and sra
sem [ add sub sra ]
  is (\op. multi,
      D 1, s1:=R4r[rs1], s2:=src2,
      R4w[rd] := op s1 s2)
  @ [ + - >> ]

Figure 2: Semantic description of the SPARC instructions add, sub, and sra. The resource usage and timing information are for the ROSS hyperSPARC microarchitecture.

Microarchitecture resources, such as register ports and ALUs, are described by a unit name and a value representing the number of copies of this resource in the processor. The ROSS hyperSPARC [14] described by this example is a dual-issue superscalar processor, with an ALU for arithmetic operations and an ALU for memory address calculations (LSU). The ALUr and ALUw units represent read and write ports to the register file for the arithmetic ALU. Similarly, the LSUr and LSUw units are ports used by the LSU. The values multi and single are used later to indicate if an instruction can be dual issued, or if it must execute by itself.

Architectural registers are described using the register declaration, which specifies the type and number of registers in the register file. This example declares a register file for the SPARC integer registers, where there are 32 registers, each 32 bits wide. Aliases provide alternative ways to access the register file. They allow the registers to be accessed as an alternative type, and allow arbitrary SADL expressions to be combined with the register access. In this example, aliases are used to access the registers as 32-bit signed integers, and to specify the use of either an ALU read port or an ALU write port. The types are necessary later on in the description to disambiguate overloaded values and operators.

Description of the pipeline behavior of an instruction is combined with the instruction's semantic description. The SADL commands A, R, AR, and D are used to describe when units are acquired and released and when the pipeline advances. The command A <unit> [<num>] acquires <num> copies of the unit, or stalls the pipeline if not enough copies of the unit are available. If <num> is omitted, it is assumed to be 1. R <unit> [<num>] releases <num> copies of a unit. The command AR <unit> [<num>] [<cycles>] acts like the A command, but it also releases the <num> copies of the same unit after the instruction executes for <cycles> cycles. AR is handy for acquiring a resource for a fixed amount of time, without having to do explicit R's later in the semantic expression. The command D [<num>] advances the pipeline by <num> cycles. In each cycle of an instruction's execution, Spawn applies all unit release events for the cycle first, followed by all the unit acquire events. Extra delays are inserted if there are not enough free resources to accommodate all the unit acquire events.

An instruction scheduling algorithm must also know when registers are read from and written to. When an instruction reads a register, Spawn records in which cycle the read occurs. Writes are more difficult, since most pipelined implementations forward values between instructions [7].

When a value is computed, Spawn remembers the cycle in which the computation finished. When this value is written back into a register, Spawn records the cycle in which the value was computed, not when the register assignment took place. Therefore, in SADL, an instruction's result value must be computed in the cycle immediately preceding the cycle in which the value becomes available to the next instruction.

The example in figure 2 continues by describing the semantics and pipeline behavior of the instructions add, sub, and sra. This is accomplished using val declarations, which act like macros, and sem declarations, which bind semantic expressions to instruction mnemonics. The effect of the sem statement in this example is to bind the instructions add, sub, and sra to semantic expressions which: (1) restrict the pipeline to at most 2 simultaneous instructions and advance the pipeline by one cycle; (2) acquire one or two ALU read ports and read the register values; (3) acquire the ALU functional unit and compute the instruction's result value; (4) advance the pipeline by one cycle and release the ALU read ports and the ALU functional unit; (5) acquire an ALU write port and update the destination register; and (6) advance the pipeline by one cycle and release the ALU write port. From this description, Spawn infers that these instructions can be dual issued, execute in 3 cycles, read their operands in cycle 1, produce a value at the end of cycle 1 that subsequent instructions can use, and update the register file in cycle 2. Note that an extra cycle was put before reading the registers, since the sethi instruction produces a value which is available at the end of cycle 0 and can be used by another instruction issued in the same cycle.

Given a machine description written in SADL, Spawn analyzes the instruction encodings, semantics, and timing. Instructions with identical timing and resource allocation patterns are grouped together to save space in the generated C++ code. Each group records the number of cycles it takes for a member instruction to pass completely through the pipeline, the resources acquired in each cycle, and the resources released in each cycle. Spawn also associates a cycle number with every access to the integer or floating point register file, for both reads and writes. For reads, this number indicates the pipeline execution cycle when the read occurs. For writes, it indicates when the written value actually becomes available to subsequent instructions, not when the value is written to the register file.

3.2. Predicting Pipeline Behavior

The key metric used by EEL's instruction scheduler is the number of cycles that the next instruction must wait before entering the execution pipeline. SADL describes the resource usage and timing for individual instructions. This information can describe the execution behavior of many superscalar processors executing a straight-line sequence of instructions.

Spawn passes this information to an instruction scheduler by filling in annotations in the C++ function pipeline_stalls, which, given a sequence of instructions, computes when the next instruction can start execution (see Appendix A for an overview of this function). This function starts with a representation of the microarchitecture pipeline state generated by previous instructions in the instruction sequence.
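To make this concrete, here is a minimal, self-contained C++ sketch of the kind of per-group timing table and inter-instruction pipeline state that the generated code could supply to pipeline_stalls. The type and field names (TimingGroup, PipelineState, and their members) are hypothetical, not EEL's or Spawn's actual data structures; the numbers follow the hyperSPARC inferences of Section 3.1 for the add/sub/sra group: dual issue, 3 cycles in the pipeline, operands read in cycle 1, a forwardable result at the end of cycle 1, and a register file update in cycle 2.

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical layout of one Spawn timing group: how long a member
// instruction stays in the pipeline, which units it acquires and releases
// in each cycle, and the cycle numbers associated with register accesses.
struct TimingGroup {
    unsigned cycles;                                        // cycles to traverse the pipeline
    std::vector<std::map<std::string, unsigned>> acquire;   // units acquired, per cycle
    std::vector<std::map<std::string, unsigned>> release;   // units released, per cycle
    unsigned operand_read_cycle;                            // when source registers are read
    unsigned result_ready_cycle;                            // when the result can be forwarded
    unsigned regfile_write_cycle;                           // when the register file is updated
};

// Hypothetical pipeline state kept between instructions: the last cycle in
// which each register was read, when each register's value becomes
// available, and the free copies of each unit in upcoming cycles.
struct PipelineState {
    std::map<unsigned, unsigned long> last_read_cycle;
    std::map<unsigned, unsigned long> value_ready_cycle;
    std::vector<std::map<std::string, unsigned>> free_units;
};

int main() {
    TimingGroup alu_group;
    alu_group.cycles = 3;
    alu_group.acquire = {
        { {"Group", 1} },                 // cycle 0: one of the two issue slots
        { {"ALUr", 2}, {"ALU", 1} },      // cycle 1: register read ports and the ALU
        { {"ALUw", 1} },                  // cycle 2: register write port
        { }                               // cycle 3: nothing further acquired
    };
    alu_group.release = {
        { },                              // cycle 0
        { },                              // cycle 1 (issue-slot release omitted in this sketch)
        { {"ALUr", 2}, {"ALU", 1} },      // cycle 2: read ports and ALU freed
        { {"ALUw", 1} }                   // cycle 3: write port freed
    };
    alu_group.operand_read_cycle = 1;
    alu_group.result_ready_cycle = 1;     // forwardable at the end of cycle 1
    alu_group.regfile_write_cycle = 2;

    std::cout << "add/sub/sra occupy the pipeline for "
              << alu_group.cycles << " cycles\n";
    return 0;
}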

[Figure 3 diagram, showing the original Executable and the Edited Executable.]

Figure 3: EEL schedules instrumentation by first analyzing the executable and allowing the application specific code (e.g., the program profiling tool) to select and place instrumentation. A new executable is then generated, which includes the instrumentation instructions. Scheduling is performed on each basic block as it is laid out in the new executable, causing the original and new instructions to be scheduled together.

The pipeline state includes history information, such as the last cycle in which each register was read and written and which units are currently acquired by previous instructions. pipeline_stalls uses this state information as it simulates the pipeline execution of the new instruction and computes how many stall cycles are needed to satisfy RAW, WAR, and WAW dependencies and structural hazards.

The Spawn microarchitecture models can describe only a subset of all the factors affecting the integer and floating point pipelines. Our goal has been to describe actual machines rather than hypothetical systems, so for simplicity, the Spawn descriptions only model the execution pipelines themselves. The descriptions contain no information about a processor's memory interface, instruction prefetching, write buffering, or instruction and data cache behavior. pipeline_stalls does not compute stalls due to these mechanisms. On the other hand, few scheduling algorithms take these features into account, since their behavior is data-dependent (cf. [8]). In addition, SADL does not yet describe out-of-order execution, since it was not needed for the descriptions produced so far.

4. Scheduling Instrumentation

EEL schedules instructions in a basic block (local scheduling). The instructions in the block come either from the original program (for rescheduling) or a combination of the program and instrumentation code. If instrumentation contains branches, the scheduler only processes the regions of straight-line code. The scheduler uses a common two-pass list scheduling algorithm [7]. The first pass starts at the end of the block and works backwards to compute the length (in cycles) of the dependence chain between every instruction and the end of the block. This computation only considers the stalls required between data dependent instructions.

The second pass starts at the beginning of the block and works forward, to order instructions with list scheduling. The instruction with the highest priority of any instruction that can be legally scheduled at this point is put next in the schedule. An instruction's priority is determined primarily by how few stalls it requires before it can start execution (as computed by pipeline_stalls). If two instructions require the same number of stalls, the instruction farthest from the end of the block, using the metric computed in the first pass, is scheduled first. If two instructions still have the same priority, the instruction listed earlier in the original code sequence is chosen, under the assumption that the instructions were previously scheduled. (A simplified C++ sketch of this two-pass loop appears below.)

When computing data dependencies in both passes, the scheduler conservatively assumes that loads and stores from the original code access the same address. We also assume that loads and stores in instrumentation code access the same address, which differs from the address accessed by original instructions. This permits instrumentation loads and stores, which typically do not conflict with the original loads and stores, more freedom of movement. Since some instrumentation's memory references are more constrained, there are options to limit the movement of instrumentation code.

4.1. Limitations on Hiding Instrumentation

On aggressive superscalar machines, one could hope that all instrumentation code could be hidden in unused pipeline stall cycles. Unfortunately, processor limitations, such as memory latency (a load on the hyperSPARC has a one cycle latency) and resource usage (stores on the hyperSPARC use the LSU for 2 cycles and loads use it for 1 cycle), limit the cycles in which to hide instrumentation.

A further problem is that in many programs, most basic blocks are short and so present few opportunities to hide instrumentation. Even when aggressively optimized, the SPEC95 integer benchmarks have an average dynamic block size of 2.9 instructions.

Finally, scheduling instrumentation does not reduce instruction (or data) cache misses caused by instrumentation, since the additional instructions increase the code size regardless of how few stalls the program incurs. Lebeck and Wood proposed a model for the instruction cache effects of program instrumentation, which reasonably accurately predicted that instrumentation that increases a program's size by a factor of E will increase cache misses by E x 4% [11]. Profiling increases a program's text size by a factor of 2-3. Fortunately, many programs have low instruction cache miss rates, so the increase is not significant.

4.2. Scheduling Profiling Instrumentation

We scheduled QPT2's slow profiling instrumentation [2], which adds 4 instructions (set immediate, load, add, and store) into most basic blocks in a program. This code can execute in 4 cycles on both SuperSPARC and UltraSPARC processors. Blocks with a single instrumented, single-exit predecessor or a single instrumented, single-entry successor are not instrumented.

The SuperSPARC experiments ran on a dual-processor Sun SPARCstation 20 equipped with 50MHz Sun SuperSPARC [17] processors, running Solaris 2.4. The UltraSPARC experiments ran on a 12-slot Ultra Enterprise 4000/5000 with 167MHz Sun UltraSPARC processors [16] running Solaris 2.5. The test programs were compiled with the Sun C and Fortran compilers (version 4.0), using the options "-fast -xO4 -xarch=v8 -xchip=super -xdepend -dn" and "-fast -xO4 -xarch=v8 -xchip=ultra -xdepend -dn" for the SuperSPARC and UltraSPARC, respectively.
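To make the two-pass algorithm described at the start of Section 4 concrete, the following self-contained C++ sketch performs the backward critical-path pass and the forward list-scheduling pass on a toy basic block. The Inst type, the fixed per-instruction latencies, and the one-instruction-per-cycle accounting are simplified stand-ins; EEL's scheduler instead consults pipeline_stalls (Appendix A) and full register and memory dependence information when computing an instruction's priority.

#include <algorithm>
#include <iostream>
#include <vector>

struct Inst {
    std::vector<int> preds;   // indices of earlier instructions this one depends on
    int latency;              // cycles before a dependent instruction may issue
};

// First pass: length (in cycles) of the dependence chain between every
// instruction and the end of the block.
std::vector<int> path_to_end(const std::vector<Inst>& block) {
    std::vector<int> dist(block.size(), 0);
    for (int i = (int)block.size() - 1; i >= 0; --i)
        for (int p : block[i].preds)
            dist[p] = std::max(dist[p], dist[i] + block[p].latency);
    return dist;
}

// Second pass: repeatedly pick, among instructions whose predecessors have
// been scheduled, the one needing the fewest stalls; break ties by the
// first-pass distance, then (implicitly) by original order.
std::vector<int> list_schedule(const std::vector<Inst>& block) {
    std::vector<int> dist = path_to_end(block);
    std::vector<char> done(block.size(), 0);
    std::vector<int> ready_cycle(block.size(), 0);   // stand-in for pipeline_stalls
    std::vector<int> order;
    int cycle = 0;
    while (order.size() < block.size()) {
        int best = -1;
        for (int i = 0; i < (int)block.size(); ++i) {
            if (done[i]) continue;
            bool ready = true;
            for (int p : block[i].preds) if (!done[p]) { ready = false; break; }
            if (!ready) continue;
            if (best < 0) { best = i; continue; }
            int si = std::max(0, ready_cycle[i] - cycle);
            int sb = std::max(0, ready_cycle[best] - cycle);
            // Replace only on a strict improvement, so full ties keep the
            // instruction that appeared earlier in the original sequence.
            if (si < sb || (si == sb && dist[i] > dist[best])) best = i;
        }
        int issue = std::max(cycle, ready_cycle[best]);   // stalls = issue - cycle
        done[best] = 1;
        order.push_back(best);
        for (int j = 0; j < (int)block.size(); ++j)
            for (int p : block[j].preds)
                if (p == best)
                    ready_cycle[j] = std::max(ready_cycle[j], issue + block[best].latency);
        cycle = issue + 1;                                // one instruction per cycle, for simplicity
    }
    return order;
}

int main() {
    // A tiny block: a load feeding an add feeding a store, plus one
    // independent instruction (e.g., instrumentation) that the scheduler
    // can slot into the load's stall cycle.
    std::vector<Inst> block = {
        { {},  2 },   // 0: load (hypothetical 2-cycle latency)
        { {0}, 1 },   // 1: add, depends on the load
        { {},  1 },   // 2: independent instruction
        { {1}, 1 },   // 3: store, depends on the add
    };
    for (int i : list_schedule(block)) std::cout << i << ' ';
    std::cout << '\n';                                    // prints: 0 2 1 3
    return 0;
}

In the toy block, the independent instruction is moved into the load's stall cycle, which is exactly the effect that hides instrumentation overhead.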

Table 1: Slow profiling instrumentation on the UltraSPARC. Avg. BB Size is the (dynamic) average basic block size (instructions). Uninst. Time is a program's un-instrumented execution time (seconds). Inst. Time is a program's instrumented, but unscheduled, execution time. The number in parentheses is the ratio to the un-instrumented time. Sched. Time is the instrumented time after scheduling. Finally, % Hidden is the fraction of instrumentation overhead hidden by scheduling.

Table 2: Slow profiling instrumentation on the UltraSPARC, with original instructions first rescheduled by EEL.

Table 3: Slow profiling instrumentation on the SuperSPARC.

We did not use the compiler options that generate UltraSPARC-specific code, since our instruction scheduler is currently configured for the SPARC version 8 instruction set. In all cases, we ran the programs using the scripts that come with the SPEC95 benchmarks, specifying the longest running ("ref") inputs.

Table 1 contains measurements for the UltraSPARC. Scheduling hides an average of 15% of slow profiling overhead for integer benchmarks, and 17% for floating point benchmarks. The integer programs execute many small basic blocks (an average of 2.9 instructions per block), so there is little opportunity to schedule added instrumentation among the original program instructions. This is compounded by the fact that for purely integer codes, the UltraSPARC can launch at most two instructions in parallel, instead of its maximum of four instructions per cycle.

Neither of these problems should affect floating point programs, so it is at first surprising that they performed no better than the integer benchmarks. In fact, scheduling can hide a greater percentage of the instrumentation overhead for floating point programs, but performance is lost due to other factors. EEL's instruction scheduler is quite simple, as it only schedules within basic blocks using a simple heuristic algorithm. It does not perform as well as the optimizers in the Sun C and Fortran compilers that compiled the benchmarks. Since EEL schedules original program instructions as well as added instrumentation, it can produce a worse schedule for the original instructions while trying to hide the added overhead. Hence, the benefit of hiding overhead is lost to de-scheduling in the floating point benchmarks, with their highly optimized, long basic blocks.

To factor out the effect of EEL's scheduler, we performed a variation of the previous experiment on the UltraSPARC. First, EEL's scheduler rescheduled the benchmarks, without adding any instrumentation. Then slow profiling instrumentation was added, but not scheduled among the original instructions. Finally, both the added instrumentation and original instructions were scheduled together. Table 2 contains the results of this experiment. The results for integer benchmarks are similar to before (an average of 13% hidden), but the results for floating point benchmarks are better. Scheduling now hides 27% of instrumentation overhead for the floating point benchmarks, with no significant outliers.

Table 3 contains measurements for benchmarks compiled for and run on the SuperSPARC, and the results are similar to those obtained for the UltraSPARC. Scheduling hides an average of 11% of profiling overhead for the integer benchmarks, and 44% of the overhead for floating point benchmarks.

5. Conclusion

This paper investigated the benefits of combining instruction scheduling with executable editing. Measurements on the SPEC95 benchmarks show that on a 3-way superscalar machine, scheduling can hide an average of 11-44% of the overhead introduced by program profiling instrumentation. On a more modern, 4-way superscalar, the scheduler can hide an average of 16-17% of the profiling overhead. A sub-optimal scheduling algorithm limits the visible improvement from scheduling instrumentation, since it may reschedule original program instructions poorly.

Already, the benefits of scheduling program instrumentation are clear enough that existing and future instrumentation systems should adopt this simple technique to reduce instrumentation overhead. In addition, this approach promises to help reduce the cost of error checking, such as array bounds or null pointer tests, to a level at which it may routinely be included in production code.

Acknowledgments

Many thanks to Gurindar Sohi for the use of the UltraSPARC machine used to collect numbers for an early version of this paper, and to Bob Roessler for arranging the loan of the machine used in our initial UltraSPARC experiments. Mark Hill suggested that the scheduler model the L2 cache in the UltraSPARC. Vinod Grover and Kurt Goebel provided helpful comments on a draft of this paper.

This work is supported in part by Wright Laboratory Avionics Directorate, Air Force Material Command, USAF, under grant #F33615-94-1-1525 and ARPA order no. B550, an NSF NYI Award CCR-9357779, NSF Grants CCR-9101035 and MIP-9225097, DOE Grant DE-FG02-93ER25176, and donations from Digital Equipment Corporation and . The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Wright Laboratory Avionics Directorate or the U.S. Government.

References

[1] Ali-Reza Adl-Tabatabai, Geoff Langdale, Steven Lucco, and Robert Wahbe. Efficient and Language-Independent Mobile Programs. In Proceedings of the SIGPLAN '96 Conference on Programming Language Design and Implementation (PLDI), pages 127-136, May 1996.

[2] Thomas Ball and James R. Larus. Optimally Profiling and Tracing Programs. ACM Transactions on Programming Languages and Systems (TOPLAS), 16(4):1319-1360, July 1994.

[3] Zarka Cvetanovic and Dileep Bhandarkar. Characterization of the Alpha AXP Performance Using TP and SPEC Workloads. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 60-70, April 1994.

[4] Trung A. Diep, Christopher Nelson, and John Paul Shen. Performance Evaluation of the PowerPC 620 Microarchitecture. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 163-174, June 1995.

[5] Linley Gwennap. Intel's P6 Uses Decoupled Superscalar Design. Microprocessor Report, 9(2):9-15, February 16, 1995.

[6] John C. Gyllenhaal. A Machine Description Language for Compilation. Master's thesis, Department of Electrical Engineering, University of Illinois, Urbana, IL, September 1994.

[7] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 1990.

[8] Daniel R. Kerns and Susan J. Eggers. Balanced Scheduling: Instruction Scheduling When Memory Latency is Uncertain. In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation (PLDI), pages 278-289, June 1993.

[9] James R. Larus. Efficient Program Tracing. IEEE Computer, 26(5):52-61, May 1993.

[10] James R. Larus and Eric Schnarr. EEL: Machine-Independent Executable Editing. In Proceedings of the SIGPLAN '95 Conference on Programming Language Design and Implementation (PLDI), pages 291-300, June 1995.

[11] Alvin R. Lebeck and David A. Wood. Active Memory: A New Abstraction for Memory-System Simulation. In Proceedings of the 1995 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 220-230, May 1995.

[12] Harish Patil and Charles Fischer. Efficient Run-time Monitoring Using Shadow Processing. In 2nd International Workshop on Automated and Algorithmic Debugging (AADEBUG '95), St. Malo, France, May 1995.

[13] Todd A. Proebsting and Christopher W. Fraser. Detecting Pipeline Structural Hazards Quickly. In Conference Record of the Twenty-First Annual ACM Symposium on Principles of Programming Languages, pages 280-286, Portland, Oregon, January 1994.

[14] ROSS Technology, Inc. SPARC RISC User's Guide: hyperSPARC Edition, September 1993.

[15] Michael A. Schuette. Exploitation of Instruction-Level Parallelism for Detection of Processor Execution Errors. Ph.D. thesis, Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, January 1991.

[16] Sun Microsystems, Inc. UltraSPARC-I User's Manual, August 1995.

[17] Texas Instruments. SuperSPARC User's Guide, October 1993.

Appendix A: Function pipeline_stalls

unsigned long pipeline_stalls(unsigned long cycle,   // cycle when mi starts executing
                              UnitValues &state,     // current pipeline state
                              const mach_inst* mi)   // next instruction
{
    unsigned long stalls = 0;
    {{INST mi CATEGORY any::                 // All Spawn annotations now refer to instruction mi.
    unsigned long gid = {{GROUP}};           // mi's timing group
    long ii;

    // trace[] records the resources used by this instruction in the current cycle.
    unsigned long trace[{{UNITS COUNT}}];
    for(ii=0; ii<{{UNITS COUNT}}; ++ii) trace[ii] = 0;

    // Search for stalls
    unsigned long mi_cycle = 0;              // current cycle in mi's pipeline
    while(mi_cycle <= {{GRP gid CYCLES}}) {
        // units[] records the number of unused resources in this cycle
        // after allocating resources for all previous instructions.
        unsigned long* units = state[cycle];
        bool advance = true;

        // Test for structural hazards.
        if(advance)
            for(ii=0; ii<{{GRP gid ACQUIRE mi_cycle COUNT}}; ++ii) {
                unsigned long unit_val =
                    units[{{GRP gid ACQUIRE mi_cycle UNIT ii}}]
                    - trace[{{GRP gid ACQUIRE mi_cycle UNIT ii}}];
                if(unit_val < {{GRP gid ACQUIRE mi_cycle NUM ii}}) {
                    advance = false;
                    break;
                }
            }

        // Test for RAW hazards.
        if(advance)
            for(ii=0; ii<{{R READ COUNT}}; ++ii)
                if({{R READ ii TIME}} == mi_cycle
                   && cycle < state.write_cy[0][{{R READ ii}}]) {
                    advance = false;
                    break;
                }

        // ... Similar tests for WAR and WAW hazards.

        ++cycle;    // Advance the execution pipeline for previously scheduled instructions.

        // Advance instruction pipeline or record the stall
        if(advance) {
            for(ii=0; ii<{{GRP gid ACQUIRE mi_cycle COUNT}}; ++ii)
                trace[{{GRP gid ACQUIRE mi_cycle UNIT ii}}] +=
                    {{GRP gid ACQUIRE mi_cycle NUM ii}};
            ++mi_cycle;
            for(ii=0; ii<{{GRP gid RELEASE mi_cycle COUNT}}; ++ii)
                trace[{{GRP gid RELEASE mi_cycle UNIT ii}}] -=
                    {{GRP gid RELEASE mi_cycle NUM ii}};
        } else
            ++stalls;
    }
    ;;}}
    return stalls;
}
