Pipeline Behavior Prediction for Superscalar Processors by Abstract Interpretation

Jörn Schneider*                          Christian Ferdinand

Universität des Saarlandes               AbsInt Angewandte Informatik GmbH
Fachbereich Informatik                   Starterzentrum - Gebäude 45
Postfach 15 11 50                        Universität des Saarlandes
D-66041 Saarbrücken, Germany             D-66123 Saarbrücken, Germany
Phone: +49 681 302 3434                  Phone: +49 681 8318317
Fax: +49 681 302 3065                    Fax: +49 681 8318320
[email protected]                         [email protected]
http://www.cs.uni-sb.de/~js              http://www.AbsInt.de

Abstract

For real time systems not only the logical function is important but also the timing behavior; e.g. hard real time systems must react within their deadlines. To guarantee this it is necessary to know upper bounds for the worst case execution times (WCETs). The accuracy of the prediction of WCETs depends strongly on the ability to model the features of the target processor.

Cache memories, pipelines and parallel functional units are architectural components which are responsible for the speed gain of modern processors. It is not trivial to determine their influence when predicting the worst case execution time of programs.

This paper describes a method to predict the behavior of pipelined superscalar processors and reports initial results of a prototypical implementation for the SuperSPARC I processor.

1 Introduction

The correctness of a hard real-time system depends not only on the logical functionality; the programs must also satisfy the temporal requirements dictated by the physical environment. To analyze the schedulability of a given set of tasks, upper bounds for the worst case execution time (WCET) of the tasks are required. In general it is not possible to determine these upper bounds by measurements or simulation, because all possible combinations of input values must be considered. We use static program analysis to compute the WCET of programs. Our method is based on abstract interpretation.

To obtain sharp bounds for the WCET it is necessary to consider the features of the target processors. Modern processors gain speed through the use of cache memories and pipelines, and the parallel execution of instructions.

In this paper the prediction of the behavior of pipelines combined with parallel execution on a superscalar processor is presented. Initial results of a prototypical implementation for the SuperSPARC I processor [18] are presented.

*Partly supported by DFG (German Research Foundation), Transferbereich 14

LCTES '99 5/99 Atlanta, GA, USA
© 1999 ACM 1-58113-136-4/99/0006...$5.00

2 Superscalar Pipelines

2.1 Principle of Pipelines

The idea of pipelining is to overlap the execution of instructions. This is accomplished by dividing the execution of instructions into several steps (pipeline stages). The ideal case of fully overlapped instructions can not always be reached. Pipeline bubbles can occur in real pipelines, when the pipeline stalls. Situations which can lead to pipeline stalls are called pipeline hazards. There are three types of hazards: structural hazards are caused by resource conflicts, data hazards are caused by data dependences, and control hazards are caused by control flow changes (e.g. branches).

2.2 Superscalar Execution

A measure for the throughput of a processor is the CPI value (Cycles Per Instruction). With the concept of pipelining it is possible to reach at best a CPI value of 1.
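As a back-of-the-envelope sketch (our illustration, not from the paper): on an ideal, never-stalling pipeline with k stages, the first instruction needs k cycles to drain through all stages and each further instruction completes one cycle later, so the CPI approaches 1 for long instruction sequences.

```python
# Hypothetical sketch: cycle count and CPI of an ideal k-stage pipeline
# executing n independent instructions (no stalls, full overlap).

def ideal_pipeline_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles for n instructions: k to fill the pipeline, then 1 each."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)

def cpi(n_instructions: int, n_stages: int) -> float:
    """Cycles per instruction; converges to 1 as n grows."""
    return ideal_pipeline_cycles(n_instructions, n_stages) / n_instructions

# Without pipelining each instruction would take n_stages cycles (CPI = k).
print(cpi(5, 5))     # 9 cycles for 5 instructions -> 1.8
print(cpi(1000, 5))  # -> 1.004
```

Any stall (hazard) inserts bubbles and pushes the CPI back above this ideal value.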

To increase the throughput, i.e. decrease the CPI value further, it is necessary to introduce parallel execution. The parallelization can be done statically or dynamically. The static approach is used e.g. in VLIW (Very Large Instruction Word) processors [8], where the compiler is responsible for selecting the instructions to execute simultaneously. In superscalar processors the dynamic approach is used, i.e. the processor selects the instructions to be processed concurrently during run time.

Stage  Tasks
F0     - Fetch of up to 4 instr. in case of cache hit, or up to 8 instr. from memory
F1     - Filling of the prefetch queue
D0     - Selection of instructions from prefetch queue to form a group of concurrently executable instructions
D1     - Assignment of functional units and register ports to instructions
       - Computation of branch destination addr.
       - Reading of address registers for memory accesses
D2     - Reading of operand registers
       - Computation of virtual addresses for memory references
E0     - Execution of computations in Arithmetic Logic Unit 1 (ALU1)
       - Start of data cache accesses
       - Transfer of floating point instructions to the FPU
E1     - Computations that depend on instructions in the same group are executed in the cascaded ALU2
       - Completion of data cache accesses
WB     - Write back of integer results

Table 1: Stages of the integer pipeline.

Stage  Tasks
FD     Decode of floating point operations
FRD    Read of FP registers
FE     Execution of FP instructions
FL     Rounding and normalization of results
FWB    Write back of floating point results

Table 2: Stages of the floating point pipeline.

2.3 Example: SuperSPARC I

The SuperSPARC I is a superscalar RISC processor, compatible with the SPARC version 8 architecture [19]. It is capable of grouping up to three instructions to execute them concurrently. The grouping of instructions is a dynamic process. The decision which instructions are grouped together depends on the availability of instructions in the prefetch queue or the cache, and on data dependences, resource conflicts, and control flow changes. The data dependences and the resource conflicts can occur between the available instructions and between these and already grouped instructions.

The SuperSPARC I has a main or integer pipeline and a floating point pipeline. The pipeline stages of the main and floating point pipeline are displayed in Table 1 and Table 2.

The processing of a floating point instruction starts in the first part of the main pipeline and continues in the floating point pipeline. The transfer from the integer pipeline to the floating point pipeline is done in E0 (FD).

The execution time of some floating point instructions depends on their operands. If the operands or the result are subnormal values the computation takes longer (see [8]).

3 Pipeline Analysis by Abstract Interpretation

Abstract interpretation is a well developed theory of static program analysis [3]. It is semantics based, thus supporting correctness proofs of program analyses. Abstract interpretation amounts to performing a program's computations using value descriptions or abstract values in place of concrete values.

One reason for using abstract values instead of concrete ones is computability: to ensure that analysis results are obtained in finite time. Another is to obtain results that describe the result of computations on a set of possible (e.g., all) inputs.

The behavior of a program (including its pipeline behavior) is given by the (formal) semantics of the program. To predict the pipeline behavior of a program, its "collecting semantics"¹ is approximated. The collecting semantics gives the set of all program (pipeline) states for a given program point. Such information combined with a path analysis can be used to derive WCET bounds of programs. This approach has successfully been applied to predict the cache behavior of programs [4, 6, 20].

¹In [3], the term "static semantics" is used for this.
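As a toy illustration of this idea (ours, not the paper's): a "collecting" evaluation runs an operation on a whole set of possible inputs at once, yielding the set of all results that could occur at a program point.

```python
# Illustrative sketch: abstract interpretation replaces computations on
# single concrete values by computations on descriptions of value sets.
# Here the "description" is simply the set itself.

def collecting_eval(op, input_set):
    """Apply a concrete operation to every value in a set of inputs."""
    return {op(x) for x in input_set}

square = lambda x: x * x
# All results the program point could produce for inputs {-2, ..., 2}:
print(collecting_eval(square, {-2, -1, 0, 1, 2}))  # {0, 1, 4}
```

For pipeline analysis the elements of these sets are not numbers but concrete pipeline states, as defined in Section 4.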

The approach works as follows: in a first step, the "concrete pipeline semantics" of programs is defined. The concrete pipeline semantics is a simplified semantics that describes only the interesting aspects of the pipeline behavior but ignores other details of execution, e.g. register values, results of computations, etc. In this way each real pipeline state is represented by a concrete pipeline state. In the next step, we define the "abstract pipeline semantics" that "collects" all occurring concrete pipeline states for each program point.

From the construction of the abstract semantics follows that all concrete pipeline states that are included in the collecting semantics for a given program point are also included in the abstract semantics [4]. An abstract pipeline state at a program point may contain concrete pipeline states that cannot occur at this point, due to infeasible paths. This can reduce the precision of the analysis but doesn't affect the correctness (see [4]).

The computation of the abstract semantics has been implemented with the help of the program analyzer generator PAG [14], which allows to generate a program analyzer from a description of the abstract domain (here, sets of concrete pipeline states) and of the abstract semantic functions.

4 Pipeline Semantics

Before the pipeline semantics is presented, which allows to analyze the behavior of superscalar pipelines, we try to motivate its design. The pipeline semantics must at least allow us to detect stalls and to model the selection of instructions for concurrent execution. Therefore, the detection of hazards must be possible and the information which instructions are available must be provided.

To detect structural hazards the resource usage of instructions has to be known. For most modern processors, memory access is too slow to cause data hazards. Dependences over cache accesses can be treated within a data cache/store buffer analysis [6]. It is assumed that data hazards occur only in case of dependences between data registers. They can be detected by modeling read and write ports of registers as resources. Control hazards can also be detected with information about resource usage. An instruction that changes the control flow must write to a special resource, e.g. the next program counter register.

To know which instructions are available the cache behavior and the state of the prefetch queue should be known. The cache behavior is predicted by a separate cache analysis [5]. To model the prefetch queue the pipeline semantics must allow the description of resources with their own state.

Our approach is based on the pipeline analysis framework in [4]. We consider an in-order superscalar processor that can execute a group of up to N instructions concurrently. For superscalar processors it is not sufficient to build a kind of static reservation table, since the assignment of resources to pipeline stages of instructions can change dynamically during the grouping process. Additionally the state of some resources must be provided.

4.1 Concrete Pipeline Semantics

Definition 4.1 (resource association)
Let R = {r_1, ..., r_n} be the set of resource types and resources of the processor. Let PS be the set of pipeline stages. A pair (s, {r_j1, ..., r_jm}) with s ∈ PS and r_j1, ..., r_jm ∈ R is a resource association. RA = (PS × 2^R) denotes the set of all resource associations. □

Definition 4.2 (resource association sequence)
A sequence r̄ ∈ RA* is a resource association sequence. Let "." be the concatenation operator for resource association sequences. □

Definition 4.3
A resource demand sequence is a resource association sequence describing the statically given resource demand of an instruction (type).
A resource allocation sequence is a resource association sequence describing an actual assignment of resources to an instruction. It depends on the current state of the pipeline. □

Resource allocation sequences always start with the resource allocation for the current pipeline stage, i.e. resource allocations of previous pipeline stages are removed when the instruction advances through the pipeline.

Example 4.1
Consider an instruction with the following resource demand sequence:

(s1, {r_x1, ..., r_xb}) . (s2, {r_y1, ..., r_yc}) . (s3, {}) . (s4, {r_z1, ..., r_zd}) . (s4, {r_z1, ..., r_zd}) . ...

This instruction needs the resource types or resources {r_x1, ..., r_xb} in pipeline stage s1, before {r_y1, ..., r_yc} are needed in stage s2. The instruction requires no resources in stage s3. It stays two cycles in s4 and needs {r_z1, ..., r_zd} both times before it continues through the remaining stages. □

The resource demand sequence of an instruction depends only on the instruction type (e.g. ADD or DIV) and the operand types (e.g. register or immediate value). How the resource demand of an instruction can be satisfied depends on the actual situation in the pipeline.
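As a hedged illustration (our own representation, not the paper's data structures), the demand sequence of Example 4.1 can be encoded as a list of (stage, resource set) pairs, with advancement through the pipeline dropping the leading association:

```python
# Sketch: a resource demand sequence as a list of resource associations,
# i.e. (pipeline stage, set of demanded resources) pairs. Stage names
# s1..s4 and resource names are the placeholders from Example 4.1.

demand_sequence = [
    ("s1", frozenset({"rx1", "rx2"})),   # resources needed in s1
    ("s2", frozenset({"ry1"})),          # resources needed in s2
    ("s3", frozenset()),                 # no resources demanded in s3
    ("s4", frozenset({"rz1"})),          # first cycle in s4
    ("s4", frozenset({"rz1"})),          # second cycle in s4
]

def advance(allocation_sequence):
    """Resource allocation sequences start at the current stage: drop the
    association of the stage the instruction has just left."""
    return allocation_sequence[1:]

current = advance(demand_sequence)
print(current[0])   # ('s2', frozenset({'ry1'}))
```

A repeated stage entry, as for s4 above, directly models an instruction staying several cycles in one stage.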

The initial resource allocation sequence of an instruction can differ from the resource demand sequence of the type of this instruction. Where multiple resources of a resource type are available a particular instance is chosen.

A concrete pipeline state describes the occupancy of the pipeline stages by instructions, the current and future resource allocations for these instructions, and the state of some special resources, e.g. the prefetch queue.

Definition 4.4 (concrete pipeline state)
A concrete pipeline state p consists of the resource allocation sequences of the up to N * |PS| (up to N instructions per pipeline stage) instructions which are currently in the pipeline, R# = ((RA*)^N)^|PS|, and the state s_R of some resources, i.e. p = (r#, s_R). P denotes the set of all possible concrete pipeline states¹. □

The concrete pipeline state changes when a new instruction enters the pipeline. The resulting new pipeline state depends on the previous pipeline state, on the (resource demand sequence of the) new instruction, and on the states of other processor parts (e.g. the state of the cache memory).

Definition 4.5 (update function)
Let IS be the instruction set of the processor. The (concrete) update function U : P × IS → P models the effect on the concrete pipeline state caused by the entrance of a new instruction into the pipeline. □

Definition 4.6 (cycles function)
The cycles function C : P × IS → ℕ0 computes the number of cycles needed by a new instruction to enter the pipeline, i.e. the number of cycles needed to reach the pipeline state U(p, i). □

Definition 4.7 (empty function)
The pipeline empty function E : P → ℕ0 computes the number of cycles which are needed to flush the pipeline, i.e. the number of cycles needed to reach the empty pipeline state p_ε. □

4.2 Control Flow Representation

Programs are represented by control flow graphs consisting of nodes and typed edges. The nodes represent instructions. Each instruction is statically assigned a resource demand sequence, i.e. there exists a mapping from control flow nodes to resource demand sequences: res : V → RA*.

The update function U is extended to sequences of instructions:

U(p, (i1, ..., ik)) = U(...U(U(p, i1), i2), ..., ik)

The pipeline behavior of a path (i1, ..., il) in the control flow graph is given by applying U to the empty pipeline state p_ε and the concatenation of all instructions paired with the appropriate processor state information along the path:

U(p_ε, (i1, ..., il))

4.3 Abstract Semantics

There are only finitely many concrete pipeline states and their representation is usually small. Therefore, sets of concrete pipeline states² can be used as the domain for our abstract interpretation, and we do not need space efficient descriptions of sets of concrete pipeline states.

Definition 4.8 (abstract pipeline state)
An abstract pipeline state p̂ ∈ P̂ is a set of concrete pipeline states. P̂ = 2^P denotes the set of all abstract pipeline states. □

The abstract version of the concrete pipeline update function is a canonical extension of the concrete pipeline update function to sets: Û(p̂, i) = {U(p, i) | p ∈ p̂}

Definition 4.9 (pipeline join function)
A join function combines two abstract pipeline states. The join function is given by the least upper bound of the abstract domain. The pipeline join function Ĵ : P̂ × P̂ → P̂ is set union: Ĵ(p̂1, p̂2) = p̂1 ∪ p̂2 □

The join function is used to union the abstract pipeline states of two or more merging paths. To combine more than two values the join function is extended:

Ĵ(p̂1, ..., p̂k) = p̂1 ∪ ... ∪ p̂k

4.4 Pipeline Analysis

In order to solve the pipeline analysis for a program, one can construct a system of recursive equations from its control flow graph. In the program analyzer generator PAG this is only done implicitly. The variables in the equation system stand for abstract pipeline states for program points. For every control flow node k, representing instruction i_k, there is an

equation p̂_k = Û(pred(k), i_k). If k has only one direct predecessor k', then pred(k) = p̂_k'. If k has more than one direct predecessor, pred(k) is the join of the abstract pipeline states of all direct predecessors.

¹This set is finite. The length of the resource allocation sequences is limited, because the number of pipeline stages is finite and the number of repetitions of pipeline stages is also limited. The sets of possible states of resources are finite.

²The domains of the abstract semantics and the collecting semantics are equal. But an abstract pipeline state may contain concrete pipeline states that do not occur in the respective collecting semantics, due to infeasible paths.
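Definitions 4.8 and 4.9 together with the equation system suggest a small executable model. The following sketch is our illustration, not the paper's implementation: a toy concrete update that merely counts cycles, the abstract update as the pointwise extension of U to sets, the join as set union, and the equation system solved by fixpoint iteration over the control flow graph.

```python
# Illustrative sketch: abstract pipeline states are plain Python sets of
# concrete states; the toy concrete update U below is just a cycle counter.

def abstract_update(abstract_state, instruction, U):
    """U^(p^, i) = { U(p, i) | p in p^ }  (Section 4.3)."""
    return {U(p, instruction) for p in abstract_state}

def join(*abstract_states):
    """J^(p^1, ..., p^k) = p^1 u ... u p^k  (Definition 4.9)."""
    return set().union(*abstract_states)

def solve(nodes, preds, instr, U, entry_state):
    """Fixpoint iteration for the equations p^_k = U^(pred(k), i_k)."""
    state = {k: set() for k in nodes}
    changed = True
    while changed:
        changed = False
        for k in nodes:
            inp = entry_state if not preds[k] else join(
                *(state[q] for q in preds[k]))
            new = abstract_update(inp, instr[k], U)
            if new != state[k]:
                state[k], changed = new, True
    return state

# Toy example: a diamond-shaped control flow graph a -> b, a -> c, b/c -> d.
U = lambda p, i: p + (2 if i == "LD" else 1)   # LD "costs" an extra cycle
nodes = ["a", "b", "c", "d"]
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
instr = {"a": "ADD", "b": "LD", "c": "ADD", "d": "ADD"}
print(solve(nodes, preds, instr, U, {0}))
# {'a': {1}, 'b': {3}, 'c': {2}, 'd': {3, 4}}
```

The join at node d keeps both candidate states instead of picking one; this is exactly how infeasible-path imprecision can enter the abstract semantics without hurting correctness.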


    newgrp_stat := closed;
  fi
fi
grp_stat := newgrp_stat;
return(p);

The grouping process is modeled by a set of rules. These rules are based upon the resource demand sequences of available instructions and the resource allocation sequences of instructions which are already further in the pipeline.

Example 4.2
One of the rules that are applied in res_confl() (position (A)) in the update function example says that two instructions that access the data cache cannot be grouped together.
A rule applied in data_dep() (position (B)) says that instructions which access a read port of a data register in stage D1 (i.e. load or store instructions) cannot be grouped with a preceding instruction that uses the write port of this particular data register.
The rule applied at position (C) says that an instruction which uses the write port of the next program counter register (e.g. branches) is always the last in a group.
These rules prevent problems arising from resource conflicts, data dependences, and control flow changes respectively. □

While the application of all these rules can be triggered by the resource allocations of instructions, it is not always necessary to exhaustively search the resource allocation or resource demand sequences for a triggering resource. From the resource demand sequences of the instruction types some necessary preconditions can be precomputed.

Example 4.3
Consider the first rule of Example 4.2. Only the resource allocation sequences of the various types of load and store instructions contain the data cache. Therefore, this rule can be triggered just by looking at the instruction type. □

The stall behavior of the SuperSPARC I is also modeled by rules.

Example 4.4
One of the rules applied in data_hazard() (position (D)) in the update function example says that a pipeline bubble has to be inserted between a group wherein the write port of a data register R_x and the ALU in stage E1 is used and a group which uses the read port of R_x in stage D1. □

In some special cases the documentation of the SuperSPARC I was insufficient for our purposes. Therefore pessimism is introduced in the analyzer. This pessimism results in pessimistic concrete pipeline states. A pessimistic concrete pipeline state contains more resources than the instruction actually uses or allocates resources for more pipeline stages than are actually occupied by the instruction.

Example 4.5
Consider the following SuperSPARC I instruction sequence:

A  8000: ADD R1,R2,R3
B  8004: ADD R3,R3,R4
C  8008: LD  [R4+4],R3
D  8012: SUB R6,R3,R7

The resource demand sequences for these instructions are shown in Table 3. R_x^r and R_x^w are the read and write ports of data register x. DC stands for the data cache and ALU for the resource type arithmetic logic unit.

[Table 3: Resource demand sequences of Example 4.5.]

Figure 1 shows the dynamic change of the resource allocation sequences and the prefetch queue state for this example. The prefetch queue state is modeled by a sequence of instruction addresses (shown in the box under the appropriate call to the update function). The first address is on top of the queue.

In pipeline state p1 a new group is created, which contains just A. The prefetch queue is filled from the cache and A is dequeued. In p2, B joins the group of A. A dynamic change of the resource allocation against the statically assigned resource demand for instruction B occurs: since the result of B depends on the result of A, B must use the cascaded ALU (ALU2) and not ALU1. B is dequeued from the prefetch queue. In p3 a pipeline bubble is inserted and a new group is started for C. The bubble is necessary, since the LD instruction depends on a result of the cascaded ALU in E1, which otherwise could not be forwarded. A and B advance four pipeline stages (two cycles). C is dequeued from the prefetch queue. In p4, D starts a new group since it depends on C, which forwards its result from E1. D is dequeued from the prefetch queue, which is empty then. □

[Figure 1: Application of the concrete update function on Example 4.5. The figure shows, for each call to the update function, the resource allocation sequences over the stages F0, F1, D0, D1, D2, E0, E1 and the prefetch queue contents: p1 = U(p_ε, A, hit) (A starts new group, queue [8004, 8008, 8012]), p2 = U(p1, B, hit) (grouping A and B, queue [8008, 8012]), p3 = U(p2, C, hit) (C causes stall), p4 = U(p3, D, hit) (D starts new group).]

5 Out of Order Execution

Processors with out of order execution are more flexible in choosing instructions for parallel execution. If the foremost instruction in the row doesn't fit, they will try the next one. Of course, out of order pipelines too have to check for hazards.

As stated above, our semantics is suitable for superscalar pipelines with in-order execution. As a matter of fact, the above semantics is also suitable for pipelines with out of order execution. The decision of an out of order execution machine to choose a subset from the available instructions is based on the resource requirements of the instructions. Since our semantics is based on resources, no changes are needed to support out of order execution. Nevertheless the concrete semantic functions, and the cycles and empty functions, must be adapted for a new target processor.

6 Practical Experiments

In this section the results of a first implementation of the pipeline analysis are presented. We have chosen the SuperSPARC I processor as target. The implementation has been done with the program analyzer generator PAG, i.e. the pipeline analyzer is generated from a description of the semantic functions. For the sake of space, this description isn't shown here (for further information see [18]).

The analyzer takes as input the control flow graph of a program and the results of the cache analysis [5]. We conservatively assume that memory references that are not classified as hits by the cache analysis are cache misses¹. The output of the analyzer is a mapping map of instruction/context pairs to pairs of clock cycles. The first element of a clock cycle pair is the result of the cycles function applied to the abstract pipeline state as shown in Section 4.4. The second element is either the result of the empty function applied to the abstract pipeline state for exit instructions² or zero for all other instructions.

A context represents the execution stack, i.e. the trace of function calls and loops along the corresponding path in the control flow graph to the instruction. Let IC be the set of all instruction/context pairs.

map : IC → ℕ0 × ℕ0

In general it is impractical to regard all possible contexts (paths) of a program. The cache behavior (and thereby the pipeline behavior) often exhibits significant differences caused by initialization effects. Therefore, first iterations of each loop are distinguished from further iterations, and first calls to each (recursive) function from recursive calls. Instead of distinguishing all paths, path classes are considered for which similar behavior is expected. This is realized by a generalization of well known interprocedural analysis schemes [15]. The approach is called VIVU (Virtual Inlining of non-recursive functions and Virtual Unrolling of loops and recursive functions).

The frontend of the analyzer reads a Sun SPARC executable in a.out format. The implementation is based on the EEL library of the Wisconsin Architectural Research Tool Set (WARTS).

The worst case execution profile of a program determines how often each instruction/context pair is maximally encountered during the execution of a program. By combination with the results of our pipeline analysis the worst case number of clock cycles needed to execute the input program can be estimated.

¹For the SuperSPARC I this assumption is safe, since cache misses don't lead to accelerations.
²An exit instruction is a last instruction of the program.
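The combination of the analyzer's output with the worst case execution profile can be sketched as follows (our notation; the context names are illustrative): map assigns each instruction/context pair a pair of clock cycles, the profile assigns each pair a maximal execution count, and the WCET estimate is the weighted sum.

```python
# Sketch of the combination step: cycle_map plays the role of
# map : IC -> N0 x N0 (cycles to enter the pipeline, cycles to flush it
# for exit instructions), profile the role of profile : IC -> N0.

def wcet_cycles(cycle_map, profile):
    """Sum over all instruction/context pairs: count * (enter + flush)."""
    return sum(profile.get(ic, 0) * (enter + flush)
               for ic, (enter, flush) in cycle_map.items())

cycle_map = {
    ("i1", "loop.first"): (2, 0),   # first loop iteration (VIVU context)
    ("i1", "loop.rest"):  (1, 0),   # cheaper in later iterations
    ("i2", "exit"):       (1, 4),   # exit instruction: empty() = 4
}
profile = {("i1", "loop.first"): 1, ("i1", "loop.rest"): 9, ("i2", "exit"): 1}
print(wcet_cycles(cycle_map, profile))   # 1*2 + 9*1 + 1*(1+4) = 16
```

Splitting the loop into first/rest contexts is what lets the initialization effects (cold caches, empty pipeline) be charged only once instead of on every iteration.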

For our experiments "exact" execution profiles are used instead of deriving them via a path analysis. This allows us to assess the effectiveness of the pipeline analysis without the influence of possibly pessimistic path analyses. The profilers used to create the profiles are produced with the help of qpt2 (Quick program Profiler and Tracer) [1] that is part of the WARTS distribution. An execution profile maps instruction/context pairs to execution counts:

profile : IC → ℕ0

We have chosen four small programs (see Table 4) to test our implementation.

Program   Comment
lsimple   simple for loop
matmult   matrix multiplication
fibr      recursive computation of the 23. Fibonacci number
pi        approximative computation of Pi

Table 4: List of test programs.

6.1 Improvements by the pipeline analysis

To show the effectiveness of our pipeline analysis we compare the results of the combined instruction cache/pipeline analysis¹ with a (virtual) analysis without cache and pipeline behavior prediction, and with our cache analysis. The CPI (Cycles Per Instruction) values of the different analyses are compared². For the SuperSPARC I the best CPI value that an analysis without cache and pipeline behavior prediction can reach is 13³ (assuming no overlap of instructions and 100% instruction cache miss rate). The best CPI value that can be reached by a cache analysis alone is 4.

Table 5 displays the improvement by the pipeline analysis. In the second column the CPI value according to the combined cache/pipeline analysis is shown. The improvement factor of the cache analysis against the best possible results of an analysis without cache/pipeline behavior prediction is shown in the third column. The additional improvement factor of the combined instruction cache/pipeline analysis against an optimal instruction cache analysis can be found in the fourth column. The last column shows the improvement of our combined cache/pipeline analysis against the best possible results of an analysis without cache/pipeline behavior prediction.

Program   CPI     Cache Analysis   Additional improv.   Combined
                  improv. factor   by Pipeline A.       improv. factor
lsimple   0.556   3.24             7.19                 23.29
matmult   3.436   3.24             1.16                 3.75
fibr      1.800   3.24             2.22                 7.19
pi        3.138   3.24             1.27                 4.11

Table 5: Improvements by the pipeline analysis.

¹100% data cache miss rate for load instructions and 0% for stores are assumed because of the SuperSPARC store buffer.
²We also did some measurements on a Sun SPARCstation 10 with a SuperSPARC I under NetBSD. However, these measurements are only statistical results, since we couldn't yet measure without the influence of a non real time multiuser/multitasking operating system. But we believe that the statistically approximated run time values give a fairly good impression of the capabilities of our pipeline analysis. The ratios of predicted run time and statistically evaluated measurements are 1.01 (lsimple), 1.10 (matmult), 1.03 (fibr) and 2.40 (pi with normal operands).
³Cache miss penalty of 9 cycles plus a minimum of 4 cycles (8 half cycle pipeline stages) for an integer instruction.

7 Related Work

Healy, Whalley and Harmon developed an approach [7] to predict worst case execution times in the presence of instruction caches and simple pipelines. Their analyzer predicts the WCET of a user specified program part. The results of a preceding cache analysis are used. Their target processor is a MicroSPARC I. Since the pipeline of this processor is pretty simple, they can limit the used resources to registers and pipeline stages.

For each instruction type several pieces of information must be presented to the analyzer. These are the first and the last pipeline stage from or to which forwarding is possible and the maximum number of clock cycles per pipeline stage. Each instruction is assigned the registers it uses and the result of the preceding cache analysis.

The analysis of a program path is done by repeated concatenation of instructions. Healy et al. use a bottom up algorithm to apply their approach to programs with loops. For the analysis of the innermost loop all paths through it are merged. The results of this analysis are used in the next higher loop level as if the inner loop was a single instruction. The distinction between first and other loop iterations is not done explicitly but by the cache classifications (first miss, first hit, always miss, always hit). For the step from an inner to an outer loop it can be necessary to apply adjustments to the result of the analysis. To avoid underestimations in the presence of pipelines they have to use a trick which involves adding miss penalties and subtracting them later at outer loop levels. This trick doesn't work with

superscalar processors, since for superscalar pipelines a cache hit can result in a speed gain that is higher than the miss penalty, due to grouping effects [18].

Another approach to predict the WCET of real time programs is presented in [10] by Li, Malik and Wolfe. This is an integrated solution, where both program path analysis and cache and pipeline behavior prediction are based on integer linear programs. The target processor is the i960KB from Intel, which has a pretty simple pipeline. Thus Li, Malik and Wolfe can limit their efforts to the detection of structural hazards. While analyzing simple pipelines and direct-mapped caches is fast, increasing levels of cache associativity sometimes lead to prohibitively high analysis times.

Widely based on [10], Ottoson and Sjödin [17] have developed a framework to estimate WCETs for architectures with pipelines, instruction and data caches. To predict the pipeline behavior they also use a kind of path concatenation like Healy et al., where the maximal overlapping of two instructions is computed. Ottoson and Sjödin restrict themselves to pipeline stages as resources. In an experiment to predict the cache behavior of a very small program they report analysis times of several hours.

In [11] Lim et al. describe a general framework for the computation of WCETs of programs in the presence of pipelines and caches. To model the pipeline behavior they construct a reservation table of resources for each instruction. Registers and pipeline stages are regarded as resources. Lim et al. also use a kind of path concatenation. They use a bottom up algorithm that starts with isolated program constructs. A new reservation table is computed each time an instruction (or a path) is appended to a path. The reservation tables are shortened if possible, by keeping only information from the beginning and the end of the path. Lim et al. focus on the R3000 processor from MIPS, which has a simple five level integer pipeline.

The more recent work of Lim, Han, Kim and Min [12] replaces the reservation tables by instruction dependence graphs. In this work they focus on a virtual processor with an idealized multiple issue pipeline. Like in [11], concatenation and pruning of paths is done during the execution of the bottom up algorithm. Lim, Han, Kim and Min don't consider caches or prefetch queues.

Lundqvist and Stenström describe a simulation based approach in [13]. The theoretical advantage of a simulation is that the values of all operands are known and infeasible paths can thereby be eliminated. Usually this is also the drawback of a simulation approach, since the input is generally unknown. To circumvent this problem Lundqvist and Stenström introduce unknown values, i.e. their simulation is capable of handling programs even if the input values are not known. The introduction of unknown values leads to several problems. For instance, if the target address of a store instruction happens to be an unknown value, the whole main memory becomes unknown. Lundqvist and Stenström try to shrink this problem by reducing the amount of affected memory through relinking of programs in case of statically linked routines. To circumvent the simulation of each path in a loop iteration a path merging strategy is used. The merging of paths leads to loss of information, i.e. to unknown values. The authors report that this loss of information frequently leads to non-termination of the simulation, even if the simulated program terminates. The detection of infeasible paths is also affected by the information loss. The target processor is a PowerPC.

We are aware of two retargetable pipeline analysis approaches that are based on Maril (Marion's machine description language) of the Marion [2] system of David G. Bradlee. Hur et al. [9] have developed a retargetable timing analyzer that has been used to generate analyzers for the MIPS R3000/R3100 and the Motorola 88000. Narasimhan and Nilsen describe in [16] a retargetable tool called pipeline simulator compiler that determines the number of cycles necessary to execute a given instruction sequence assuming 100% cache hits. For their tool there are processor descriptions modeling the pipeline behavior of the MIPS R2000, the PowerPC 601, and a SPARC computer architecture. For a more detailed discussion of these approaches see [4].

8 Conclusion and Future Work

We have shown that the semantics based approach can be used to predict the behavior of modern pipelines. The presented semantics was designed for superscalar processors, but is also suitable to model out of order execution processors.

The results of our first implementation for the SuperSPARC I processor show a clear improvement against the naive approach.

Our implementation has shown that with the VIVU approach it is possible to realize the instruction cache and pipeline behavior prediction independently without significant loss of accuracy. The advantages of this approach are that there is no need to bother about worst case paths during our pipeline analysis, since our results reflect all paths, and that a subsequent path analysis can access context specific information about the behavior of instructions.

Future work includes the development of pipeline analyses for other processors, especially for processors with out of order execution. Work on the integration of the pipeline analysis with the data cache analysis is also in progress. For target systems with out of order execution of data memory accesses it is not sufficient to

treat data cache and pipeline behavior prediction independently.

Additionally, it is planned to integrate the pipeline analysis with the path analysis, as has been done for the cache analysis [20].

A further step can be to incorporate the results of an analysis of floating point operands. This can be important for processors like the SuperSPARC I, which show significantly different execution times for floating point operations depending on their operand values.

Our goal is to develop a set of tools that allow the creation of a pipeline analysis for a new processor from a concise description of this processor.

Acknowledgements

Many members of the compiler design group at the Universität des Saarlandes, especially the members of the USES (Universität des Saarlandes Embedded Systems) group, deserve acknowledgement. Reinhard Wilhelm provided many valuable hints and suggestions, and has carefully read draft versions of this work. Florian Martin provided his vast knowledge about PAG and adjusted this tool for the special needs of the pipeline analysis. We would like to thank Mark D. Hill, James R. Larus, Alvin R. Lebeck, Madhusudhan Talluri, and David A. Wood for making available the Wisconsin architectural research tool set (WARTS), and the anonymous reviewers for their helpful comments.

References

[1] T. Ball and J. Larus. Optimally Profiling and Tracing Programs. In Proceedings of the 19th ACM Symposium on Principles of Programming Languages, pages 59-70, Jan. 1992.
[2] D. G. Bradlee. Retargetable Instruction Scheduling for Pipelined Processors. PhD Thesis, Technical Report 91-08-07, University of Washington, 1991.
[3] P. Cousot and R. Cousot. Abstract Interpretation: A Unified Lattice Model for Static Analysis of Programs by Construction or Approximation of Fixpoints. In Proceedings of the 4th ACM Symposium on Principles of Programming Languages, pages 238-252, Los Angeles, CA, 1977.
[4] C. Ferdinand. Cache Behavior Prediction for Real-Time Systems. Dissertation, Universität des Saarlandes, Sept. 1997.
[5] C. Ferdinand, F. Martin, and R. Wilhelm. Applying Compiler Techniques to Cache Behavior Prediction. In Proceedings of the ACM SIGPLAN Workshop on Language, Compiler and Tool Support for Real-Time Systems, pages 37-46, June 1997.
[6] C. Ferdinand and R. Wilhelm. On predicting data cache behavior for real-time systems. In Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems, volume 1474 of Lecture Notes in Computer Science, pages 16-30. Springer, 1998.
[7] C. A. Healy, D. B. Whalley, and M. G. Harmon. Integrating the Timing Analysis of Pipelining and Instruction Caching. In Proceedings of the IEEE Real-Time Systems Symposium, pages 288-297, Dec. 1995.
[8] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc., 2nd edition, 1996.
[9] Y. Hur, Y. H. Bea, S.-S. Lim, B.-D. Rhee, S. L. Min, Y. C. Park, M. Lee, H. Shin, and C. S. Kim. Worst Case Timing Analysis of RISC Processors: R3000/R3010 Case Study. In Proceedings of the IEEE Real-Time Systems Symposium, pages 308-319, Dec. 1995.
[10] Y.-T. S. Li, S. Malik, and A. Wolfe. Efficient Microarchitecture Modeling and Path Analysis for Real-Time Software. In Proceedings of the IEEE Real-Time Systems Symposium, pages 298-307, Dec. 1995.
[11] S.-S. Lim, Y. H. Bae, G. T. Jang, B.-D. Rhee, S. L. Min, C. Y. Park, H. Shin, K. Park, S.-M. Moon, and C. S. Kim. An Accurate Worst Case Timing Analysis for RISC Processors. IEEE Transactions on Software Engineering, 21(7):593-604, July 1995.
[12] S.-S. Lim, J. H. Han, J. Kim, and S. L. Min. A worst case timing analysis technique for multiple-issue machines. In Proceedings of the IEEE Real-Time Systems Symposium '98, 1998.
[13] T. Lundqvist and P. Stenström. Integrating path and timing analysis using instruction-level simulation techniques. In Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems, volume 1474 of Lecture Notes in Computer Science, pages 1-15. Springer, 1998.
[14] F. Martin. PAG - an efficient program analyzer generator. International Journal on Software Tools for Technology Transfer, 2(1):46-67, 1998.
[15] F. Martin, M. Alt, R. Wilhelm, and C. Ferdinand. Analysis of Loops. In Proceedings of the 7th International Conference on Compiler Construction, volume 1383 of Lecture Notes in Computer Science. Springer, March/April 1998.
[16] K. Narasimham and K. D. Nilsen. Portable Execution Time Analysis for RISC Processors. In Proceedings of the ACM SIGPLAN Workshop on Language, Compiler and Tool Support for Real-Time Systems, 1994.
[17] G. Ottoson and M. Sjödin. Worst-Case Execution Time Analysis for Modern Hardware Architectures. In Proceedings of the ACM SIGPLAN Workshop on Language, Compiler and Tool Support for Real-Time Systems, pages 47-55, June 1997.
[18] J. Schneider. Statische Pipeline-Analyse für Echtzeitsysteme. Diplomarbeit, Universität des Saarlandes, Oct. 1998.
[19] SPARC International, Inc., Menlo Park, California, U.S.A. The SPARC Architecture Manual, Version 8, 1992.
[20] H. Theiling and C. Ferdinand. Combining Abstract Interpretation and ILP for Microarchitecture Modelling and Program Path Analysis. In Proceedings of the IEEE Real-Time Systems Symposium '98, pages 144-153, Dec. 1998.
