
The PowerPC 604 RISC Microprocessor

The PowerPC 604 RISC microprocessor uses out-of-order execution techniques to extract instruction-level parallelism. Its nonblocking execution pipelines, fast branch misprediction recovery, and decoupled memory queues support speculative execution.

S. Peter Song
Marvin Denman
Joe Chang
Somerset Design Center

The 604 microprocessor is the third member of the PowerPC family being developed jointly by Apple, IBM, and Motorola. Developed for use in desktop personal computers, workstations, and servers, this 32-bit implementation works with the software and systems designed for the PowerPC 601 and 603 microprocessors.1-3 While keeping the system interface compatible with the 601 microprocessor, we improved upon it by incorporating a phase-locked loop and an IEEE-Std 1149.1 boundary-scan (JTAG) interface on chip. In addition, an advanced machine organization delivers one and a half to two times the 601's integer performance.

Performance strategy

Processor performance depends on three factors: the number of instructions in a task, the number of cycles a processor takes to complete the task, and the processor's frequency.4,5 Our architecture, which we optimized to produce compact code while adhering to the reduced instruction set computer (RISC) philosophy, addresses the first factor. The high instruction execution rate and clock frequency address the other two factors. The 604 provides deep pipelines, multiple execution units, register renaming, branch prediction, speculative execution, and serialization.

Six-stage superscalar pipeline. As shown in Figure 1, this deep pipeline enables the 604 to achieve its 100-MHz design. The stages are

Fetch. This stage translates an instruction fetch address and accesses the instruction cache for up to four instructions.

Decode. Instruction decoding determines needed resources, such as source operands and an execution unit.

Dispatch. When the resources are available, dispatch sends instructions to a reservation station in the appropriate execution unit. A reservation station permits an instruction to be dispatched before all of its operands are available.6 As they become available, the reservation station forwards operands to the execution units. Dispatch can send up to four instructions in program order (in-order dispatch) to four of six execution units: two single-cycle integer, a multicycle integer, a load/store, a floating-point, and a branch unit.

Issue/execute. In each execution unit, this stage issues one instruction from its reservation station and executes it to produce results. The instructions can execute out of program order (out-of-order execution) across the six execution units as well as within an execution unit that has an out-of-order issue reservation station. Table 1 lists the latency and throughput of the execution stages.

Completion. An instruction is said to be finished when it passes the execute stage. A finished instruction can be completed 1) if it does not cause an exception condition and 2) when all instructions that appear earlier in program order complete. This is known as in-order completion.

Write back. This stage writes the results of completed instructions to the architectural state (or the state that is visible to the programmer). Bypass logic permits most instructions to complete and write back in one cycle.
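To make the dispatch stage concrete, the following minimal C sketch models in-order dispatch of up to four instructions per cycle, one per execution unit. The unit names and the dispatch width come from the description above; the data structures and function names are assumptions made only for illustration, not a description of the 604's actual logic.

/* Minimal model of the 604's in-order dispatch rule (illustrative only).
 * Unit names follow the article; everything else is an assumption. */
#include <stdbool.h>
#include <stdio.h>

enum unit { SCIU0, SCIU1, MCIU, LSU, FPU, BPU, NUM_UNITS };

struct insn {
    enum unit unit;   /* execution unit chosen for this instruction */
    int       id;     /* program-order tag */
};

/* Dispatch up to four instructions per cycle, in program order.  Stop at
 * the first instruction whose execution unit has already accepted an
 * instruction this cycle (one instruction per unit, in-order dispatch). */
static int dispatch_cycle(const struct insn *buf, int n)
{
    bool unit_busy[NUM_UNITS] = { false };
    int dispatched = 0;

    for (int i = 0; i < n && dispatched < 4; i++) {
        if (unit_busy[buf[i].unit])
            break;                 /* later instructions must wait a cycle */
        unit_busy[buf[i].unit] = true;
        printf("dispatch insn %d to unit %d\n", buf[i].id, buf[i].unit);
        dispatched++;
    }
    return dispatched;
}

int main(void)
{
    struct insn window[] = {
        { SCIU0, 0 }, { LSU, 1 }, { SCIU1, 2 }, { SCIU0, 3 }  /* last one stalls */
    };
    dispatch_cycle(window, 4);
    return 0;
}

The sketch ignores the other dispatch restrictions (resource availability, stopping after a branch), which the article returns to later.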



Figure 1. Pipeline description (branch, integer, load/store, and floating-point pipelines).

Table 1. 604 execution timings.

Instruction                               Latency   Throughput
Most integer                                  1         1
Integer multiply (32x32)                      4         2
Integer multiply (others)                     3         1
Integer divide                               20        19
Integer load                                  2         1
Floating-point load                           3         1
Store                                         3         1
Floating-point multiply-add                   3         1
Single-precision floating-point divide       18        18
Double-precision floating-point divide       31        31

Although some designs use even deeper pipelines to achieve higher clock frequencies than the 604 does, we felt that such a design point does not suit today's personal computers. It relies too heavily on one of, or a combination of, a very large on-chip cache, a wide data bus, or a fast memory system to deliver its performance. It would be less than competitive in today's cost-sensitive market.

Precise interrupts and register renaming. Most programmers expect a pipelined processor to behave as a nonpipelined processor, in which one instruction goes through the fetch to write-back stages before the next one begins. A processor meets that expectation if it supports precise interrupts, in which it stops at the first instruction that should not be processed. When it stops (to service an interrupt), the processor's state reflects the results of executing all instructions prior to the interrupt-causing instruction and none of the later instructions, including the interrupt-causing instruction. This is not a trivial problem to solve in multiple, out-of-order execution pipelines. An earlier instruction executing after a later instruction can change the processor's state to make later instruction processing illegal. Sohi gives a general overview of the design issues and solutions.7

The 604 uses a variant of the reorder buffer described by Smith and Pleszkun to implement precise interrupts.8 The 16-entry reorder buffer keeps track of instruction order as well as the state of the instructions. The dispatch stage assigns each instruction a reorder buffer entry as it is dispatched. When the instruction finishes execution, the execution unit records the instruction's execution status in the assigned reorder buffer entry. Since the reorder buffer is managed as a first-in/first-out queue, its examining order matches the instruction flow sequence. To enforce in-order completion, all prior instructions in the reorder buffer must complete before an instruction can be considered for completion. The reorder buffer examines four entries every cycle to allow the completion of up to four instructions per cycle.

Unlike Smith and Pleszkun's reorder buffer, the 604's reorder buffer does not store instruction results. Temporary buffers hold them until the instructions that generated them complete. At that time, the write-back stage copies the results to the architectural registers. The 604 renames registers to achieve this; instead of writing results directly to specified registers, they are written to rename buffers and later copied to specified registers. Since instructions can execute out of order, their results can also be produced and written out of order into the rename buffers. The results are, however, copied from the buffers to the specified registers in program order.
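The reorder buffer mechanism can be summarized in a few lines of C. The sketch below models a 16-entry, first-in/first-out reorder buffer in which dispatch allocates entries in program order, execution units mark entries finished, and an exception is recognized only when the faulting instruction becomes the oldest one outstanding. Field and function names are assumptions of this sketch, not the 604's implementation.

/* Simplified sketch of a 16-entry FIFO reorder buffer (names are assumptions). */
#include <stdbool.h>

#define ROB_ENTRIES 16

struct rob_entry {
    bool valid;       /* entry holds a dispatched instruction          */
    bool finished;    /* instruction has passed its execute stage      */
    bool exception;   /* execution reported an exception condition     */
};

struct rob {
    struct rob_entry e[ROB_ENTRIES];
    int head;         /* oldest instruction (next to complete)         */
    int tail;         /* next free entry (allocated at dispatch)       */
    int count;
};

/* Dispatch: allocate the next entry in program order; returns -1 if full. */
static int rob_alloc(struct rob *r)
{
    if (r->count == ROB_ENTRIES)
        return -1;
    int idx = r->tail;
    r->e[idx] = (struct rob_entry){ .valid = true };
    r->tail = (r->tail + 1) % ROB_ENTRIES;
    r->count++;
    return idx;
}

/* Execute: the execution unit records its status in the assigned entry. */
static void rob_finish(struct rob *r, int idx, bool exception)
{
    r->e[idx].finished = true;
    r->e[idx].exception = exception;
}

/* Completion: because the queue is examined in FIFO order, an exception is
 * taken only when the faulting instruction is the oldest one left, so all
 * earlier instructions have already updated the architectural state. */
static bool rob_head_raises_exception(const struct rob *r)
{
    const struct rob_entry *h = &r->e[r->head];
    return r->count > 0 && h->finished && h->exception;
}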
Register renaming minimizes architectural resource dependencies, namely the output dependency (or write-after-write hazard) and antidependency (or write-after-read hazard), that would otherwise limit opportunities for out-of-order execution.9

Figure 2 (next page) depicts the format of a rename buffer entry. The 604 contains a 12-entry rename buffer for the general-purpose registers (GPRs) that are used for 32-bit integer operations. The 604 allocates a GPR rename buffer entry upon dispatch of an instruction that modifies a GPR.


Figure 2. Rename buffer entry format.

The dispatch stage writes the destination register number of the instruction to the Reg num field, sets the Rename valid bit, and clears the Result valid bit. When the instruction executes, the execution unit writes its result to the Result field and sets the Result valid bit. After the instruction completes, the write-back stage copies its result from the rename buffer entry to the GPR specified by the Reg num field, freeing the entry for reallocation. For a load-with-update instruction that modifies two GPRs, one for the load data and another for the address, the 604 allocates two rename buffer entries.

Register renaming complicates the process of locating the source operands for an instruction since they can also reside in rename buffers. In dispatching an integer instruction, the dispatch stage searches for its source operands simultaneously in the GPR file and its rename buffer. If a source operand has not been renamed, the processor uses the value read from the GPR file. If a rename exists (indicated by an entry with the Rename valid bit set and its Reg num field matching the source register number), the Result in the rename buffer is used. It is, however, possible that the result is not yet valid because the instruction that produces the GPR has not yet executed. The dispatch stage still dispatches the instruction, since the operand will be supplied by the reservation station when the result is produced. The dispatched instruction contains the rename buffer entry identifier in place of the operand. The GPR file and its rename buffer provide eight read ports for source operands to support dispatching of four integer instructions each cycle.
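The dispatch-time operand search can be pictured with the hypothetical C sketch below, which uses the Figure 2 fields (Reg num, Rename valid, Result valid, Result). The rule that the youngest matching rename entry supplies the operand, and all of the names, are assumptions made for illustration only.

/* Hypothetical sketch of GPR rename buffer lookup at dispatch time. */
#include <stdbool.h>
#include <stdint.h>

#define GPR_RENAME_ENTRIES 12

struct rename_entry {
    bool     rename_valid;   /* entry maps an architectural GPR          */
    bool     result_valid;   /* the producing instruction has executed   */
    uint8_t  reg_num;        /* destination GPR number                   */
    uint32_t result;         /* value, once produced                     */
};

struct operand {
    bool     ready;          /* true: value holds the operand            */
    uint32_t value;
    int      rename_id;      /* otherwise: wait for this rename entry    */
};

/* Search the GPR file and the rename buffer at the same time.  If the
 * rename entry's result is not valid yet, hand the reservation station
 * the entry identifier instead of a value. */
static struct operand lookup_source(const struct rename_entry *rb,
                                    const uint32_t *gpr_file, uint8_t reg)
{
    int match = -1;
    for (int i = 0; i < GPR_RENAME_ENTRIES; i++)
        if (rb[i].rename_valid && rb[i].reg_num == reg)
            match = i;       /* assumption: a later index is a younger rename */

    if (match < 0)
        return (struct operand){ .ready = true, .value = gpr_file[reg] };
    if (rb[match].result_valid)
        return (struct operand){ .ready = true, .value = rb[match].result };
    return (struct operand){ .ready = false, .rename_id = match };
}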
The 604 also uses a rename buffer for floating-point registers (FPRs) and one more for the condition register (CR). The FPR rename buffer has eight 90-bit-wide entries to hold a double-precision result with its data type and exception status. The FPR file and its rename buffer provide three read ports for dispatching one floating-point instruction per cycle. In addition to compare instructions, most integer and floating-point instructions can also generate negative, positive, zero, and overflow condition results. One of the eight fields in the 32-bit CR stores these 4-bit condition results. The 604 treats each field as a 4-bit register and applies register renaming using an eight-entry CR rename buffer.

Branch prediction and speculative execution. Because today's application software contains a high percentage of branch instructions, correctly predicting the outcome of these instructions is crucial to keeping the multiple instruction pipelines flowing and to achieving two to three times the execution rate of scalar processors. The 604 uses dynamic branch prediction in the fetch, decode, and dispatch stages to predict as well as correct branch instructions early.

The 604's speculative execution strategy complements its branch prediction mechanisms. The strategy is to fetch and execute beyond two unresolved branch instructions. The results of these speculatively executed instructions reside in rename buffers and in other temporary registers. If the prediction is correct, the write-back stage copies the results of speculatively executed instructions to the specified registers after the instructions complete.

Upon detection of a branch misprediction, the 604 takes quick action to recover in one cycle. It selectively cancels the instructions that belong in the mispredicted path from the reservation stations, execution units, and memory queues. It also discards their results from the temporary buffers. In addition, the processor restores its previous state to start executing from the correct path even before the mispredicted branch and its earlier instructions have completed. Since the 604 detects a branch misprediction many cycles before the branch instruction completes, its fast recovery scheme helps to maintain performance of those applications with high data cache miss rates and whose branches are difficult to predict.

Serialization. A serialization mechanism delays execution of certain instructions that would otherwise be expensive to execute speculatively in the 604's multiple-pipeline, out-of-order execution design. This mechanism delays infrequently used instructions until they can safely execute while permitting later instructions to execute. Some examples are the move to and from special-purpose register instructions, the extended arithmetic instructions that read the carry bit, and the instructions that directly operate on the CR, which the PowerPC architecture provides for calculating complex branch conditions. This mechanism also controls store instructions since it is difficult to undo stores.

The dispatch stage sends a serialized instruction to the proper execution unit with an indication that it should not be executed. When all prior instructions have completed and updated their results to the architectural state, the completion stage allows the serialized instruction to execute. Once the serialized instruction is dispatched, dispatch continues to dispatch the following instructions so they can execute before the serialized instruction. When the serialized instruction is completed, the later instructions also complete upon finishing execution. This minimizes the penalty of serialized instructions.
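One way to picture the serialization mechanism is as a flag carried with the dispatched instruction, as in the hypothetical C sketch below: the instruction sits in its reservation station marked "do not execute," and the completion logic releases it once every earlier instruction has completed and updated the architectural state. All structure and function names here are illustrative, not the 604's control signals.

/* Illustrative sketch of execution serialization (names are assumptions). */
#include <stdbool.h>

struct rs_entry {
    bool valid;
    bool serialized;      /* dispatched with a "do not execute yet" indication */
    bool execute_ok;      /* completion stage has released it                  */
    int  rob_index;       /* reorder buffer entry assigned at dispatch         */
};

/* Called by the completion logic each cycle: a serialized instruction may
 * execute only once it is the oldest instruction left in the machine, that
 * is, all prior instructions have completed and written back. */
static void release_serialized(struct rs_entry *e, int oldest_rob_index)
{
    if (e->valid && e->serialized && e->rob_index == oldest_rob_index)
        e->execute_ok = true;
}

/* Issue check used by the execution unit's reservation station. */
static bool may_issue(const struct rs_entry *e)
{
    return e->valid && (!e->serialized || e->execute_ok);
}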



Machine organization

Figure 3 shows the fetch address generation logic. The fetch stage selects an address from the addresses generated in the different pipeline stages each cycle. Since an address generated in a later stage belongs to an earlier instruction, its selection precedes an address from an earlier stage.

The completion stage detects exception conditions and generates an exception handler address. This stage also updates the program counter (PC) with the target address of a taken branch instruction, or advances it by the number of instructions being completed. The branch execute stage may correct the instruction fetch with the branch target address if the branch is mispredicted as not taken, and with the sequential address if the branch is mispredicted as taken. The dispatch and decode stages may change the fetch address with either the target or sequential address of a branch instruction being predicted. There are two copies of the target and sequential address registers in the decode, dispatch, and execute stages, since there can be up to two branch instructions in each stage. The completion stage also has two target registers to handle up to two finished branch instructions.

If the fetch address hits in the branch target address cache (BTAC), the target address becomes the fetch address. Otherwise, the instruction fetch continues sequentially. The 64-entry, fully associative BTAC holds the target addresses of the branches that are predicted to be taken. If a branch is predicted as not taken for its next encounter, the branch execute stage removes it from the BTAC. The BTAC is accessed with the fetch address, and not with a branch instruction address, providing a zero-cycle fetch penalty for taken branches. Although there may be multiple branch instructions in the four instructions being fetched, the BTAC provides the target address of the first-predicted taken branch instruction.

Figure 3. Instruction fetch address generation logic. (BHT: branch history table; BTAC: branch target address cache; FAR: fetch address register; PC: program counter; SEQ: sequential address.)

Instruction decode and dispatch. The pipeline decodes four instructions every cycle to determine exception conditions, as well as the resources needed by the instructions. The resources include the execution unit, source operands, and destination registers. Decoding the instructions before the dispatch stage simplifies the dispatch logic without using predecoded bits in the instruction cache. Predicting branch instructions in the first two entries of the decode buffer minimizes the performance penalty of adding the decode stage.

When the decode stage detects an unconditional branch that was not in the BTAC, it corrects the instruction fetch to the target address of the branch. It also predicts conditional branches with the execution history found in the branch history table (BHT).10 Each entry of the 512-entry BHT denotes one of four history states: strong not taken, weak not taken, weak taken, and strong taken. The table updates the history state with the actual outcome of the branch that is mapped to the entry. To simplify the design, each entry in the BHT maps to every 512th instruction address. This allows the BHT to be accessed with the fetch address and to return the four entries mapped to the four instructions starting with the fetch address.
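The four history states behave like a 2-bit saturating counter. The following C sketch is a conventional model of such a predictor for a 512-entry table; the indexing shown (instruction word address modulo 512) is an assumption consistent with, but not stated by, the mapping described above.

/* Conventional 2-bit branch history model of the four prediction states. */
#include <stdbool.h>
#include <stdint.h>

enum bht_state { STRONG_NOT_TAKEN, WEAK_NOT_TAKEN, WEAK_TAKEN, STRONG_TAKEN };

static enum bht_state bht[512];

static unsigned bht_index(uint32_t addr)
{
    return (addr >> 2) & 0x1ff;   /* 512 entries, one per instruction word (assumed) */
}

static bool bht_predict_taken(uint32_t addr)
{
    return bht[bht_index(addr)] >= WEAK_TAKEN;
}

/* Update with the actual outcome: move one step toward the observed
 * direction, saturating at the two strong states. */
static void bht_update(uint32_t addr, bool taken)
{
    enum bht_state *s = &bht[bht_index(addr)];
    if (taken && *s < STRONG_TAKEN)
        (*s)++;
    else if (!taken && *s > STRONG_NOT_TAKEN)
        (*s)--;
}

Because consecutive instruction words map to consecutive entries, one access with the fetch address can return the four entries for the four instructions being fetched, as the article notes.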
Not all conditional branch instructions use the BHT. The architecture provides a count register (CTR) value as a branch condition to support loops in programs. Only the conditional branch instructions that do not depend on the CTR value use, as well as update, the BHT. Those that do depend on the CTR are predicted based on the value of the shadow CTR. The shadow CTR holds a future state of the CTR that is updated by speculatively executed move-to-CTR or branch-and-decrement instructions. This prediction scheme is very effective on branches that control loop iteration.


Authorized licensed use limited to: Universitaetsbibliothek der TU Wien. Downloaded on March 12,2020 at 15:55:02 UTC from IEEE Xplore. Restrictions apply. updated by speculatively executed move-to-CTR or branch- vation stations and GPR rename buffer that are connected to and-decrement instructions. This prediction scheme is very them. Each GPR reservation station entry monitors all four effective on branches that control loop iteration. GPR result buses for any missing A or B operands, denoted The dispatch buffer sends up to four instructions to four of as A op and B op in the figure. When an execution unit the six execution units each cycle. As space allows, more returns a result and the associated GPR rename buffer entry instructions advance from the decode stage into the four- identifier, the reservation station compares the identifier entry dispatch buffer. The 604 places only a few restrictions against those in its entries. When a match is found, it forwards on dispatch to enable a high-speed implementation.They are the result to the waiting instruction. For returning the update address of a load-with-update instruction while executing 1) One instructionper execution unit. Since each execu- one load instruction per cycle, the load/store unit shares the tion unit can start only one instruction per cycle and an result bus of the less frequently used multicycle integer unit. instruction can bypass the reservation station if the exe- It is interesting to note why the 604 uses a reorder buffer, cution unit is available, dispatching one instruction per rename buffers, and reservation stations to provide the same unit simplifies the logic without imposing an undue per- functions that a DRIS (deferred-scheduling,register-renaming formance penalty. Two identical single-cycle integer instruction shelf) in the Metaflow architecture provides.” A units handle the more frequent instructions. DRIS entry consists of instruction status fields that a reorder 2) Resources available. Each instruction needs a reorder buffer entry would have, source operand fields that a reser- buffer entry, a reservation station entry in the appro- vation station entry would have, and destination fields that a priate execution unit, and rename buffer entries to hold rename buffer entry would have. (The 604’s reservation sta- its results. Available resources depend on the state of tion entry uses a separate source operand to store either an the instructions previously dispatched as well as those immediate or a copy of the source operand. Although the currently being dispatched. DRIS figure in the Metaflow article shows only the ID field to 3) Stop dispatch afer branch. Instructions following a indicate the DRIS entry with the source operand, it is likely branch instruction are not dispatched in the same cycle that another field is needed to store an immediate operand.) as the branch is dispatched. This restriction simplifies Two disadvantages of the DFUS had we used it in the 604 saving the processor state, which allows immediate can- design are celing of speculatively executed instructions that follow predicted branches. Scheduling overhead. The DRIS instruction scheduling 4) Zn-order dispatch. Dispatching instructions in order is more complicated than the 604’s dedicated reservation results in only a small cost in performance and greatly stations since the next instruction for an execution unit simplifies resource allocation and dispatch logic. 
It is interesting to note why the 604 uses a reorder buffer, rename buffers, and reservation stations to provide the same functions that a DRIS (deferred-scheduling, register-renaming instruction shelf) in the Metaflow architecture provides.11 A DRIS entry consists of instruction status fields that a reorder buffer entry would have, source operand fields that a reservation station entry would have, and destination fields that a rename buffer entry would have. (The 604's reservation station entry uses a separate source operand field to store either an immediate or a copy of the source operand. Although the DRIS figure in the Metaflow article shows only the ID field to indicate the DRIS entry with the source operand, it is likely that another field is needed to store an immediate operand.) Two disadvantages of the DRIS, had we used it in the 604 design, are

Scheduling overhead. The DRIS instruction scheduling is more complicated than the 604's dedicated reservation stations since the next instruction for an execution unit must be the first "ready" instruction of the "right" type in all DRIS entries.

Single result type. The DRIS supports renaming of only one register type, whereas the 604 needs three. Say that more than one DRIS is used, as described in Popescu et al.,11 to support separate integer and floating-point registers. One of them would have to house all instructions to provide precise interrupts while not being able to provide register renaming. An alternative is to design one DRIS to accommodate the largest register type.

Execution units. The branch execution unit can hold two branch instructions in its reservation station and two more finished branch instructions. It serves to validate branch predictions made in earlier stages, and also verifies that the predicted target matches the actual target address. If a misprediction is detected, the branch execution unit redirects the fetch to the correct address and starts the branch misprediction recovery.

The 604 has a three-stage complex integer unit (CFX) to execute integer multiply, divide, and all move to and from special-purpose register instructions. The CFX can sustain one multiply instruction per cycle for 32x16-bit multiplies and those 32x32-bit multiplies whose B operand is representable as a 17-bit signed integer. It can sustain one multiply per two cycles for larger 32x32-bit operands. The CFX also uses the multiply pipeline stages to execute a divide instruction in 20 cycles. The 604's two simple integer units execute all other integer instructions in one cycle.
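The multiply throughput rule amounts to a range test on the B operand, as the small C helper below illustrates: one multiply per cycle when B fits in a 17-bit signed integer (the range -65,536 to 65,535), otherwise one every two cycles. The function names are, of course, not part of the hardware.

/* Illustrative check for the CFX multiply throughput rule described above. */
#include <stdint.h>

/* A value is representable as a 17-bit signed integer if it lies in
 * [-2^16, 2^16 - 1], that is, [-65536, 65535]. */
static int fits_17bit_signed(int32_t b)
{
    return b >= -65536 && b <= 65535;
}

/* Cycles between successive 32x32-bit multiplies (throughput, not latency). */
static int multiply_repeat_rate(int32_t b_operand)
{
    return fits_17bit_signed(b_operand) ? 1 : 2;
}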


Figure 4. GPR result buses, reservation stations, and rename buffer.

The three-stage floating-point unit can sustain one double-precision multiply-add per cycle, one single-precision divide every 18 cycles, or one double-precision divide every 31 cycles. It complies fully with the IEEE-Std 754 floating-point arithmetic standard. The 604 provides hardware support for denormalization, exceptions, and three graphics instructions. It also provides a non-IEEE mode for graphics support. The non-IEEE mode converts a denormalized result to zero to avoid prenormalization in subsequent operations.

Instruction completion and write back. After an instruction executes, the execution unit copies results to its rename buffer entries and the execution status to its reorder buffer entry. Among other things, the execution status indicates whether the instruction finished execution without an exception. Of the four reorder buffer entries examined every cycle, up to four instructions that finished without an exception complete in program order.

Other than the in-order completion necessary to support precise exceptions, the 604 imposes only a few additional restrictions on instruction completion. They are

1) Stop before a store instruction. Since a store data operand is read from the register file in the completion stage, a store instruction cannot complete if its store operand is still in the rename buffer. Stopping completion before a store instruction allows the store operand to be written to the register file, even if it is produced by an instruction currently being completed.

2) Stop after a taken branch instruction. Since a taken branch instruction sets the program counter to its target address, it is speed critical to advance the program counter from the new target address by the number of instructions completed after the taken branch in one cycle. Stopping completion after a taken branch instruction avoids this logic altogether.

To minimize the effects of long execution latency on in-order completion, the completion stage overlaps with the last execution cycle for those instructions with a multicycle execution stage. These include the multiply, divide, store, load miss, and execution-serialized instructions. A store instruction completes as soon as it is translated without an exception. Similarly, a load instruction that misses in the data cache completes upon translation without an exception. Since the reservation stations can forward the load data when it is available to the dependent instructions, the load miss can safely be completed.

Most superscalar designs impose additional restrictions due to a limited number of ports to register files. For instance, four write ports would be required to complete up to four instructions if each one can update one register. The 604 GPR file would require eight write ports to complete four load-with-update instructions per cycle. The 604 avoids this problem by decoupling instruction completion from register file updates using the write-back stage. Instructions complete without regard to the type or number of registers they update. The completion stage updates their results if ports are available; if not, the write-back stage updates them. The rename buffer entries function as temporary buffers for those instructions that are not completed and as write-back buffers for those that are. All three GPR, FPR, and CR rename buffers contain two read ports for write back. Correspondingly, the three register files have two write ports for write back.
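These completion rules combine into a single per-cycle scan of the reorder buffer. The simplified, hypothetical C sketch below completes up to four finished, exception-free instructions in program order, allows a store to complete only as the first instruction of the group, and stops after completing a taken branch; field names are assumptions, and the write-back port bookkeeping discussed above is omitted.

/* Simplified per-cycle completion scan (field and function names are assumptions). */
#include <stdbool.h>

struct completion_entry {
    bool finished;
    bool exception;
    bool is_store;
    bool is_taken_branch;
};

/* Examine up to four entries starting at the reorder buffer head; return
 * how many complete this cycle. */
static int complete_cycle(const struct completion_entry *head, int n_valid)
{
    int done = 0;
    for (int i = 0; i < 4 && i < n_valid; i++) {
        const struct completion_entry *e = &head[i];
        if (!e->finished || e->exception)
            break;                 /* in-order completion, precise exceptions */
        if (e->is_store && done > 0)
            break;                 /* stop before a store instruction         */
        done++;
        if (e->is_taken_branch)
            break;                 /* stop after a taken branch instruction   */
    }
    return done;
}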
Memory operations

High-speed superscalar processors require greater memory bandwidth to sustain their performance. The 604 meets the increased demand with large on-chip caches, nonblocking memory operations, and a high-bandwidth system interface. The 604 takes advantage of the weakly ordered memory model, to which the PowerPC architecture subscribes, to offer efficient memory operations. Although loads and stores that hit in the data cache can bypass earlier loads and stores, program-order memory access can be enforced with instructions provided for this purpose.

Load/store unit. Figure 5 (next page) shows a block diagram of the load/store unit and the memory queues. This unit has a two-cycle execution stage. It calculates the memory address and translates that address with a 128-entry, two-way set-associative translation lookaside buffer (TLB) in the first cycle. The second cycle processes loads, making a speculative cache access and aligning the data when the access hits in the cache. The pipelined execution stage executes one load or store instruction per cycle.


In the first half cycle, the load/store unit calculates a load instruction's memory address, denoted as EA in the figure. It translates it into the real address, denoted as RA in the figure, and the data cache access begins in the second half cycle. If the access hits in the cache, the unit aligns the data and forwards it to the rename buffer and the execution units in the second cycle. If the access misses, the unit places the instruction and its real address in the four-entry load queue. When a load miss completes, it accesses the cache a second time. If the load is still a miss, the unit holds it as a pending load miss while reloading the missing cache line. This permits a second load miss to access the cache and to initiate the second cache line reload before the first is brought in.

Figure 5. Load/store unit and memory queues.

The unit calculates the memory address of a store and translates it in the first cycle. It does not write the data to the cache, however, until after the store instruction completes. The unit places the instruction and its real address in the six-entry store queue. Since the data cache is not accessed in the second cycle, it is available for an earlier store from the store queue (if necessary) or a load miss from the load queue (if necessary).

When a store instruction completes, the load/store unit marks it completed in the store queue so that instruction completion can continue without waiting for storage to the cache or memory. If the store hits in the cache, the unit writes it to the cache and removes it from the store queue. If the store is a miss, the unit will bypass it in the store queue to allow later stores to take place while cache reloading proceeds. Multiple store misses can be bypassed in the store queue.

Figure 6 shows the store queue structure. Four pointers identify the state of the store instructions in the circular store queue. When a store has finished execution (or successful translation), the load/store unit places it in the finished state. When it completes, the finish pointer advances to place it in the completed state. When it is committed to cache or memory, the completion pointer advances to place it in the committed state. If the store hits in the cache, advancing the commit pointer removes it from the queue. If the store is a miss, the commit pointer does not advance until the missing cache line is reloaded and the store is written to the cache. While the cache line is being reloaded, the next store indicated by the completion pointer can access the cache. If this store hits in the cache, the unit removes it from the queue. If it misses, another cache line reload begins.
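The four pointers walk the circular queue in the order allocate, finish, complete, commit. The C sketch below is a minimal, hypothetical model of that progression for the six-entry queue; it tracks only the per-entry state and the pointers, not the store data, and Figure 6's exact pointer-to-state assignment is simplified.

/* Minimal model of the four-pointer circular store queue (names are assumptions). */
#define STORE_QUEUE_ENTRIES 6

enum store_state { SQ_FREE, SQ_ALLOCATED, SQ_FINISHED, SQ_COMPLETED };

struct store_queue {
    enum store_state state[STORE_QUEUE_ENTRIES];
    int alloc;        /* next entry for a dispatched store              */
    int finish;       /* oldest store not yet finished (translated)     */
    int complete;     /* oldest store not yet completed                 */
    int commit;       /* oldest store not yet written to cache/memory   */
};

static int next(int i) { return (i + 1) % STORE_QUEUE_ENTRIES; }

/* Each pointer advances only past the preceding state, so a store moves
 * through allocated -> finished -> completed -> committed in order. */
static void advance_finish(struct store_queue *q)     /* translation done      */
{
    if (q->state[q->finish] == SQ_ALLOCATED) {
        q->state[q->finish] = SQ_FINISHED;
        q->finish = next(q->finish);
    }
}

static void advance_complete(struct store_queue *q)   /* instruction completed */
{
    if (q->state[q->complete] == SQ_FINISHED) {
        q->state[q->complete] = SQ_COMPLETED;
        q->complete = next(q->complete);
    }
}

static void advance_commit(struct store_queue *q, int cache_hit)
{
    /* A store that hits in the cache leaves the queue; on a miss the commit
     * pointer waits until the line is reloaded and the store is written. */
    if (q->state[q->commit] == SQ_COMPLETED && cache_hit) {
        q->state[q->commit] = SQ_FREE;
        q->commit = next(q->commit);
    }
}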
Caches. The 604 provides separate instruction and data caches to allow simultaneous instruction and data accesses every cycle. Both 16-Kbyte caches provide parity protection and a four-way set-associative organization with 32-byte lines. They are indexed with physical addresses, have physical tags, and use the least recently used replacement algorithm.

The instruction cache provides a 16-byte interface to the fetch unit to support the four-instruction dispatch design. This nonblocking cache allows subsequent instructions to be fetched while a prior cache line is being reloaded. This design is particularly beneficial if the missing cache line belongs in a mispredicted path, since it allows the correct instructions to be accessed immediately after a branch misprediction recovery. The instruction cache also provides streaming, a mechanism to forward instructions as they are received from off-chip cache or memory.

The instruction cache does not maintain coherency; instead, the architecture provides a set of instructions for software to manage coherency. In particular, the instruction cache block invalidate (ICBI) instruction causes all copies of the addressed cache line to be invalidated in the system. The ICBI generates an invalidation request, to which all coherent caches must comply.

The data cache contains a 64-bit data interface to the load/store unit for data access, the MMU for tablewalking, and the bus interface unit (BIU) for cache line reloading and snooping. (The architecture specifies an algorithm to traverse page table entries that define the virtual-to-physical memory mappings. "Tablewalk" refers to a hardware implementation of the algorithm.) The data cache's two copy-back buffers support nonblocking cache operations. A copy-back buffer holds a dirty (modified) cache line that is being replaced or that hits on a snoop request. The data cache moves an entire cache line into a copy-back buffer in one cycle to minimize the cycles the cache is unavailable.
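The stated geometry fixes how a physical address is split for either cache: with 16 Kbytes, four ways, and 32-byte lines there are 16,384 / (4 x 32) = 128 sets, so an address decomposes into a 5-bit line offset, a 7-bit set index, and a tag. The C helper below shows that decomposition; it follows from the geometry alone and is not a statement about the 604's array design.

/* Address decomposition for a 16-Kbyte, 4-way set-associative cache with
 * 32-byte lines: 128 sets, 5 offset bits, 7 index bits. */
#include <stdint.h>

#define LINE_BYTES   32u
#define WAYS         4u
#define CACHE_BYTES  (16u * 1024u)
#define SETS         (CACHE_BYTES / (WAYS * LINE_BYTES))   /* 128 */

struct cache_addr {
    uint32_t tag;
    uint32_t set;      /* 0..127 */
    uint32_t offset;   /* 0..31  */
};

static struct cache_addr decompose(uint32_t paddr)
{
    struct cache_addr a;
    a.offset = paddr & (LINE_BYTES - 1);
    a.set    = (paddr / LINE_BYTES) % SETS;
    a.tag    = paddr / (LINE_BYTES * SETS);
    return a;
}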


Authorized licensed use limited to: Universitaetsbibliothek der TU Wien. Downloaded on March 12,2020 at 15:55:02 UTC from IEEE Xplore. Restrictions apply. Allocation pointer Finished Finish pointer Y/ Completed Completion pointer i’ (cast out) Snoop-read (cast out) Commit pointer Cache-line clean Figure 6.Store queue structure. (Modified] Store-hit

The MESI (modified, exclusive, shared, invalid) coherency protocol defined in the PowerPC 60x Processor Interface Specification keeps the data cache coherent.12 A duplicated data tag array supports a two-cycle snoop response with minimum performance impact on normal cache operations. To further support nonblocking cache operations, we've extended the MESI coherency protocol with one more state. As illustrated in the simplified state diagram in Figure 7, the protocol assigns the new state, Allocated, to the block selected to hold the missing cache line. All necessary information for this particular miss, including the address and set number, remains in the memory queue and later completes the cache line reload. The cache line becomes either shared or exclusive, based on the coherency response.

Figure 7. Simplified data cache state diagram.
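Viewed schematically, the extension adds one transient state to MESI for an outstanding miss. The C fragment below captures only the transition described above: a miss places the selected block in Allocated, and when the reload completes the coherency response decides between Shared and Exclusive. The remaining MESI transitions of Figure 7 are omitted, and all names are illustrative.

/* Schematic of the extended coherency states (MESI plus Allocated). */
enum line_state { INVALID, SHARED, EXCLUSIVE, MODIFIED, ALLOCATED };

/* On a cache miss, the block chosen to receive the line enters Allocated;
 * the miss information stays in the memory queue until the reload finishes. */
static enum line_state begin_reload(void)
{
    return ALLOCATED;
}

/* When the reload completes, the coherency response decides whether the
 * line is held Shared (another cache also has a copy) or Exclusive. */
static enum line_state finish_reload(int other_cache_has_copy)
{
    return other_cache_has_copy ? SHARED : EXCLUSIVE;
}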
The software can individually disable, invalidate, or lock both the instruction and data caches. While a cache is disabled, all accesses bypass it and directly access the off-chip cache or memory. While a cache is locked, it is accessible but its contents cannot be replaced. All cache misses, in this case, are accessed from off chip as cache-inhibited accesses. Coherency is, however, maintained even when a cache is locked. The data cache supports the cache touch instructions, which initiate reloading of the specified cache line if it is not in the cache. These instructions can effectively reduce cache miss rates and latency.

Bus interface unit and memory queues. The 604's BIU implements the PowerPC 60x Processor Interface Specification to ensure bus compatibility with the 601 and 603 microprocessors. A split transaction mode allows the address bus to operate independently of the data bus, freeing the data bus during memory wait states. To support the split transaction mode, the BIU uses the address and data buses only during what are known as address tenure and data tenure cycles. The BIU also provides a pipelined mode, in which up to three address tenures can be outstanding before data for the first address is received. If permitted, the BIU will complete one or more write transactions between the address and data tenures of a read transaction. Byte parity protects its 32-bit address and 64-bit data buses.

Figure 8 (next page) shows the address and data queues that implement the split and pipelined transaction modes. Four types of memory queues support the four types of operations: line fill, write, copy back, and cache control. For a line-fill operation, the line-fill address from either the instruction or data cache remains in the memory address queue until the address can be sent out in an address tenure. After the address tenure, the address transfers to the line-fill address queue. This releases the address bus for other transactions in the split transaction mode. As each double word for the cache line returns, it moves to the line-fill buffer and also forwards to the load/store unit.

During a write operation the address stays in the memory address queue, and the data in the write buffer, until both the address and data can move out in a write transaction. The size of a write transaction can vary from 1 to 8 bytes to handle non-double-word-aligned writes. Similarly, during a copy-back operation, the address remains in the copy-back address queue, and the data in the copy-back buffer, until both the address and data transfer in a 32-byte burst write transaction. For a cache control instruction or a store to a shared cache line, the address stays in the cache control address queue until an address-only transaction broadcasts the cache control command. Since all address queues in the 604 are considered part of the coherent memory system, the BIU checks them against data cache and snoop addresses to ensure data consistency and maintain the MESI coherency protocol.

System support features

The 604 provides several features to support robust system design, such as instruction and data address breakpoints and single-step and branch instruction tracing facilities for software debugging. It also provides performance-monitoring functions for the system to profile software performance without additional hardware. These functions can determine many key performance parameters, such as instruction execution rate, branch prediction rate, cache hit rates, and average cache miss latency.


The 604 design follows the level-sensitive scan design (LSSD) methodology to provide high test coverage. As required by LSSD rules, every storage element, except in arrays, connects to a scan chain that starts with a chip input pin and ends on a chip output pin. During test mode, storage elements in a scan chain behave as a shift register that can also capture inputs to exercise a sequential digital network in a combinational manner. The 604's common on-chip processor (COP) provides many functions to control and observe the storage elements. Some of the functions useful for chip and system debugging include setting instruction or data address breakpoints, single stepping, running N cycles, and reading and writing system memory locations as well as any storage element within the processor. The COP functions are implemented as an extension to the IEEE 1149.1 specification and are controlled entirely through that interface.

System designers can configure the 604 processor operating frequency as one, one and a half, two, or three times the system bus frequency. The on-chip phase-locked loop generates the necessary processor clocks from the bus clock. The 604 also provides a nap mode, which clocks only the external interrupt detection logic and the phase-locked loop. It enters nap mode under software control and exits from the mode upon detecting an interrupt. The 604 can still service snoop requests if the system asserts the RUN pin to run the clocks while in the nap mode. We estimate nap mode power consumption at less than 0.4 watts.

Figure 8. Address and data queue organization.

Table 2 lists some of the key physical characteristics of the 604. Figure 9 shows the 604 die photo.

Table 2. The 604 physical characteristics.

Technology           0.5-µm CMOS, 4 metal layers
Die size             196 mm², 12.4 x 15.8 mm
Transistor count     3.6 million
Cache size           16-Kbyte I-cache and D-cache
Voltage              3.3 V, 5-V I/O tolerant
Power dissipation    Less than 10 W at 100 MHz
Signal I/Os          171; CMOS/TTL compatible
Package              304-pin CQFP

DESIGNED TO MEET LOW-COST needs of the personal computer market, the 604 performs well with inexpensive, as well as expensive, memory systems. The 604's large on-chip caches help to maintain performance of well-behaved applications that exhibit localities. For those with erratic behaviors and access patterns, speculative execution guided by dynamic branch prediction helps to reduce on-chip cache miss latency. The nonblocking execution pipelines and the memory queues that decouple the pipelines from memory access further help to reduce the effects of cache misses. The split and pipelined modes use the system bus to provide greater bandwidth while maintaining compatibility with the 601 and 603 microprocessors.

References
1. E. Silha, "The PowerPC Architecture," IBM RISC System/6000 Technology: Volume II, IBM Corporation, Austin, Tex., 1993.
2. C. Moore, "The PowerPC 601 Microprocessor," IBM RISC System/6000 Technology: Volume II, IBM Corporation, 1993.
3. B. Burgess et al., "The PowerPC 603 Microprocessor: A High Performance, Low Power, Superscalar RISC Microprocessor," Proc. Compcon, IEEE Computer Society Press, Los Alamitos, Calif., 1994, pp. 300-306.
4. S. White et al., "How Does Processor MHz Relate to End-User Performance? Part 1 of 2," IEEE Micro, Aug. 1993, pp. 8-16.
5. S. White et al., "How Does Processor MHz Relate to End-User Performance? Part 2 of 2," IEEE Micro, Oct. 1993, pp. 79-89.
6. R. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM J. Research and Development, Vol. 11, Jan. 1967, pp. 25-33.
7. G. Sohi, "Instruction Issue Logic for High-Performance, Interruptible, Multiple Functional Unit, Pipelined Computers," IEEE Trans. Computers, Mar. 1990, pp. 349-359.
8. J. Smith and A. Pleszkun, "Implementation of Precise Interrupts in Pipelined Processors," Proc. 12th Ann. Int'l Symp. Computer Architecture, IEEE, Piscataway, N.J., 1985, pp. 36-44.
9. M. Johnson, Superscalar Microprocessor Design, Prentice Hall, Englewood Cliffs, N.J., 1991.
10. J. Lee and A. Smith, "Branch Prediction Strategies and Branch Target Buffer Design," Computer, Jan. 1984, pp. 6-22.
11. V. Popescu et al., "The Metaflow Architecture," IEEE Micro, June 1991, pp. 10-13, 63-73.
12. M.S. Allen et al., "Overview of the PowerPC Bus Interface," IEEE Micro, this issue, pp. 42-51.

Figure 9. Die photo of the PowerPC 604.

S. Peter Song is a senior engineer in the Systems Technology and Architecture Division of IBM. He led the definition of the 604 and later designed the speculative execution, completion, and exception control logic. Song holds BS, MS, and PhD degrees in electrical and computer engineering from the University of Texas at Austin. He is a member of the IEEE Computer Society, Eta Kappa Nu, and Tau Beta Pi.

Marvin Denman is a principal staff engineer in the RISC Microprocessor Division of Motorola, Inc. He contributed to the definition of the 604 microarchitecture and later designed the fetch and branch-processing logic. Denman holds a BS degree in computer science from Texas A&M University and an MS degree from the University of Texas at Austin. He is a member of the IEEE Computer Society.

Joe Chang's biography, photograph, and address appear on p. 51 of this issue.

Direct questions concerning this article to S. Peter Song, Somerset Design Center, 11400 Burnet Road, M/S 973, Austin, TX 78758; [email protected].
