The Powerpc 405 Core, Contact an IBM Microelectronics Sales Representative
Total Page:16
File Type:pdf, Size:1020Kb
The PowerPC 405TM Core IBM Microelectronics Division Research Triangle Park, NC 27709 11/2/98 Overview The PowerPC 405 CPU Core is a new addition to the 32-bit RISC PowerPC Embedded Processor family. The 405 Core possesses all of the qualities necessary to make system-on-a-chip designs a reality. This core occupies a small die area, consumes minimal power, and provides a high performance 100% PowerPC compatible platform capable of taming the most demanding embedded applications. Target Applications The 405 Core enables high performance designs in which low cost, low power and versatility are the critical selection criteria. Target market segments for the 405 core include: • Consumer video applications including digital cameras, video games, set-top boxes, and soft modems • Portable products such as cellular phones, PDAs, and handheld GPS receivers • Office automation products such as printer, X-terminals, and fax machines • Networking and storage products such as disk drive controllers, routers, ATM switches, high performance modems, and network cards Typical Application A typical system on a chip design with the 405 Core uses a two level bus structure for system level communication. High bandwidth peripherals and the 405 Core communicate with one another over the processor local bus (PLB). Less demanding peripherals share the on-chip peripheral bus (OPB) and communicate to the PLB through the OPB Bridge. The PLB and OPB provide common interfaces for peripherals and enable quick turnaround, custom solutions for high volume applications. Figure 2 shows an example 405 Core-based system on a chip, illustrating the two-level bus structure and modular core-based design. External Bus Interface MMU Ext. Bus PCI RAM/ROM/ Bridge SDRAM Master Ctr PCMCIA PowerPC Controller 405 CPU Controller ATM 622 SAR I-Cache Code Decompression D-Cache MAL USB Timers P L B JTAG Trace Ethernet OPB DMA PowerPC 405x3 Bridge Controller Embedded Core Arbiter OPB OCM Interrupt Arbiter Control Control APU 2 IC GPIO UART UART HDLC HDLC HDLC 4KB HDLC SRAM Figure 2. Example 405x3 core+ASIC PowerPC 405 Core Page 2 of 14 11/2/98 Specifications • Technology: 0.25 µm CMOS process (0.18 µm Leff) • Frequency: 0-200MHz (200MHz at WC process, 85C, 2.3V) – 276MHz nominal • Performance: 228 Dhrystone 2.1 MIPS@200MHz (est.) • Supply Voltage: 2.5V • Die Size: 2.0 mm2 for CPU only (est.) • Power (typ.): 400mW @ 200MHz (est.), CPU only • 3.04 mW/MHz with 16KB ICU/8KB DCU (est.) • Qualified operating range –40C to 125C, 2.3V to 2.7V Embedded Design Support The 405 Core, as a member of the PowerPC 400 Series, is supported by the IBM PowerPC Embedded ToolsTM program, in which over 80 third party vendors have combined with IBM to provide a complete tools solution. Development tools for the 405 Core include C/C++ compilers, debuggers, bus functional models and real-time operating systems. As part of the tools program, IBM maintains a complete set of development tools by offering the High C/C++ Compiler, RISCWatchTM debugger with RISCTraceTM, OS OpenTM real-time operating system, VHDL and Verilog simulation models and a 405 Core reference design kit. PowerPC 405 Core Page 3 of 14 11/2/98 405 CPU Core Organization PLB Master Instruction Interface OCM MMU 405 CPU Timers I-Cache I-Cache 3-Element (FIT, Array Controller Fetch Fetch PIT, Instruction Shadow and Queue Watchdog) Instruction TLB Decode (PFB1, Cache (4 Entry) Logic PFB0, Unit DCD) Timers & Cache Units Unified TLB (64 Entry) Debug Data Debug Logic Cache Unit Data Shadow Execute Unit (EXU) (4 IAC, TLB 2 DAC, (8 Entry) D-Cache D-Cache 32x32 2 DVC) Array ALU MAC Controller GPR PLB Master Data APU/FPU JTAG Instruction Interface OCM Trace Figure 1. PowerPC 405 CPU core block diagram 405 CPU The 405 CPU operates on instructions in a five stage pipeline consisting of a fetch, decode, execute, write- back, and load write-back stage. A five-stage pipeline increases throughput and makes operating at 200 MHz or more possible. Figure 1 contains a high-level block diagram of the 405 CPU core. Fetch and Decode Logic The fetch and decode logic maintains a steady flow of instructions to the execute unit by placing up to two instructions in the fetch queue. The fetch queue consists of three buffers: pre-fetch buffer 1 (PFB1), pre- fetch buffer 0 (PFB0) and decode (DCD). When the queue is empty, instructions flow directly into the decode stage. Static branch prediction as implemented on the 405 CPU core takes advantage of some standard statistical properties of code. Branches with negative address displacement are by default assumed taken. Branches that do not test the condition or count registers are also predicted as taken. The 405 CPU core bases branch prediction upon these default conditions when a branch is not resolved and speculatively fetches along the predicted path. The default prediction can be overridden by software at assembly or compile time. Branches are examined in the decode and pre-fetch buffer 0 fetch queue stages. Two branch instructions can be handled simultaneously. If the branch in decode is not taken, the fetch logic fetches along the predicted path of the branch instruction in pre-fetch buffer 0. If the branch in decode is taken, the fetch logic ignores the branch instruction in pre-fetch buffer 0. PowerPC 405 Core Page 4 of 14 11/2/98 Execution Unit The 405 CPU core has a single issue execute unit which contains the register file, arithmetic logic unit (ALU) and the multiply-accumulate (MAC) unit.. The execution unit performs all 32-bit PowerPC integer instructions in hardware and is compliant with the IBM PowerPC Embedded Environment specification. Table 1 lists instruction latencies and throughputs. The register file is comprised of thirty-two 32 bit general purpose registers (GPR) which are accessed with three read ports and two write ports. During the decode stage, data is read out of the GPR’s and feed to the execute unit. Likewise, during the write-back stage, results are written to the GPR. The use of the five port on the register file enables either a load or a store operation to execute in parallel with an ALU operation. Instruction Latency Throughput (Clock Cycles) (Clock Cycles) Arithmetic 1 - Traps 2 - Logical 1 - Shifts/rotates 1 - Multiply two sign extended 16 bit half words in 32 bit registers 2- (16 x 16 W 32) Multiply half word (16 x 16 W 32) 2 1a Multiply one sign extended 16 bit half word in a 32 bit register with a 32a 32 bit word (16 x 32 W least significant 32 bits of a 48 bit product) Multiply (32 x 32 W least significant 32 bits of a 64 bit product) 4 3a Multiply (32 x 32 W least significant 32 bits of a 64 bit product) with 54a overflow detection Multiply (32 x 32 W most significant 32 bits of 64 bit product) with 54a or without overflow detection enabled MAC (16 x 16 + 32 W 32) 2 1a Divide 35 - Loadsb 1b 1b Load multiples/strings (caches hit) 1/transfer - Stores 1 - Store multiples/strings (caches hit) 1/transfer - Move to/from Device Control Register 3 - Move to/from Special Purpose Register 1 - Branch known taken 1d or 2e - Branch known not taken 1 - Taken branch predicted takenc 1d or 2e - Taken branch predicted not takenc 2e or 3f - Not taken branch predicted takenc 2e or 3f - Not taken branch predicted not-takenc 1- Notes: a. Throughput applies when no dependencies exist between the result of the multiplication and the source of the subsequent multiplication. However, the target accumulator from one MAC instruction may be used as the source accumulator for the next MAC instruction. b. Load instructions have a 1 cycle use penalty. Data acquired during the execute pipeline stage is not available until after the completion of the write-back pipeline stage. Write-back takes one clock cycle. c. These cycles are for conditional branches that have a dependency. d. The branch is resolved during the fetch pipeline stage while in PFB0. e. The branch is resolved during the decode pipeline stage while in DCD. f. The branch is resolved during the execute pipeline stage. PowerPC 405 Core Page 5 of 14 11/2/98 Table 1. Instruction Latencies and Throughputs The MAC unit is an auxiliary processor unit (APU) which adds 9 operations to the 405 CPU core instruction set. MAC instructions operate on either signed or unsigned 16 bit operands and accumulate the results in a 32 bit GPR. All MAC unit instructions have a single cycle throughput. Table 2 lists the six MAC operations and three halfword multiply operations supported by the MAC unit. RA and RB in Table 2 are the 16 bit operands and RT is the 32-bit accumulate source and target. Most of the instructions have four flavors. Instructions may be unsigned, signed, modulo or saturate. Modulo and saturate determine the behavior of the instruction when the accumulated value overflows. Description Operation Instruction mnemonic multiply accumulate high Prod0:31 U (RA)0:15 x (RB)0:15 machhw – modulo signed halfword to word RT U Prod0:31 + (RT) machhwu - modulo unsigned machhws – saturate signed machhwsu - saturate unsigned multiply accumulate low Prod0:31 U (RA)16:31 x (RB)16:31 maclhw - modulo signed halfword to word RT U Prod0:31 + (RT) maclhwu - modulo unsigned maclhws - saturate signed maclhwsu - saturate unsigned multiply accumulate cross Prod0:31 U (RA)16:31 x (RB)0:15 macchw - modulo signed halfword to word RT U Prod0:31 + (RT) macchwu - modulo unsigned macchws - saturate signed macchwsu - saturate unsigned negative multiply accumulate Prod0:31 U (RA)0:15 x (RB)0:15 nmachhw – modulo signed high halfword to word RT U - Prod0:31 + (RT) nmachhws – saturate signed negative multiply accumulate Prod0:31 U (RA)16:31 x (RB)16:31 nmaclhw – modulo signed low halfword to word RT U - Prod0:31 + (RT) nmaclhws – saturate signed negative multiply accumulate Prod0:31 U (RA)0:15 x (RB)0:15 nmacchw – modulo signed cross halfword to word RT U - Prod0:31 + (RT) nmacchws – saturate signed multiply high halfword to word RTU (RA)0:15 x (RB)0:15 mulhhw – modulo signed mulhhwu – modulo unsigned multiply low halfword to word RT U (RA)16:31 x (RB)16:31 mullhw – modulo signed mullhwu - modulo unsigned multiply cross halfword to word RT U (RA)16:31 x (RB)0:15 mulchw – modulo signed mulchwu -modulo unsigned Table 2.