Chapter 2-1: CPUs

Soo-Ik Chae
High Performance Embedded Computing, © 2007 Elsevier

Topics
• CPU metrics.
• Categories of CPUs.
• CPU mechanisms.

Performance as a design metric
• Performance = speed:
    • Latency.
    • Throughput.
• Average vs. peak performance.
• Worst-case and best-case performance.

Other metrics
• Cost (area).
• Energy and power.
• Predictability: important for embedded systems.
    • Pipelining: branch penalty.
    • Memory system (cache): cache miss penalty.
• Security: difficult to measure, because not knowing of a successful attack does not show that a system is secure.

Flynn's taxonomy of processors
• Single-instruction single-data (SISD): RISC, etc.
• Single-instruction multiple-data (SIMD): all processors perform the same operations on different data.
• Multiple-instruction multiple-data (MIMD): homogeneous or heterogeneous multiprocessor.
• Multiple-instruction single-data (MISD).

Other axes of comparison
• RISC: emphasis on software.
    • Single-cycle, simple instructions.
    • Register-to-register: "LOAD" and "STORE" are independent instructions.
    • Low cycles per instruction, large code sizes.
    • Spends more transistors on memory registers.
• CISC: emphasis on hardware.
    • Multi-cycle, complex instructions.
    • Memory-to-memory: "LOAD" and "STORE" are incorporated in instructions.
    • High cycles per instruction, small code sizes.
    • Transistors used for storing complex instructions.

     RISC                                 CISC
1.   1-cycle simple instructions          Multi-cycle complex instructions
2.   Only LD/ST can access memory         Any instruction may access memory
3.   Designed around the pipeline         Designed around the instruction set
4.   Instructions executed by hardware    Instructions interpreted by a micro-program
5.   Fixed-format instructions            Variable-format instructions
6.   Few instructions and modes           Many instructions and modes
7.   Complexity in the compiler           Complexity in the micro-program
8.   Multiple register sets               Single register set

Other axes of comparison
• Instruction issue width.
    • Single issue.
    • Multiple issue: higher performance, high cost, increased power consumption.
• Scheduling for multiple-issue machines.
    • Static scheduling: VLIW.
    • Dynamic scheduling: superscalar.
• Vector processing: instructions for 1D or 2D arrays.
• Multithreading: a fine-grained concurrency mechanism that allows the processor to switch quickly between several threads of execution.

Embedded vs. general-purpose processors
• Embedded processors may be optimized for a category of applications.
    • Must be flexible.
    • Customization may be narrow or broad.
• Billions of 8-bit processors are sold each year.
• Hundreds of millions of 32-bit processors are sold for embedded systems.
• We may judge embedded processors using different metrics:
    • Code size.
    • Memory system performance.
    • Predictability.

ARM Processor Family

Processor family   Pipeline stages   Memory organization    Clock rate   MIPS/MHz
ARM6               3                 Von Neumann            25 MHz       -
ARM7               3                 Von Neumann            66 MHz       0.9
ARM8               5                 Von Neumann            72 MHz       1.2
ARM9               5                 Harvard                200 MHz      1.1
ARM10              6                 Harvard                400 MHz      1.25
StrongARM          5                 Harvard                233 MHz      1.15
ARM11              8                 Von Neumann/Harvard    550 MHz      1.2

ARM Architecture Version Summary

Core                                Version   Feature
ARM1020T                            v5T       Improved ARM/Thumb interworking; CLZ instruction for improved division
ARM9E-S, ARM10TDMI, ARM1020E        v5TE      Extended multiplication and saturated maths for DSP-like functionality
ARM7EJ-S, ARM926EJ-S, ARM1026EJ-S   v5TEJ     Jazelle technology for Java acceleration
ARM11, ARM1136J-S                   v6        Low power needed; SIMD (Single Instruction Multiple Data) media processing extensions

J: Jazelle. E: Enhanced DSP instructions. S: Synthesizable. F: Integral vector floating-point unit.

ARM7 3-stage pipeline organization
[Figure: ARM7 datapath block diagram. Blocks: address register, PC incrementer, register bank, multiply register, barrel shifter, ALU, data-out register, data-in register, instruction decoder and control logic. External interface: address bus A[31:0], data bus D[31:0].]
• Address generating block: address register, incrementer, PC.
• Register bank: 31 GPRs, 6 PSRs.
    • 2 read ports, 1 write port.
    • Additional 1 read port and 1 write port for the PC.
• Barrel shifter and ALU.
• Data registers: data-in and data-out.
• Control logic: instruction decoder, datapath control.
• External interface.

ARM7 3-stage pipeline
• The ARM7 family has a 3-stage pipeline.
• Fetch: instruction fetch from memory.
• Decode: instruction decoding; datapath control signals are prepared for the next cycle.
• Execute: reading registers, shift and ALU operations, writing back to the register bank.

PC      F  D  E
PC+i       F  D  E
PC+2i         F  D  E

ARM7 multi-cycle instructions
[Figure: pipeline timing, instruction versus time.]
1   ADD   fetch, decode, execute
2   STR   fetch, decode, calc. addr., data xfer
3   ADD   fetch, decode, execute
4   ADD   fetch, decode, execute
5   ADD   fetch, decode, execute

ARM7 multi-cycle instructions
[Figure: pipeline timing for a branch and for LDR.]
• Branch (B): F, D, then E1 (calc), E2 (link), E3 (adjust). The instructions fetched at PC+i and PC+2i are discarded; execution continues at the targets T and T+i.
• LDR: F, D, then E1 (calc), E2 (xfer), E3 (move).

ARM9TDMI core
[Figure: 5-stage pipeline timing (F D E M W) for LDR followed by ADD, and for a branch.]
• Separated cache: the instruction and data caches are accessible at the same time.

ARM11
• 8-stage pipeline.
• Branch prediction and return stack.
• Separate processing units for the ALU, MAC, and load-store (LS) instructions, although the pipeline is single issue.

Feature Comparison

Feature                       ARM9E              ARM10E             XScale              ARM11
Architecture                  ARMv5TE(J)         ARMv5TE(J)         ARMv5TE             ARMv6
Pipeline length               5                  6                  7~8                 8
Java decode                   (ARM926EJ)         (ARM1026EJ)        No                  Yes
V6 SIMD instructions          No                 No                 No                  Yes
MIA instructions              No                 No                 Yes                 Available as coprocessor
Branch prediction             No                 Static             Dynamic             Dynamic
Independent load-store unit   No                 Yes                Yes                 Yes
Instruction issue             Scalar, in-order   Scalar, in-order   Scalar, in-order    Scalar, in-order
Concurrency                   None               ALU/MAC, LSU       ALU, MAC, LSU       ALU, MAC, LSU
Out-of-order completion       No                 Yes                Yes                 Yes
Target implementation         Synthesizable      Synthesizable      Custom chip         Synthesizable and hard macro
Performance range             Up to 250 MHz      Up to 325 MHz      200 MHz ~ > 1 GHz   350 MHz ~ > 1 GHz

MIPS32 processor family
• MIPS32 4K has a 5-stage pipeline.
• The 4KE family has a DSP extension.
• 4KS is designed for security.

PowerPC processor family
• The 400 series includes several embedded processors.
• MPC7410 is a two-issue machine.
• 970FX has a 16-stage pipeline.

What is DSP?
• DSP = Digital Signal Processing, or DSP = Digital Signal Processor?
• DSP is used to denote both; the meaning can be deduced from the context in which the term is used.
• What is a Digital Signal Processor (DSP)?
    • A microprocessor specifically designed to perform fast DSP operations (e.g., Fast Fourier Transforms, inner products, multiply and accumulate).

DSP performance
• Wireless systems require ever higher performance and bandwidth.
• DSP performance might not be enough for future applications.
[Figure: DSP performance versus bit rate.]

Generation   Bit rate         DSP performance
2G           8-13 Kbps        ~100 MIPS
2.5G         64-384 Kbps      ~10,000 MIPS
3G           384-2000 Kbps    ~100,000 MIPS

Digital signal processors
• First DSP was AT&T DSP16:
    • Hardware multiply-accumulate unit.
    • Harvard architecture.
• Today, DSP is often used as a marketing term.
• Modern DSPs are heavily pipelined.

TMS320C55x™ DSP Generation, 16-bit Fixed Point – Most Power Efficient DSP
Specifications
• C55x™ DSP core delivers 300 MHz for up to 600-MIPS performance
• 1.6-volt core and 3.3-volt peripherals
• 144-MHz/200-MHz clock rate
• 256-KB RAM, 64-KB ROM
• Three McBSPs, I2C, watchdog timer, general-purpose timers
• USB 2.0 full-speed (12 Mbps)
• 10-bit ADC
• Real-time clock (RTC)
Features
• Advanced automatic power management
• Configurable idle domains to extend your battery life
• Shortened debug for faster time-to-market
Applications
• Feature-rich, miniaturized personal and portable products
• 2G, 2.5G and 3G cell phones and basestations
• Digital audio players
• Digital still cameras
• Electronic books
• Voice recognition
• GPS receivers
• Fingerprint/pattern recognition
• Wireless modems
• Headsets
• Biometrics

TMS320C55x™ DSP + RISC, 16-bit Fixed Point – OMAP Processor
Specifications
• Dual CPU processor integrating a TMS320C55x™ DSP core and an ARM925TDMI™ RISC @150 MHz
• 1.8-volt core and 1.8-volt peripherals
Features
• 150-MHz TI-enhanced ARM925
• 16 KB instruction cache and 8 KB data cache
• Data and instruction MMUs
• 32-bit and 16-bit instruction
Applications
• Internet appliances
• Applications processing
• Enhanced gaming
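
The DSP slides above name inner products and multiply-accumulate (MAC) as the core operations of a 16-bit fixed-point DSP. The sketch below illustrates that idea in Python. It assumes the common Q15 number format (16-bit signed values scaled by 2^15), which the slides do not specify, and the names `to_q15` and `mac_inner_product` are hypothetical. It models a wide accumulator in software and is not a description of how the C55x hardware is implemented.

```python
from typing import Sequence

# Assumption: Q15 fixed point, i.e. 16-bit signed integers scaled by 2**15.
Q15_ONE = 1 << 15          # 1.0 would be 32768, one past the largest value
Q15_MAX = Q15_ONE - 1      # largest Q15 value, 32767 (just under 1.0)
Q15_MIN = -Q15_ONE         # smallest Q15 value, -32768 (exactly -1.0)


def to_q15(x: float) -> int:
    """Convert a float to Q15, saturating to the 16-bit range."""
    return max(Q15_MIN, min(Q15_MAX, round(x * Q15_ONE)))


def mac_inner_product(a: Sequence[int], b: Sequence[int]) -> int:
    """Inner product of two Q15 vectors, one multiply-accumulate per element."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    acc = 0  # stands in for a wide accumulator; it holds Q30 products
    for x, y in zip(a, b):
        acc += x * y  # the multiply-accumulate step
    # Rescale from Q30 back to Q15 (round to nearest), then saturate.
    result = (acc + (1 << 14)) >> 15
    return max(Q15_MIN, min(Q15_MAX, result))
```

For example, `mac_inner_product([to_q15(0.5), to_q15(0.25)], [to_q15(0.5), to_q15(0.5)])` returns 12288, which is 0.375 in Q15. Keeping the running sum wider than 16 bits and saturating only at the end is the reason the accumulator is modelled separately from the 16-bit operands.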
