Chapter 2-1: CPUs
Soo-Ik Chae
© 2007 Elsevier 1
Topics
CPU metrics. Categories of CPUs. CPU mechanisms.
High Performance Embedded Computing © 2007 Elsevier
Performance as a design metric
Performance = speed:
Latency.
Throughput.
Average vs. peak performance.
Worst-case and best-case performance.
Other metrics
Cost (area).
Energy and power.
Predictability: important for embedded systems.
  Pipelining: branch penalty.
  Memory system (cache): cache miss penalty.
Security: difficult to measure, because not knowing of a successful attack does not show that the system is secure.
Flynn's taxonomy of processors
Single-instruction single-data (SISD): RISC, etc.
Single-instruction multiple-data (SIMD): all processors perform the same operations.
Multiple-instruction multiple-data (MIMD): homogeneous or heterogeneous multiprocessor.
Multiple-instruction single-data (MISD).
Other axes of comparison
RISC.
  Emphasis on software.
  Single-cycle, simple instructions.
  Register to register: "LOAD" and "STORE" are independent instructions.
  Low cycles per instruction, large code sizes.
  Spends more transistors on memory registers.
CISC.
  Emphasis on hardware.
  Multi-cycle, complex instructions.
  Memory-to-memory: "LOAD" and "STORE" incorporated in instructions.
  High cycles per instruction, small code sizes.
  Transistors used for storing complex instructions.
RISC                                  CISC
1. 1-cycle simple instructions        1. Multi-cycle complex instructions
2. Only LD/ST can access memory       2. Any instruction may access memory
3. Designed around pipeline           3. Designed around instruction set
4. Instructions executed by hardware  4. Instructions interpreted by micro-program
5. Fixed-format instructions          5. Variable-format instructions
6. Few instructions and modes         6. Many instructions and modes
7. Complexity in the compiler         7. Complexity in the micro-program
8. Multiple register sets             8. Single register set
Other axes of comparison
Instruction issue width.
Single issue
Multiple issue: higher performance, high cost, increased power consumption.
Scheduling for multiple-issue machines.
  Static scheduling: VLIW.
  Dynamic scheduling: superscalar.
Vector processing: instructions for 1D or 2D arrays.
Multithreading: a fine-grained concurrency mechanism that allows the processor to quickly switch between several threads of execution.
Embedded vs. general-purpose processors
Embedded processors may be optimized for a category of applications.
  Must be flexible.
  Customization may be narrow or broad.
Billions of 8-bit processors sold each year.
100s of millions of 32-bit processors for embedded systems.
We may judge embedded processors using different metrics:
  Code size.
  Memory system performance.
  Predictability.
ARM Processor Family
Processor family   # of pipeline stages   Memory organization    Clock rate   MIPS/MHz
ARM6               3                      Von Neumann            25 MHz
ARM7               3                      Von Neumann            66 MHz       0.9
ARM8               5                      Von Neumann            72 MHz       1.2
ARM9               5                      Harvard                200 MHz      1.1
ARM10              6                      Harvard                400 MHz      1.25
StrongARM          5                      Harvard                233 MHz      1.15
ARM11              8                      Von Neumann/Harvard    550 MHz      1.2
ARM Architecture Version Summary
Core                                Version   Feature
ARM1020T                            v5T       Improved ARM/Thumb interworking; CLZ instruction for improved division
ARM9E-S, ARM10TDMI, ARM1020E        v5TE      Extended multiplication and saturated maths for DSP-like functionality
ARM7EJ-S, ARM926EJ-S, ARM1026EJ-S   v5TEJ     Jazelle Technology for Java acceleration
ARM11, ARM1136J-S                   v6        Low power needed; SIMD (Single Instruction Multiple Data) media processing extensions
Core name suffixes: J: Jazelle. E: Enhanced DSP instruction. S: Synthesizable. F: integral vector floating point unit.
ARM7 3-stage pipeline organization
[Figure: ARM7 datapath organization. Blocks: address register, address incrementer, register bank (31 GPRs, 6 PSRs; 2 read ports and 1 write port, plus an additional 1 read and 1 write port for the PC), multiply register, barrel shifter, ALU, data-in and data-out registers, instruction decoder and control logic. External interface: A[31:0], D[31:0], control.]
ARM7 3-stage pipeline
The ARM7 family has a 3-stage pipeline.
Fetch (F): instruction fetch from memory.
Decode (D): instruction decoding; datapath control signals for the next cycle.
Execute (E): reading registers, shift and ALU operations, writing back to the register bank.
[Figure: the instructions at PC, PC+i, and PC+2i each pass through F, D, E, offset by one cycle.]
ARM7 multi-cycle instructions
Instruction   Pipeline activity (time runs left to right)
1 ADD         fetch, decode, execute
2 STR         fetch, decode, calc. addr., data xfer
3 ADD         fetch, decode, execute
4 ADD         fetch, decode, execute
5 ADD         fetch, decode, execute
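The effect of the two-cycle STR on the instructions behind it can be sketched with a small timing model. This is a simplified illustration, not a cycle-accurate ARM7 model: it assumes each stage holds one instruction and that an instruction enters a stage in the cycle its predecessor leaves it. The function name `pipeline_schedule` and the cycle numbering are assumptions made for this example.

```python
def pipeline_schedule(exec_cycles):
    """Return per-instruction stage timing for a 3-stage in-order pipeline.

    exec_cycles[i] is the number of cycles instruction i spends in execute
    (1 for ADD, 2 for STR in the slide's example). Cycles are numbered from 1.
    Assumption: a stage is entered in the cycle the previous instruction
    vacates it, so a long execute stalls the instructions behind it.
    """
    schedule = []
    prev = None
    for n in exec_cycles:
        if prev is None:
            fetch, decode, start = 1, 2, 3
        else:
            fetch = prev["decode"]          # fetch stage frees when predecessor decodes
            decode = prev["exec_start"]     # decode stage frees when predecessor executes
            start = prev["exec_end"] + 1    # execute stage frees when predecessor finishes
        prev = {
            "fetch": fetch,
            "decode": decode,
            "exec_start": start,
            "exec_end": start + n - 1,
        }
        schedule.append(prev)
    return schedule


# ADD, STR, ADD, ADD, ADD as in the slide; STR needs two execute cycles.
timing = pipeline_schedule([1, 2, 1, 1, 1])
total_cycles = timing[-1]["exec_end"]
```

Under these assumptions the five instructions finish in 8 cycles rather than the 7 an all-single-cycle sequence would need: the extra STR cycle delays every later instruction by one.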
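ARM7 multi-cycle instructions
[Figure: ARM7 pipeline timing for a branch (B) and a load (LDR). The branch uses three execute cycles (E1 calc, E2 link, E3 adjust); the instructions fetched at PC+i and PC+2i are discarded, and execution continues with the instructions at the target T and T+i. The LDR uses three execute cycles (E1 calc, E2 xfer, E3 move).]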
ARM9TDMI core
[Figure: ARM9TDMI 5-stage pipeline (F D E M W) timing for an LDR sequence and a branch sequence.]
Separate caches: the instruction and data caches are accessible at the same time.
ARM11
8 stage pipeline
Branch Prediction and Return Stack
Separate processing units for the ALU, MAC, and Load-Store (LS) instructions, although the pipeline is single issue
Feature Comparison
Feature                       ARM9E              ARM10E             XScale             ARM11
Architecture                  ARMv5TE(J)         ARMv5TE(J)         ARMv5TE            ARMv6
Pipeline length               5                  6                  7~8                8
Java decode                   (ARM926EJ)         (ARM1026EJ)        No                 Yes
V6 SIMD instructions          No                 No                 No                 Yes
MIA instructions              No                 No                 Yes                Available as coprocessor
Branch prediction             No                 Static             Dynamic            Dynamic
Independent load-store unit   No                 Yes                Yes                Yes
Instruction issue             Scalar, in-order   Scalar, in-order   Scalar, in-order   Scalar, in-order
Concurrency                   None               ALU/MAC, LSU       ALU, MAC, LSU      ALU, MAC, LSU
Out-of-order completion       No                 Yes                Yes                Yes
Target implementation         Synthesizable      Synthesizable      Custom chip        Synthesizable and Hardmacro
Performance range             Up to 250MHz       Up to 325MHz       200MHz ~ > 1GHz    350MHz ~ > 1GHz
MIPS32 processor family
MIPS: MIPS32 4K has a 5-stage pipeline; 4KE family has DSP extension; 4KS is designed for security.
PowerPC processor family
PowerPC: 400 series includes several embedded processors; MPC7410 is a two-issue machine; 970FX has a 16-stage pipeline.
What is DSP?
DSP = Digital Signal Processing OR DSP = Digital Signal Processor?
DSP is used to denote both; the meaning can be deduced from the context in which the term is used.
What is a Digital Signal Processor (DSP)?
  A microprocessor specifically designed to perform fast DSP operations (e.g., Fast Fourier Transforms, inner products, multiply and accumulate).
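As an illustration of why multiply and accumulate matters, an inner product and a FIR filter both reduce to a chain of MAC steps. The sketch below is plain Python, not DSP code; the function names `mac`, `inner_product`, and `fir` are chosen for this example only.

```python
def mac(acc, a, b):
    """One multiply-accumulate step: acc + a * b."""
    return acc + a * b


def inner_product(xs, ys):
    """Inner product computed as one MAC per pair of elements."""
    if len(xs) != len(ys):
        raise ValueError("vectors must have the same length")
    acc = 0
    for a, b in zip(xs, ys):
        acc = mac(acc, a, b)
    return acc


def fir(samples, taps):
    """FIR filter y[n] = sum_k taps[k] * x[n-k], with x[n] = 0 for n < 0."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc = mac(acc, h, samples[n - k])
        out.append(acc)
    return out


dot = inner_product([1, 2, 3], [4, 5, 6])
filtered = fir([1, 2, 3, 4], [1, 1])
```

A DSP with a hardware MAC unit performs the body of the inner loop in a single operation, which is why these kernels are used to characterize DSP performance.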
DSP performance
Wireless systems require higher and higher performance and bandwidth.
DSP performance might not be enough for future applications.
Generation   DSP performance   Bit rate
2G           ~100 MIPS         8-13 Kbps
2.5G         ~10,000 MIPS      64-384 Kbps
3G           ~100,000 MIPS     384-2000 Kbps
Digital signal processors
First DSP was AT&T DSP16:
  Hardware multiply-accumulate unit.
  Harvard architecture.
Today, DSP is often used as a marketing term.
Modern DSPs are heavily pipelined.
TMS320C55x™ DSP Generation, 16-bit Fixed Point: Most Power Efficient DSP
Specifications
• C55x™ DSP core delivers 300 MHz for up to 600-MIPS performance
• 1.6-volt core and 3.3-volt peripherals
• 144-MHz/200-MHz clock rate
• 256-KB RAM, 64-KB ROM
• Three McBSPs, I2C, watchdog timer, general-purpose timers
• USB 2.0 full-speed (12 Mbps)
• 10-bit ADC
• Real-time clock (RTC)
Features
• Advanced automatic power management
• Configurable idle domains to extend your battery life
• Shortened debug for faster time-to-market
Applications
• Feature-rich, miniaturized personal and portable products
• 2G, 2.5G and 3G cell phones and basestations
• Digital audio players
• Digital still cameras
• Electronic books
• Voice recognition
• GPS receivers
• Fingerprint/pattern recognition
• Wireless modems
• Headsets
• Biometrics
TMS320C55x™ DSP + RISC, 16-bit Fixed Point: OMAP Processor
Specifications
• Dual CPU processor integrating a TMS320C55x™ DSP core and an ARM925TDMI™ RISC @150 MHz
• 1.8-volt core and 1.8-volt peripherals
Features
150-MHz TI-enhanced ARM925
• 16 KB instruction cache and 8 KB data cache
• Data and instruction MMUs
• 32-bit and 16-bit instruction sets
150-MHz TMS320C55x™ DSP
• 12 KW (24 KB) instruction cache
• 80 KW (160 KB) SRAM
• 16 KW (32 KB) ROM
• Two 16-bit memory interfaces for SDRAM and flash
• Nine-channel system DMA controller
• LCD controller
• USB 1.1 host and client
• MMC/SD card interface
• Seven serial ports plus three UARTs, nine timers, keyboard interface
• Less than 250 mW at 1.6 V
Applications
• Internet appliances
• Applications processing
• Enhanced gaming
• Webpad
• Point-of-sale
• Medical devices
• Industry-specific PDAs
• Telematics
• Digital media processing
• Military and government cellular
TMS320C62x™ DSP Generation, 16-bit Fixed Point: High Performance DSP
Specifications
• 16-bit fixed-point DSPs
• Up to 2400 MIPS
• Running at 300 MHz
Features
• C6000™ DSP Platform VelociTI™ advanced architecture
• Up to eight 32-bit instructions executed each cycle
• Eight independent, multi-purpose functional units and thirty-two 32-bit registers
• Industry's most advanced C compiler and Assembly Optimizer maximize efficiency and performance
Applications
• Pooled modems
• Digital Subscriber Line (xDSL)
• Wireless basestations
• Central office switches
• Private Branch Exchange (PBX)
• Digital imaging
• Call processing
• 3D graphics
• Speech recognition
• Voice over packet
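The 2400 MIPS figure follows directly from the issue width and the clock rate quoted on this slide. A minimal check, with `peak_mips` as a hypothetical helper name:

```python
def peak_mips(instructions_per_cycle, clock_mhz):
    """Peak instruction rate in MIPS: issue width times clock rate in MHz."""
    return instructions_per_cycle * clock_mhz


# C62x figures from the slide: up to eight instructions per cycle at 300 MHz.
c62x_peak = peak_mips(8, 300)
```

This is a peak rate that assumes all eight slots are filled on every cycle; sustained throughput depends on how much parallelism the compiler finds in the program.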
TMS320C67x™ DSP Generation, 32-bit Floating Point: High Performance DSP
Specifications
• 32-bit floating point DSPs
• Up to 1350 MFLOPS
• Running at 225 MHz
Features
• C6000™ DSP Platform VelociTI™ advanced architecture
• Up to eight 32-bit instructions executed each cycle
• Eight independent, multi-purpose functional units and thirty-two 32-bit registers
• Industry's most advanced C compiler and Assembly Optimizer maximize efficiency and performance
• IEEE floating-point format
• Up to 1350 MFLOPS at 225 MHz
• Two new multi-channel serial ports (McASP) (C6713 DSP) support stereo channels of I2S (Inter-IC Sound) and are compatible with the S/PDIF transmit protocol
Applications
• Pooled modems
• Digital Subscriber Line (xDSL)
• Wireless basestations
• Central office switches
• Private Branch Exchange (PBX)
• Digital imaging
• Call processing
• 3D graphics
• Speech recognition
• Voice over packet
Note: I2S is a protocol for transmitting 2 channels of digital audio over a single serial connection.
TMS320C64x™ DSP Generation, 16-bit Fixed Point: High Performance DSP
Specifications
• 16-bit fixed point processor
• TMS320C64x DSP high-performance core provides scalable performance of up to 1.1 GHz
• The industry's fastest DSPs with up to 600 MHz (4800 MIPS) performance
• C64x DSPs are software compatible with TI's C62x™ DSPs
Features
• C6000™ DSP Platform VelociTI™ advanced architecture
• Up to eight 32-bit instructions executed each cycle
• Eight independent, multi-purpose functional units and thirty-two 32-bit registers
• Industry's most advanced C compiler and Assembly Optimizer maximize efficiency and performance
Applications
• DSL and pooled modems
• Basestation transceivers
• Wireless LAN
• Enterprise PBX
• Multimedia gateway
• Broadband video transcoders
• Streaming video servers and clients
• High-speed raster image processing (RIP)
Example: TI C5x DSP
40-bit arithmetic unit.
  32-bit values with 8 guard bits.
Barrel shifter.
17 x 17 multiplier.
Comparison unit for Viterbi encoding/decoding.
Single-cycle exponent encoder for wide-dynamic-range arithmetic.
Two address generators.
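The purpose of the guard bits can be shown with a short calculation: 8 extra bits let a 40-bit accumulator sum 2^8 = 256 worst-case 32-bit signed values before it can overflow. The sketch below models the accumulator with Python integers; the names and the choice to raise `OverflowError` are assumptions of this example, not a description of how the C5x hardware reports overflow.

```python
ACC_BITS = 40
VALUE_BITS = 32
GUARD_BITS = ACC_BITS - VALUE_BITS  # 8 guard bits


def fits_signed(value, bits):
    """True if value is representable as a two's-complement number of this width."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


def accumulate(values):
    """Sum values, raising OverflowError if the 40-bit accumulator would overflow."""
    acc = 0
    for v in values:
        acc += v
        if not fits_signed(acc, ACC_BITS):
            raise OverflowError("accumulator exceeded 40 bits")
    return acc


MAX32 = (1 << (VALUE_BITS - 1)) - 1   # largest 32-bit signed value
SAFE_TERMS = 1 << GUARD_BITS          # 256 worst-case terms always fit

worst_case_sum = accumulate([MAX32] * SAFE_TERMS)
```

Adding one more worst-case term (257 in total) exceeds the 40-bit range, which is the limit the guard bits buy.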
TI C55x microarchitecture
TI C55x co-processors
Designed to support:
  Pixel interpolation.
  Motion estimation.
  DCT/IDCT computation.
Pixel interpolation: interpolates U, M, R values given A, B, C, D pixels.
[Figure: pixels A and B on the top row with U between them, M and R on the middle row, and pixels C and D on the bottom row.]
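A possible reading of the interpolation, sketched below, is half-pixel averaging as used in video codecs: U lies between A and B, R between B and D, and M at the center of the four pixels. The rounding formulas and the function name `interpolate` are assumptions for illustration; the slide does not give the exact arithmetic the co-processor uses.

```python
def interpolate(a, b, c, d):
    """Half-pixel interpolation of a 2x2 block (assumed rounded averages).

    Pixel layout assumed from the figure:
        A  U  B
           M  R
        C     D
    """
    u = (a + b + 1) >> 1           # midpoint of A and B
    m = (a + b + c + d + 2) >> 2   # center of all four pixels
    r = (b + d + 1) >> 1           # midpoint of B and D
    return u, m, r


u, m, r = interpolate(10, 20, 30, 40)
```

Each output needs only adds and a shift, which makes the operation a natural fit for a small dedicated function unit.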
Fixed Point Vs Floating Point
Floating Point Applications
• Modems
• Digital Subscriber Line (DSL)
• Wireless Basestations
• Central Office Switches
• Private Branch Exchange (PBX)
• Digital Imaging
• 3D Graphics
• Speech Recognition
• Voice over IP
Fixed Point Applications
• Portable Products
• 2G, 2.5G and 3G Cell Phones
• Digital Audio Players
• Digital Still Cameras
• Electronic Books
• Voice Recognition
• GPS Receivers
• Headsets
• Biometrics
• Fingerprint Recognition
Simple VLIW architecture
Powerful compiler.
A packet of instructions: Add r1,r2,r3; Sub r4,r5,r6; Ld r7,foo; St r8,baz; NOP
Large register file with multiple ports feeds multiple function units.
[Figure: register file feeding an E box with function units ALU, ALU, Load/store, Load/store, FU.]
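How a compiler might fill such packets can be sketched with a greedy, in-order packer. This is an illustrative toy, not a real VLIW scheduler: the slot layout, the `Op` record, the convention that the first operand is the destination, and the decision to ignore memory dependences are all assumptions of this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical machine: one issue slot per function unit in the figure.
SLOTS = ("alu", "alu", "mem", "mem", "fu")


@dataclass(frozen=True)
class Op:
    text: str                    # assembly text, e.g. "Add r1,r2,r3"
    unit: str                    # kind of slot the operation needs
    dest: Optional[str] = None   # register written (assumed: first operand)
    srcs: Tuple[str, ...] = ()   # registers read


def free_slot(packet, unit):
    """Index of the first empty slot of the requested kind, or None."""
    for i, kind in enumerate(SLOTS):
        if kind == unit and packet[i] is None:
            return i
    return None


def pack(ops):
    """Pack ops in program order; start a new packet when no slot is free
    or when the op reads or rewrites a register written in the current packet."""
    packets = []
    current = [None] * len(SLOTS)
    written = set()
    for op in ops:
        slot = free_slot(current, op.unit)
        depends = op.dest in written or any(s in written for s in op.srcs)
        if slot is None or depends:
            packets.append(current)
            current = [None] * len(SLOTS)
            written = set()
            slot = free_slot(current, op.unit)
            if slot is None:
                raise ValueError(f"no slot of kind {op.unit!r}")
        current[slot] = op
        if op.dest is not None:
            written.add(op.dest)
    if any(op is not None for op in current):
        packets.append(current)
    return packets


def render(packet):
    """Format a packet, writing NOP for empty slots."""
    return "; ".join(op.text if op is not None else "NOP" for op in packet)


program = [
    Op("Add r1,r2,r3", "alu", "r1", ("r2", "r3")),
    Op("Sub r4,r5,r6", "alu", "r4", ("r5", "r6")),
    Op("Ld r7,foo", "mem", "r7"),
    Op("St r8,baz", "mem", None, ("r8",)),
    Op("Add r9,r1,r4", "alu", "r9", ("r1", "r4")),
]
packets = [render(p) for p in pack(program)]
```

The first four operations are independent and fill one packet, reproducing the packet on the slide; the last operation reads r1 and r4, so it is pushed into a second packet that is mostly NOPs. That padding is the code-size cost of static scheduling.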
Clustered VLIW architecture
Register file, function units divided into clusters.
[Figure: two clusters, each with its own execution units and register file, connected by a cluster bus.]
Parallelism extraction in VLIW
Static:
  Use compiler to analyze program.
  Simpler CPU.
  Can make use of high-level language constructs.
  Can't depend on data values.
Dynamic:
  Use hardware to identify opportunities.
  More complex CPU.
  Can make use of data values.
Motorola Starcore SC140
DALU includes 4 ALUs, 1 register file.
AGU includes 2 address arithmetic units (AAU), 1 address register file.
Program sequencer and control unit (PSEQ).
Performance:
  4 MACs per cycle.
  10 RISC MIPS per MHz clock.
SC140 Core
[Figure: SC140 core block diagram. Program sequencer; AGU with address register file, 2 AAUs, and BMU; DALU with data register file and 4 ALUs; power management; clock/PLL.]
Typical SC140 configuration
[Figure: SC140 core (program sequencer, 4 ALUs, 2 AAUs) with level 1 memory expansion (RAM, ROM), a block containing DMA, cache, interrupts, level 2 memory, etc., and peripherals.]
Instruction format
16-bit instructions.
Up to 6 instructions per cycle.
Instructions are grouped to define allowable simultaneous operations.
Example group:
  DALU: MACR -D0,D1,D7    AND D4,D5
  AGU:  MOVE.L (R0),+N0,R6    ADDA R2,R3
AGU addressing
Allowable addressing modes:
  Linear: useful for general-purpose addressing.
  Modulo: useful for FIFO queues.
  Reverse-carry: useful for FFT.
Automatic updating during register indirect.
Stack.
Array addressing: base, offset, modifier registers.
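Modulo and reverse-carry updates can be sketched in a few lines. The functions below model the address arithmetic only; their names and parameters are assumptions for this example and do not correspond to SC140 register names.

```python
def modulo_next(addr, step, base, length):
    """Advance addr by step, wrapping inside the buffer [base, base + length)."""
    return base + (addr - base + step) % length


def reverse_bits(x, bits):
    """Reverse the low `bits` bits of x."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def reverse_carry_add(addr, step, bits):
    """Add with the carry propagating from the most significant bit downward.

    Implemented as: bit-reverse both operands, add normally, reverse back.
    """
    mask = (1 << bits) - 1
    total = (reverse_bits(addr, bits) + reverse_bits(step, bits)) & mask
    return reverse_bits(total, bits)


# FIFO-style walk over a 4-entry buffer starting at address 100.
fifo = [100]
for _ in range(4):
    fifo.append(modulo_next(fifo[-1], 1, base=100, length=4))

# Bit-reversed index order for an 8-point FFT: step by N/2 = 4 with reverse carry.
fft_order = [0]
for _ in range(7):
    fft_order.append(reverse_carry_add(fft_order[-1], 4, bits=3))
```

Doing these updates in the AGU, in parallel with the data operation, removes the explicit wrap test from a FIFO loop and the index shuffle from an FFT.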
TM-1 characteristics
27 function units
5 RISC operations per cycle
Floating point support
Sub-word parallelism support
Guarded operation (If Conversion)
Additional custom operations
TM-1 VLIW CPU
[Figure: register file connected through a read/write crossbar to function units FU1 ... FU27, fed by five issue slots (slot 1 to slot 5).]
Trimedia TM-1
[Figure: TM-1 block diagram. Memory interface; video in, video out, audio in, audio out, I2C, serial; timers, VLD co-processor, image co-processor, VLIW CPU, PCI.]
Superscalar processors
Instructions are dynamically scheduled.
  Dependencies are checked at run time in hardware.
Used to some extent in embedded processors.
Embedded Pentium is two-issue in-order.
SIMD and subword parallelism
Many special-purpose SIMD machines.
Subword parallelism is widely used for video.
  ALU is divided into subwords for independent operations on small operands.
Vector processing is widely used for integer values.
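Subword parallelism can be imitated in software to show the idea: four 8-bit lanes packed into one 32-bit word are added at once, with carries prevented from crossing lane boundaries. This is a sketch of the general technique, not of any particular processor's SIMD instructions; the helper names are made up for the example.

```python
LANES = 4
LANE_BITS = 8
HIGH = 0x80808080   # top bit of each 8-bit lane
LOW = 0x7F7F7F7F    # low 7 bits of each lane


def pack_u8(lanes):
    """Pack four 8-bit values into one 32-bit word, lane 0 in the low byte."""
    word = 0
    for i, v in enumerate(lanes):
        word |= (v & 0xFF) << (LANE_BITS * i)
    return word


def unpack_u8(word):
    """Split a 32-bit word into its four 8-bit lanes."""
    return [(word >> (LANE_BITS * i)) & 0xFF for i in range(LANES)]


def packed_add_u8(x, y):
    """Add four 8-bit lanes at once, each wrapping modulo 256.

    The low 7 bits of every lane are added with an ordinary add (the sum
    cannot carry out of the lane), then the top bit of each lane is fixed
    up with XOR so no carry ever reaches the neighbouring lane.
    """
    low_sum = (x & LOW) + (y & LOW)
    return (low_sum ^ ((x ^ y) & HIGH)) & 0xFFFFFFFF


result = unpack_u8(packed_add_u8(pack_u8([1, 2, 3, 250]), pack_u8([10, 20, 30, 10])))
```

Hardware with subword support gets the same effect by cutting the ALU's carry chain at the lane boundaries, so one instruction processes several pixels.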
SIMD Extensions
Multithreading
Low-level parallelism mechanism.
Hardware multithreading alternately fetches instructions from separate threads.
Simultaneous multithreading (SMT) fetches instructions from several threads on each cycle.
Processor Resource Utilization
Processor choice depends on program characteristics.
  Leverage our knowledge of the core algorithms.
Many researchers assume that multimedia algorithms exhibit high levels of parallelism.
  Experiments with SimpleScalar show that this is not the case.
  Most applications exhibit fewer than 4 IPC.
Available parallelism in multimedia applications (Talla et al.)
Dynamic behavior of loops in MediaBench (Fritts)
Path ratio = (instructions executed per iteration) / (total number of loop instructions).
MediaBench shows small path ratio -> considerable conditional behavior in loops.
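A small worked example of the metric, using made-up numbers: a loop body of 20 instructions whose iterations execute only 8 or 12 of them because of conditionals. The function name `path_ratio` and the averaging over iterations are assumptions of this sketch.

```python
def path_ratio(executed_per_iteration, loop_instructions):
    """Average instructions executed per iteration divided by the loop's size."""
    if not executed_per_iteration or loop_instructions <= 0:
        raise ValueError("need at least one iteration and a non-empty loop")
    average = sum(executed_per_iteration) / len(executed_per_iteration)
    return average / loop_instructions


# Hypothetical loop: 20 instructions in the body, four observed iterations.
ratio = path_ratio([8, 12, 8, 12], 20)
```

A ratio of 0.5 here means that on average half of the loop body is skipped each iteration, the kind of conditional behavior that limits how well a loop can be scheduled statically.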
Operand characteristics in MediaBench
[Figure: operand characteristics in MediaBench; surviving chart annotations: "More than 10" and "78%".]
Operand characteristics in Video Codecs
Dynamic voltage scaling (DVS)
Power scales with V^2 while performance scales roughly as V.
Reduce operating voltage, add parallel operating units to make up for lower clock speed.
DVS doesn't work in high-leakage processors.
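The trade-off can be made concrete with a first-order model. The sketch assumes dynamic power proportional to V^2 times clock frequency, clock frequency proportional to V, perfectly parallel work, and no leakage; the function names and the normalization to 1.0 at full voltage are choices made for this example.

```python
def relative_power(voltage, units=1):
    """Dynamic power relative to one unit at full voltage.

    Assumes power ~ units * V^2 * f with f ~ V, and ignores leakage.
    """
    frequency = voltage
    return units * voltage ** 2 * frequency


def relative_throughput(voltage, units=1):
    """Throughput relative to one unit at full voltage, assuming ideal parallelism."""
    return units * voltage


# Halve the voltage (and so the clock) and use two units instead of one.
power = relative_power(0.5, units=2)
throughput = relative_throughput(0.5, units=2)
```

Under these assumptions two half-voltage units match the original throughput at a quarter of the dynamic power. Leakage power does not shrink in the same way and grows with the added units, which is consistent with the slide's remark about high-leakage processors.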
Dynamic voltage and frequency scaling (DVFS)
Scale both voltage and clock frequency.
Can use control algorithms to match performance to application, reduce power.
Razor architecture
Uses a specialized latch to detect errors.
Recovers only on errors, gains average-case performance.