Chapter 2-1: CPUs

Soo-Ik Chae

© 2007 Elsevier 1

Topics

- CPU metrics.
- Categories of CPUs.
- CPU mechanisms.

High Performance Embedded Computing © 2007 Elsevier

Performance as a design metric

- Performance = speed:
  - Latency.
  - Throughput.
- Average vs. peak performance.
- Worst-case and best-case performance.

Other metrics

- Cost (area).
- Energy and power.
- Predictability: important for embedded systems.
  - Pipelining: branch penalty.
  - Memory system (cache): cache miss penalty.
- Security: difficult to measure, because the absence of a known successful attack does not show that a system is secure.

Flynn's taxonomy of processors

- Single-instruction single-data (SISD): RISC, etc.
- Single-instruction multiple-data (SIMD): all processors perform the same operations.
- Multiple-instruction multiple-data (MIMD): homogeneous or heterogeneous multiprocessor.
- Multiple-instruction single-data (MISD).

Other axes of comparison

- RISC.
  - Emphasis on software
  - Single-cycle, simple instructions
  - Register to register: "LOAD" and "STORE" are independent instructions
  - Low cycles per instruction
  - Large code sizes
  - Spends more transistors on memory registers
- CISC.
  - Emphasis on hardware
  - Multi-cycle, complex instructions
  - Memory-to-memory: "LOAD" and "STORE" incorporated in instructions
  - High cycles per instruction
  - Small code sizes
  - Transistors used for storing complex instructions

RISC vs. CISC

     RISC                                  CISC
1.   1-cycle simple instructions           Multi-cycle complex instructions
2.   Only LD/ST can access memory          Any instruction may access memory
3.   Designed around the pipeline          Designed around the instruction set
4.   Instructions executed by hardware     Instructions interpreted by a micro-program
5.   Fixed-format instructions             Variable-format instructions
6.   Few instructions and modes            Many instructions and modes
7.   Complexity in the compiler            Complexity in the micro-program
8.   Multiple register sets                Single register set

Other axes of comparison

- Instruction issue width.
  - Single issue.
  - Multiple issue: higher performance, higher cost, increased power consumption.
- Scheduling for multiple-issue machines.
  - Static scheduling: VLIW.
  - Dynamic scheduling: superscalar.
- Vector processing: instructions that operate on 1D or 2D arrays.
- Multithreading: a fine-grained concurrency mechanism that allows the processor to quickly switch between several threads of execution.

Embedded vs. general-purpose processors

- Embedded processors may be optimized for a category of applications.
  - Must be flexible.
  - Customization may be narrow or broad.
  - Billions of 8-bit processors are sold each year.
  - Hundreds of millions of 32-bit processors are sold for embedded systems.
- We may judge embedded processors using different metrics:
  - Code size.
  - Memory system performance.
  - Predictability.

ARM Processor Family

Processor   # of pipeline   Memory                 Clock     MIPS/MHz
family      stages          organization           rate
ARM6        3               Von Neumann            25 MHz    -
ARM7        3               Von Neumann            66 MHz    0.9
ARM8        5               Von Neumann            72 MHz    1.2
ARM9        5               Harvard                200 MHz   1.1
ARM10       6               Harvard                400 MHz   1.25
StrongARM   5               Harvard                233 MHz   1.15
ARM11       8               Von Neumann/Harvard    550 MHz   1.2
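The MIPS/MHz column is a per-clock efficiency figure, so a rough throughput estimate is the clock rate multiplied by MIPS/MHz. The sketch below applies that to a few rows of the table; the function name `estimated_mips` is illustrative, and the result is an estimate under that assumption, not a measured benchmark.

```python
# (clock rate in MHz, MIPS/MHz) for a few rows of the ARM family table.
ARM_FAMILY = {
    "ARM7": (66, 0.9),
    "ARM9": (200, 1.1),
    "ARM10": (400, 1.25),
    "ARM11": (550, 1.2),
}


def estimated_mips(clock_mhz, mips_per_mhz):
    """Rough throughput estimate: clock rate (MHz) times MIPS per MHz."""
    return clock_mhz * mips_per_mhz


for name, (clock_mhz, mips_per_mhz) in ARM_FAMILY.items():
    print(f"{name:6s} {estimated_mips(clock_mhz, mips_per_mhz):6.1f} MIPS")
```

The estimate shows that most of the gain from ARM7 to ARM11 in this table comes from clock rate, since MIPS/MHz changes only modestly.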

ARM Architecture Version Summary

Core                                Version   Feature
ARM1020T                            v5T       Improved ARM/Thumb interworking;
                                              CLZ instruction for improved division
ARM9E-S, ARM10TDMI, ARM1020E        v5TE      Extended multiplication and saturated maths
                                              for DSP-like functionality
ARM7EJ-S, ARM926EJ-S, ARM1026EJ-S   v5TEJ     Jazelle technology for Java acceleration
ARM11, ARM1136J-S                   v6        Low power needed;
                                              SIMD (Single Instruction Multiple Data)
                                              media processing extensions

J: Jazelle. E: Enhanced DSP instructions. S: Synthesizable. F: Integral vector floating-point unit.

ARM7 3-stage pipeline organization

- Organization
  - Address generating block
    - Address register
    - Incrementer
  - Register bank
    - 31 GPRs, 6 PSRs
    - 2 read ports, 1 write port
    - Additional 1 read port and 1 write port for the PC
  - Barrel shifter
  - ALU
  - Data register
  - Control logic
    - External interface
    - Instruction decoder
    - Datapath control

[Figure: ARM7 datapath block diagram showing the address register and PC incrementer, register bank, multiply register, barrel shifter, ALU, data-in and data-out registers, and instruction decode and control, with the A[31:0] address bus, the D[31:0] data bus, and the internal A, B, and ALU buses.]

ARM7 3-stage pipeline

- The ARM7 family has a 3-stage pipeline.
- Pipeline stages:
  - Fetch
    - Instruction fetch from memory
  - Decode
    - Instruction decoding
    - Datapath control signals for the next cycle
  - Execute
    - Reading registers
    - Shift and ALU operations
    - Writing back to the register bank

[Figure: the instructions at PC, PC+i, and PC+2i each pass through F, D, E, each starting one cycle after the previous one.]

ARM7 multi-cycle instructions

[Figure: pipeline timing diagram, instruction number vs. time.]

1  ADD: fetch, decode, execute
2  STR: fetch, decode, calc. addr., data xfer
3  ADD: fetch, decode, execute
4  ADD: fetch, decode, execute
5  ADD: fetch, decode, execute
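The effect of a multi-cycle instruction on a 3-stage pipeline can be sketched with a small timing model. This is a simplified in-order model and not a cycle-accurate ARM7 simulator: it assumes one fetch cycle, one decode cycle, and an execute stage that an instruction may hold for several cycles. The function name `execute_windows` is illustrative.

```python
def execute_windows(execute_cycles):
    """Return (start, end) cycles of the execute stage for each instruction.

    Assumed model: cycles are numbered from 1, instruction i (0-based) is
    fetched no earlier than cycle i + 1 and decoded one cycle later, so it
    cannot execute before cycle i + 3. It also waits until the previous
    instruction has left the execute stage.
    """
    windows = []
    prev_end = 0
    for i, cycles in enumerate(execute_cycles):
        start = max(i + 3, prev_end + 1)
        end = start + cycles - 1
        windows.append((start, end))
        prev_end = end
    return windows


# ADD, STR, ADD, ADD, ADD: STR holds the execute stage for two cycles
# (address calculation, then data transfer).
sequence = [1, 2, 1, 1, 1]
windows = execute_windows(sequence)
print(windows)         # [(3, 3), (4, 5), (6, 6), (7, 7), (8, 8)]
print(windows[-1][1])  # 8 cycles in total
```

In this model the total time is 2 cycles of pipeline fill plus the sum of the execute cycles, so the one extra STR cycle delays every later instruction by one cycle.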

ARM7 multi-cycle instructions

- Branch: B goes through F and D, then three execute cycles: E1 (calc), E2 (link), E3 (adjust). The instructions fetched at PC+i and PC+2i are discarded; execution continues at the target T, followed by T+i.
- LDR: F and D, then three execute cycles: E1 (calc), E2 (xfer), E3 (move).

ARM9TDMI core

- LDR
- Branch

[Figure: five-stage pipeline (F, D, E, M, W) timing diagrams for an ADD/LDR sequence and for a branch (B).]

- Separate caches: the instruction and data caches are accessible at the same time.

ARM11

- 8-stage pipeline
  - Branch prediction and return stack
  - Separate processing units for the ALU, MAC, and load-store (LS) instructions
    - although the pipeline is single issue

Feature Comparison

Feature                       ARM9E              ARM10E             XScale             ARM11
Architecture                  ARMv5TE(J)         ARMv5TE(J)         ARMv5TE            ARMv6
Pipeline length               5                  6                  7~8                8
Java decode                   (ARM926EJ)         (ARM1026EJ)        No                 Yes
V6 SIMD instructions          No                 No                 No                 Yes
MIA instructions              No                 No                 Yes                Available as coprocessor
Branch prediction             No                 Static             Dynamic            Dynamic
Independent load-store unit   No                 Yes                Yes                Yes
Instruction issue             Scalar, in-order   Scalar, in-order   Scalar, in-order   Scalar, in-order
Concurrency                   None               ALU/MAC, LSU       ALU, MAC, LSU      ALU, MAC, LSU
Out-of-order completion       No                 Yes                Yes                Yes
Target implementation         Synthesizable      Synthesizable      Custom chip        Synthesizable and Hardmacro
Performance range             Up to 250 MHz      Up to 325 MHz      200 MHz ~ > 1 GHz  350 MHz ~ > 1 GHz

MIPS32 processor family

- MIPS: the MIPS32 4K has a 5-stage pipeline; the 4KE family has a DSP extension; the 4KS is designed for security.

PowerPC processor family

- PowerPC: the 400 series includes several embedded processors; the MPC7410 is a two-issue machine; the 970FX has a 16-stage pipeline.

What is DSP?

- DSP = Digital Signal Processing, or DSP = ?
- DSP is used to denote both.
  - The meaning can be deduced from the context in which the term DSP is used.
- What is a Digital Signal Processor (DSP)?
  - A processor specifically designed to perform fast DSP operations (e.g., fast Fourier transforms, inner products, multiply and accumulate).

DSP performance

- Wireless systems require ever higher performance and bandwidth.

Generation   DSP performance   Bit rate
2G           ~100 MIPS         8-13 Kbps
2.5G         ~10,000 MIPS      64-384 Kbps
3G           ~100,000 MIPS     384-2000 Kbps

- Performance might not be enough for future applications.

Digital signal processors

- First DSP was AT&T DSP16:
  - Hardware multiply-accumulate unit.
  - Harvard architecture.
- Today, DSP is often used as a marketing term.
- Modern DSPs are heavily pipelined.

TMS320C55x™ DSP Generation, 16-bit Fixed Point – Most Power Efficient DSP

Specifications
- C55x™ DSP core delivers 300 MHz for up to 600-MIPS performance
- 1.6-volt core and 3.3-volt peripherals
- 144-MHz/200-MHz clock rate
- 256-KB RAM, 64-KB ROM
- Three McBSPs, I2C, watchdog timer, general-purpose timers
- USB 2.0 full-speed (12 Mbps)
- 10-bit ADC
- Real-time clock (RTC)

Features
- Advanced automatic power management
- Configurable idle domains to extend battery life
- Shortened debug for faster time-to-market

Applications
- Feature-rich, miniaturized personal and portable products
- 2G, 2.5G and 3G phones and basestations
- Digital audio players
- Digital still cameras
- Electronic books
- Voice recognition
- GPS receivers
- Fingerprint/pattern recognition
- Wireless modems
- Headsets
- Biometrics

TMS320C55x™ DSP + RISC, 16-bit Fixed Point – OMAP Processor

Specifications
- Dual CPU processor integrating a TMS320C55x™ DSP core and an ARM925TDMI™ RISC @150 MHz
- 1.8-volt core and 1.8-volt peripherals

Features
- 150-MHz TI-enhanced ARM925
  - 16 KB instruction cache and 8 KB data cache
  - Data and instruction MMUs
  - 32-bit and 16-bit instruction sets
- 150-MHz TMS320C55x™ DSP
  - 12 KW (24 KB) instruction cache
  - 80 KW (160 KB) SRAM
  - 16 KW (32 KB) ROM
- Two 16-bit memory interfaces for SDRAM and flash
- Nine-channel system DMA controller
- LCD controller
- USB 1.1 host and client
- MMC/SD card interface
- Seven serial ports plus three UARTs, nine timers, keyboard interface
- Less than 250 mW at 1.6 V

Applications
- Internet appliances
- Applications processing
- Enhanced gaming
- Webpad
- Point-of-sale
- Medical devices
- Industry-specific PDAs
- Telematics
- Digital media processing
- Military and government cellular

TMS320C62x™ DSP Generation, 16-bit Fixed Point – High Performance DSP

Specifications
- 16-bit fixed-point DSPs
- Up to 2400 MIPS
- Running at 300 MHz

Features
- C6000™ DSP Platform VelociTI™ advanced architecture
  - Up to eight 32-bit instructions executed each cycle
  - Eight independent, multi-purpose functional units
  - Thirty-two 32-bit registers
- Industry's most advanced C compiler and Assembly Optimizer maximize efficiency and performance

Applications
- Pooled modems
- Digital Subscriber Line (xDSL)
- Wireless basestations
- Central office switches
- Private Branch Exchange (PBX)
- Digital imaging
- Call processing
- 3D graphics
- Speech recognition
- Voice over packet

TMS320C67x™ DSP Generation, 32-bit Floating Point – High Performance DSP

Specifications
- 32-bit floating-point DSPs
- Up to 1350 MFLOPS
- Running at 225 MHz

Features
- C6000™ DSP Platform VelociTI™ advanced architecture
  - Up to eight 32-bit instructions executed each cycle
  - Eight independent, multi-purpose functional units
  - Thirty-two 32-bit registers
- Industry's most advanced C compiler and Assembly Optimizer maximize efficiency and performance
- IEEE floating-point format
- Up to 1350 MFLOPS at 225 MHz
- Two new multi-channel serial ports (McASP) (C6713 DSP) support stereo channels of I2S (Inter-IC Sound) and are compatible with the S/PDIF transmit protocol. Note: I2S is a protocol for transmitting 2 channels of digital audio over a single serial connection.

Applications
- Pooled modems
- Digital Subscriber Line (xDSL)
- Wireless basestations
- Central office switches
- Private Branch Exchange (PBX)
- Digital imaging
- Call processing
- 3D graphics
- Speech recognition
- Voice over packet

TMS320C64x™ DSP Generation, 16-bit Fixed Point – High Performance DSP

Specifications
- 16-bit fixed-point processor
- TMS320C64x DSP high-performance core provides scalable performance of up to 1.1 GHz
- The industry's fastest DSPs with up to 600 MHz (4800 MIPS) performance
- C64x DSPs are software compatible with TI's C62x™ DSPs

Features
- C6000™ DSP Platform VelociTI™ advanced architecture
  - Up to eight 32-bit instructions executed each cycle
  - Eight independent, multi-purpose functional units
  - Thirty-two 32-bit registers
- Industry's most advanced C compiler and Assembly Optimizer maximize efficiency and performance

Applications
- DSL and pooled modems
- Basestation transceivers
- Wireless LAN
- Enterprise PBX
- Multimedia gateway
- Broadband video transcoders
- Streaming video servers and clients
- High-speed raster image processing (RIP)
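The peak figures quoted for the C6000 parts are consistent with a simple product of issue width and clock rate. The sketch below checks that arithmetic using only the numbers on these slides; `peak_mips` is an illustrative helper, and the result is a peak rate that assumes every issue slot is filled on every cycle, which real code rarely achieves.

```python
def peak_mips(instructions_per_cycle, clock_mhz):
    """Peak rate in MIPS when every issue slot is used on every cycle."""
    return instructions_per_cycle * clock_mhz


# Up to eight 32-bit instructions are executed each cycle.
print(peak_mips(8, 300))  # C62x at 300 MHz -> 2400
print(peak_mips(8, 600))  # C64x at 600 MHz -> 4800

# The C67x figure implies the number of floating-point operations per cycle.
print(1350 / 225)         # 1350 MFLOPS at 225 MHz -> 6.0
```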

Example: TI C5x DSP

- 40-bit arithmetic unit.
  - 32-bit values with 8 guard bits.
- Barrel shifter.
- 17 x 17 multiplier.
- Comparison unit for Viterbi encoding/decoding.
- Single-cycle exponent encoder for wide-dynamic-range arithmetic.
- Two address generators.
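Guard bits give the accumulator headroom so that a run of additions cannot overflow. The sketch below counts how many worst-case 32-bit values fit in a 40-bit accumulator. It assumes two's-complement signed arithmetic, and the names are illustrative rather than taken from TI documentation.

```python
ACC_BITS = 40
VALUE_BITS = 32
GUARD_BITS = ACC_BITS - VALUE_BITS  # 8 guard bits

ACC_MAX = (1 << (ACC_BITS - 1)) - 1      # largest signed 40-bit value
ACC_MIN = -(1 << (ACC_BITS - 1))         # smallest signed 40-bit value
VALUE_MAX = (1 << (VALUE_BITS - 1)) - 1  # largest signed 32-bit value


def fits_in_accumulator(total):
    """True if total is representable as a signed 40-bit value."""
    return ACC_MIN <= total <= ACC_MAX


def max_safe_terms():
    """Count how many worst-case positive 32-bit values can be summed."""
    total = 0
    count = 0
    while fits_in_accumulator(total + VALUE_MAX):
        total += VALUE_MAX
        count += 1
    return count


print(max_safe_terms())  # 256, i.e. 2 ** GUARD_BITS
```

With 8 guard bits, 2^8 = 256 worst-case terms can be accumulated before overflow becomes possible, which is why guard bits matter for long multiply-accumulate loops.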

TI C55x microarchitecture

TI C55x co-processors

- Designed to support
  - Pixel interpolation
  - Motion estimation
  - DCT/IDCT computation
- Interpolates U, M, R values given A, B, C, D pixels.

[Figure: a 2x2 pixel block with A and B in the top row and C and D in the bottom row; U lies between A and B, and M and R are the other interpolated positions.]

Fixed Point vs. Floating Point

Floating-point applications
- Modems
- Digital Subscriber Line (DSL)
- Wireless basestations
- Central office switches
- Private Branch Exchange (PBX)
- Digital imaging
- 3D graphics
- Speech recognition
- Voice over IP

Fixed-point applications
- Portable products
- 2G, 2.5G and 3G cell phones
- Digital audio players
- Digital still cameras
- Electronic books
- Voice recognition
- GPS receivers
- Headsets
- Biometrics
- Fingerprint recognition

Simple VLIW architecture

- Powerful compiler.
- A packet of instructions: Add r1,r2,r3; Sub r4,r5,r6; Ld r7,foo; St r8,baz; NOP
- Large register file with multiple ports feeds multiple function units.

[Figure: E box in which a register file feeds five function units: ALU, ALU, Load/store, Load/store, FU.]

Clustered VLIW architecture

- Register file and function units are divided into clusters.

[Figure: two clusters, each with its own execution units and register file, connected by a cluster bus.]

Parallelism extraction in VLIW

- Static:
  - Use compiler to analyze program.
  - Simpler CPU.
  - Can make use of high-level language constructs.
  - Can't depend on data values.
- Dynamic:
  - Use hardware to identify opportunities.
  - More complex CPU.
  - Can make use of data values.

Motorola Starcore SC140

- DALU includes 4 ALUs, 1 register file.
- AGU includes 2 address arithmetic units (AAU), 1 address register file.
- Program sequencer and control unit (PSEQ).
- Performance:
  - 4 MACs per cycle.
  - 10 RISC MIPS per MHz of clock.

SC140 Core

[Figure: SC140 core block diagram: program sequencer; AGU with the address register file, 2 AAUs, and BMU; DALU with the data register file and 4 ALUs; power management; clock/PLL.]

Typical SC140 configuration

[Figure: SC140 core (program sequencer, ALUs, AAUs) with Level 1 memory expansion (RAM, ROM); DMA, cache, interrupts, Level 2 memory, etc.; and peripherals.]

Instruction format

- 16-bit instructions.
- Up to 6 instructions per cycle.
- Instructions are grouped to define allowable simultaneous operations.

MACR -D0,D1,D7   AND D4,D5   MOVE.L (R0),+N0,R6   ADDA R2,R3

The first two operations (MACR, AND) execute in the DALU; the last two (MOVE.L, ADDA) execute in the AGU.

AGU addressing

- Allowable addressing modes:
  - Linear: useful for general-purpose addressing.
  - Modulo: useful for FIFO queues.
  - Reverse-carry: useful for FFT.
  - Automatic updating during register indirect.
  - Stack.
- Array addressing: base, offset, modifier registers.
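Modulo and reverse-carry addressing are easiest to see as address-update rules. The sketch below models both in software. The function names are illustrative and do not correspond to SC140 registers or instructions. Modulo addressing wraps within a circular buffer, and reverse-carry addressing propagates the carry from the most significant bit toward the least significant bit, which produces the bit-reversed order used by FFTs.

```python
def modulo_next(addr, step, base, length):
    """Advance addr by step, wrapping inside [base, base + length)."""
    return base + (addr - base + step) % length


def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of value."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def reverse_carry_add(addr, increment, bits):
    """Add with the carry running from MSB to LSB (bit-reversed addition)."""
    mask = (1 << bits) - 1
    total = (bit_reverse(addr, bits) + bit_reverse(increment, bits)) & mask
    return bit_reverse(total, bits)


# Circular buffer of 4 words starting at address 100.
addr = 100
fifo = []
for _ in range(6):
    fifo.append(addr)
    addr = modulo_next(addr, 1, 100, 4)
print(fifo)  # [100, 101, 102, 103, 100, 101]

# 8-point FFT: increment by N/2 = 4 with reverse carry.
index = 0
order = []
for _ in range(8):
    order.append(index)
    index = reverse_carry_add(index, 4, 3)
print(order)  # [0, 4, 2, 6, 1, 5, 3, 7]
```

A hardware address generator applies these updates as part of the memory access, so an inner loop needs no separate instructions for the wrap test or the index shuffle.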

TM-1 characteristics

- 27 function units
- Characteristics
  - 5 RISC operations/cycle
  - Floating-point support
  - Sub-word parallelism support
  - Guarded operation (if-conversion)
  - Additional custom operations

TM-1 VLIW CPU

[Figure: the register file connects through a read/write crossbar to function units FU1 ... FU27; operations are issued from slots 1 to 5.]

Trimedia TM-1

[Figure: TM-1 block diagram: memory interface, video in, video out, audio in, audio out, I2C, serial, timers, VLD co-processor, image co-processor, VLIW CPU, PCI.]

Superscalar processors

- Instructions are dynamically scheduled.
  - Dependencies are checked at run time in hardware.
- Used to some extent in embedded processors.
  - Embedded Pentium is two-issue in-order.

SIMD and subword parallelism

- Many special-purpose SIMD machines.
- Subword parallelism is widely used for video.
  - ALU is divided into subwords for independent operations on small operands.
- Vector processing is widely used for integer values.
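Subword parallelism can be imitated in software by blocking the carry between lanes. The sketch below adds four 8-bit values packed into one 32-bit word, with wraparound in each lane. It is a software model of what a divided ALU does in hardware, not the instruction set of any particular processor, and the helper names are illustrative.

```python
LANES = 4
LANE_BITS = 8
HIGH_BITS = 0x80808080  # top bit of every 8-bit lane
LOW_BITS = 0x7F7F7F7F   # low 7 bits of every 8-bit lane


def pack(values):
    """Pack four 8-bit values into one 32-bit word (lane 0 is lowest)."""
    word = 0
    for i, value in enumerate(values):
        word |= (value & 0xFF) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 32-bit word into its four 8-bit lanes."""
    return [(word >> (i * LANE_BITS)) & 0xFF for i in range(LANES)]


def packed_add(a, b):
    """Add four 8-bit lanes at once; no carry crosses a lane boundary."""
    low_sum = (a & LOW_BITS) + (b & LOW_BITS)  # carries stop at bit 7
    return low_sum ^ ((a ^ b) & HIGH_BITS)     # restore each lane's top bit


a = pack([250, 1, 127, 200])
b = pack([10, 2, 1, 100])
print(unpack(packed_add(a, b)))  # [4, 3, 128, 44]
```

Each lane wraps modulo 256 independently, which is the behavior needed for pixel-style data where four small operands share one machine word.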

SIMD Extensions

Multithreading

- Low-level parallelism mechanism.
- Hardware multithreading alternately fetches instructions from separate threads.
- Simultaneous multithreading (SMT) fetches instructions from several threads on each cycle.

Processor Resource Utilization

- Processor choice depends on program characteristics.
  - Leverage our knowledge of the core algorithms.
- Many researchers assume that multimedia algorithms exhibit high levels of parallelism.
  - Experiments with SimpleScalar show that this is not the case.
  - Most applications exhibit fewer than 4 IPC.

Available parallelism in multimedia applications (Talla et al.)

Dynamic behavior of loops in MediaBench (Fritts)

- Path ratio:
  - (instructions executed per iteration) / (total number of loop instructions).
- MediaBench shows small path ratio -> considerable conditional behavior in loops.
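The path ratio can be computed directly from its definition. The sketch below uses made-up instruction counts for a hypothetical loop; they are not MediaBench measurements, and `path_ratio` is an illustrative helper.

```python
def path_ratio(executed_per_iteration, loop_instructions):
    """Average instructions executed per iteration over the loop's size."""
    average = sum(executed_per_iteration) / len(executed_per_iteration)
    return average / loop_instructions


# Hypothetical loop body of 40 instructions, observed for three iterations.
# Each iteration takes a different path through the conditionals.
ratio = path_ratio([12, 15, 9], 40)
print(ratio)
```

A ratio well below 1 means each iteration executes only a small part of the loop body, so most of the body sits on conditional paths.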

Operand characteristics in MediaBench

[Figure: operand characteristics chart; the only recoverable labels are "More than 10" and "78%".]

Operand characteristics in Video Codecs

Dynamic voltage scaling (DVS)

- Power scales with V^2 while performance scales roughly as V.
- Reduce operating voltage, add parallel operating units to make up for lower clock speed.
- DVS doesn't work in high-leakage processors.
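The trade-off behind DVS can be worked through with the two scaling rules stated above. The sketch below is a first-order model only: it assumes power per unit proportional to V^2 and performance per unit proportional to V, and it ignores leakage and any overhead of parallelization. The function name and the choice of two units at half voltage are illustrative.

```python
def scaled_design(voltage_scale, parallel_units):
    """Return (relative throughput, relative power) against one unit at full voltage.

    Assumed model: each unit's performance scales as V and its power as V^2.
    """
    throughput = parallel_units * voltage_scale
    power = parallel_units * voltage_scale ** 2
    return throughput, power


# Halve the voltage and use two parallel units to recover the throughput.
throughput, power = scaled_design(0.5, 2)
print(throughput)  # 1.0: same throughput as the original design
print(power)       # 0.5: half the power
```

Under this model, two units at half voltage match the original throughput at half the power. The saving disappears when leakage dominates, which matches the caveat about high-leakage processors.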

Dynamic voltage and frequency scaling (DVFS)

- Scale both voltage and clock frequency.
- Can use control algorithms to match performance to application, reduce power.

Razor architecture

- Uses a specialized latch to detect errors.
- Recovers only on errors, gains average-case performance.
