Embedded Computer Architecture

Embedded Systems

Jakob Engblom, PhD
Uppsala University & Virtutech Inc.

14 Nov 2003 Embedded 2

Embedded Systems

Now what is this elephant thing?
“It is a wall!” “No, a snake!” “No, a pillar!” “No, it is a treetrunk!” “You’re all wrong, it is a fan!”

Embedded Systems

» “A computer that doesn’t look like a computer”
» Interacts with world
» Primitive or no user interface
» Part of other products

Embedded Systems

» Single purpose products
  ¡ Not general purpose like desktop PCs
  ¡ Do one thing very efficiently
» Software very important:
  ¡ Gives character to product
    – Used to differentiate inside a “platform”
  ¡ Can be changed late
  ¡ Processor cheaper than special HW
  ¡ Today, dominates dev cost

Processor Market

» Embedded = most processors!
  ¡ 200 million PC and server
  ¡ 8000 million embedded

[Pie chart: "Embedded" 98%, "Desktop" 2%]

Processor Market

» Processors:
  ¡ 50% of all semiconductor revenue
  ¡ Explains why everyone wants to do processors
» 32-bit dominant
  ¡ 30% of total semiconductors
» PC processors:
  ¡ 50% of CPU revenue
  ¡ 15% of total semiconductors
  ¡ AMD and Intel share it

[Chart: processor market share in units and in money, 0% to 100%, split into DSP, 4-bit, 8-bit, 16-bit and 32-bit]

Real-Time System

» Timing as important as result
» Hard real-time:
  ¡ Hard deadlines
  ¡ Dead if missed deadline
  ¡ Worst-case
» Soft real-time:
  ¡ Fuzzier deadlines
  ¡ Can miss some deadlines
  ¡ Average-case
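The hard/soft distinction above can be made concrete with a small sketch. It assumes a list of measured response times and a deadline; the function names, the sample values and the 10% miss budget are all hypothetical and only illustrate the idea that hard real-time is judged on the worst case while soft real-time tolerates a bounded share of misses.

```python
from statistics import mean


def meets_hard_deadline(response_times_ms, deadline_ms):
    """Hard real-time: one late response is a failure, so only the worst case counts."""
    return max(response_times_ms) <= deadline_ms


def meets_soft_deadline(response_times_ms, deadline_ms, max_miss_ratio):
    """Soft real-time: some misses are tolerated if the miss ratio stays within budget."""
    misses = sum(1 for t in response_times_ms if t > deadline_ms)
    return misses / len(response_times_ms) <= max_miss_ratio


# Hypothetical measurements: nine fast responses and one slow outlier.
samples_ms = [2.0, 2.1, 1.9, 2.0, 2.2, 2.0, 2.1, 1.8, 2.0, 9.0]

print(meets_hard_deadline(samples_ms, deadline_ms=5.0))           # False
print(meets_soft_deadline(samples_ms, 5.0, max_miss_ratio=0.1))   # True
print(mean(samples_ms))  # the average case looks fine despite the outlier
```

The same trace fails the hard criterion and passes the soft one, which is why hard real-time design cares about worst-case behavior rather than averages.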

Real-Time Systems

» Embedded and Real-Time
  ¡ Synonymous?
» Most embedded systems are real-time
» Most real-time systems are embedded

[Diagram: overlapping sets "embedded", "real-time" and "embedded real-time"]

Simple Embedded Systems

8-bit , standard
Behavior, talk, IR communications

8-bit Hitachi H8/300
32 kB ROM, 32 kB RAM
Standard microcontroller chip
Byte-code machine, sensor drivers, …

Fun App: Smart Beer Glass

» Capacitive sensor for fluid level
» 8-bit, 8-pin PIC processor
» Inductive coil for RF ID activation & power
» Contactless transmission of power and readings
» CPU and reading coil in the table. Reports the level of fluid in the glass, alerts servers when close to empty

No Upgrades Possible

» Once a product ships…
» …it often cannot be serviced
  ¡ No download ability
  ¡ No writable persistent storage
  ¡ No disks
  ¡ No loader
» Software is write-once
» (There are exceptions)

Consumer Electronics

» Heterogeneous multiprocessor
  ¡ 8-bit AVR for UI, games, …
  ¡ 16-bit fixed-point TI C54 DSP for GSM coding, radio interface, …
  ¡ 32-bit ARM7 in Bluetooth module
  ¡ + maybe ARM7 in IRDA interface
» All in custom chips
» Software is large:
  ¡ 16 MB of code in control part
  ¡ Plus signal processing code

Automotive

» Multiple networks
  ¡ CAN for body electronics: 30+ nodes
  ¡ CAN for engine control: few nodes
  ¡ LIN for instruments
» Many processors
  ¡ Up to 100
» Large diversity in processor types:
  ¡ 8-bit CPUs (PIC, HC08) for door locks, lights, etc.
  ¡ 16-bit CPUs (C167, HC11, HC12) for most functions
  ¡ 32-bit CPUs (PPC,) for engine control, airbags
» Total amount of code: 40-50 MB

Automotive

» Form follows function
  ¡ Processing where the action is
  ¡ Architecture given by application
  ¡ Sensors and actuators distributed
» Heterogeneous systems
  ¡ Many different makes of CPUs
  ¡ Standardized at the network/bus

Timing Aspects

» Interrupt latency
  ¡ Important criterion for embedded
  ¡ A few clock cycles at most
  ¡ Measure of RTOS performance
» Real-Time = predictability
  ¡ In-order pipelines
  ¡ SRAM instead of caches
  ¡ Lockable caches
  ¡ Several small CPUs instead of one big

Military Shipboard

» Standard multiprocessor UltraSparc servers for radar, target tracking, combat control, …
» Many CPUs in missiles, gun controls, engines, …

Mobile Phone Base Station

» Handle signals
  ¡ Data streams to and from phones
  ¡ Massively parallel system
  ¡ Thousands of DSP tasks
  ¡ Perfect parallel scalability
» Custom or standard DSPs
  ¡ Up to 8 DSPs on a single chip

Trends

» Hardware to software
  ¡ Increase flexibility, lower cost
  ¡ Software on fast processor can equal HW
  ¡ Thanks to modern semiconductors
» Software to hardware
  ¡ Better power consumption & performance
  ¡ Design custom hardware for application
» Hardware-software codesign
  ¡ Delay division HW/SW to late in project
  ¡ Obtain “optimal” HW/SW division

System-on-a-chip

» Integration extreme
» Entire product on a chip
» One or more processors, accelerators, …

[Diagram: on-chip bus connecting DSP, data mem, Bluetooth, GSM radio, CPU, LCD driver and code memory]

Embedded Processing

Microcontrollers

» Classic embedded hardware
» Standard parts
  ¡ Quite broad application domains
  ¡ Sold in large series
  ¡ Defined by hardware vendors
  ¡ As cheap as a single dollar
» Single processor + devices
» Huge number of variants
» Usually intended for control plane

Microcontroller

» A single chip:
  ¡ CPU Core
  ¡ Integrated memory
  ¡ Integrated peripherals
  ¡ Integrated services
» Goal:
  ¡ System on one chip
  ¡ No external HW
  ¡ Fit application “perfectly”

[Diagram: CPU core with RAM (small), ROM (big), A/D, Timer, UART and LCD blocks facing the outside world]

Microcontroller

» CPU Bitness: 4 to 64 bits
  ¡ Most common: 8 bit (4G units)
  ¡ 32-bit growing fastest
  ¡ 32/64-bit outnumbers desktop
» Frequency: DC to 2 GHz
» Memory on-chip: 0.5 kB to 5 MB
» Power: mW (and up)
» 1/30 to 10 instructions per cycle

Example: PIC 12CE674

» Memory arch: Harvard
» Program memory: 2048 x 14 (OTP/Flash)
» EEPROM: 16 bytes
» RAM: 128 bytes
» ADC channels: 4 (8 bits)
» I/O ports: 6
» Timers: One 8-bit, One WDT
» Clock: on-chip crystal, 10 MHz
» Package: 8 pins (Pentium 4: 700 pins)
» Cost: <$1.00 (Pentium 4: >$200.00)

Example: AT91M42800A

» ARM7TDMI 32-bit core
  ¡ Static design: 0 to 33 MHz
» Memory
  ¡ 8 kB SRAM on chip
  ¡ External memory interface, 8/16 bit interface
» Devices
  ¡ 6 timers
  ¡ 2 serial ports
» JTAG debug interface
» 144 pin package
» About 0.5 W power
» About 18 USD
» One of 13 AT91 variants

Devices on the Chip

» Interface with the world
  ¡ Digital I/O
  ¡ Analog/Digital conversion
  ¡ Digital/Analog conversion
» Communications
  ¡ CAN networks
  ¡ Ethernet networks
  ¡ Radio
  ¡ Serial ports (UART, USART)
  ¡ USB, FireWire, ...

Devices on the Chip

» Timers
  ¡ Trigger interrupts
  ¡ Watchdogs
» Graphics
  ¡ LCD drivers
  ¡ 2D/3D graphics acceleration
» Buses
  ¡ On-chip: between devices: AMBA, …
  ¡ Off-chip: PCI, HyperTransport, RapidIO …

ASIPs / ASSPs

» Application-specific integrated/standard processor
  ¡ Targeting a particular niche market
  ¡ More targeted than microcontroller
  ¡ Domain-specific accelerators
» Usually more upscale
  ¡ 32-bit processors
  ¡ Multiprocessors
  ¡ Expensive peripherals
  ¡ External memory assumed
  ¡ Higher performance, includes data-plane

Example: PowerQUICC III

» Motorola
» Target market
  ¡ Communications
» Processing
  ¡ PowerPC e500
  ¡ 666-1000 MHz
  ¡ 256 kB L2 cache
» Networking
  ¡ CPM module, RISC-based microcode
» About 160 USD

Features                                  Count
Serial Communications Controller (SCC)    4
Fast Communications Controller (FCC)      3
Multi-Channel Controller (MCC2)           2
Serial Management Controller (SMC)        2
Serial Peripheral Interface (SPI)         1
I2C controller                            1
DDR Memory controller                     1
PCI-X/PCI controller                      1
RapidIO controller                        1
Ethernet 10/100/1000 controller           2

Networking Capabilities                   Count
Ethernet, 10 (from SCC)                   4
Ethernet, 10/100 (from FCC)               3
Ethernet 10/100/1000                      2
Utopia II ATM (from FCC)                  2
Multichannel HDLC (from MCC2)             256

Example: C167CS

» Infineon
» Target Market
  ¡ Automotive control
» Processing
  ¡ 16-bit C16x core
  ¡ 4-stage simple pipeline
  ¡ 40 MHz operation
  ¡ 16 MB memory space, including ROM, RAM, devices
» 144 pin package
  ¡ Tolerates -40 C to +125 C
» About 25 USD

Devices                                   Count
CAN 2.0b controllers                      2
General-Purpose Timers (GPT)              5
Watch-Dog Timer (WDT)                     1
Pulse-Width Modulator (PWM)               1
Analog-Digital Converter Channels         24+8
USART                                     1
Synchronous Serial Comms (SSC)            1
Capture/Compare Channels                  2x16

External Ports                            Count
CAN interfaces                            2
8-bit ports from devices                  8
16-bit ports from devices                 1

Memory                                    Size
ROM                                       32 kB
Fast General Internal RAM (IRAM)          3 kB
Extension Internal RAM (XRAM)             8 kB

Example: Cisco Toaster3

» 8 clusters of 2 processors each
» Total capacity: about 5 GOps, at around 160 MHz
» Each TMC is a VLIW machine with 74 bit instructions, 2k instructions in local memory
» Two 32-bit ALUs and three control/data movement units per TMC

Image from Report, Oct 2002

Example: Cisco Toaster3

» Massive multiprocessing
  ¡ 16 cores on a chip
  ¡ 4 chips in serial
  ¡ Routing:
    – 10 Gbps
    – @ 20 Mpackets/s
    – 1000 ops per packet passing through

FPGA

» Field Programmable Gate Array
  ¡ Reconfigurable hardware: “soft logic”
    – “Program” is circuit layout
    – Can be changed after initial load
  ¡ Kilos to Megs of “gates” available
» Competitor to ASICs
  ¡ More expensive per unit, but no start-up cost for manufacturing
  ¡ Less flexible, slightly slower
  ¡ Perfect for low-volume products
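The FPGA-versus-ASIC trade-off above (higher unit price, no manufacturing start-up cost) amounts to a break-even calculation. The sketch below assumes a simple linear cost model in which the ASIC carries a one-time start-up cost and the FPGA carries none. The function name and all monetary figures are hypothetical and chosen only for illustration.

```python
import math


def break_even_units(asic_startup_cost, asic_unit_cost, fpga_unit_cost):
    """Smallest volume at which the ASIC's total cost is no higher than the FPGA's.

    Assumed model: FPGA total = units * fpga_unit_cost,
                   ASIC total = asic_startup_cost + units * asic_unit_cost.
    """
    if fpga_unit_cost <= asic_unit_cost:
        raise ValueError("no break-even point unless the FPGA costs more per unit")
    return math.ceil(asic_startup_cost / (fpga_unit_cost - asic_unit_cost))


# Hypothetical figures, not taken from the slides.
print(break_even_units(asic_startup_cost=1_000_000, asic_unit_cost=10, fpga_unit_cost=20))
# 100000
```

Below the break-even volume the FPGA is the cheaper choice, which is the sense in which it is "perfect for low-volume products".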

FPGA Architecture

» Computation cells
  ¡ Programmable function
    – Adder, Logic funcs, ...
    – Memory, Registers, ...
» Input/Output cells
» Interconnect
  ¡ Reconfigurable
  ¡ Programmable

FPGA Architecture

» Computation cells
  ¡ Look-Up Table
    – Arbitrary 4-input, 1-output function
  ¡ Coarse-grained
    – Lots of functionality
    – Several LUTs
    – Plus flip-flops etc.
  ¡ Fine-grained
    – Little functionality

[Diagram: config RAM feeding a LUT]
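A look-up table with 4 inputs and 1 output is fully described by 16 configuration bits, one per input combination. The following is a minimal software model of that idea, not a description of any particular vendor's cell. The helper names `make_lut4` and `config_for` and the bit ordering are assumptions made for this sketch.

```python
def make_lut4(config_bits):
    """Return a 4-input, 1-output function defined by a 16-bit configuration word.

    Bit i of config_bits is the output for the input combination whose
    encoding (a = bit 0, b = bit 1, c = bit 2, d = bit 3) equals i.
    """
    if not 0 <= config_bits < (1 << 16):
        raise ValueError("configuration must fit in 16 bits")

    def lut(a, b, c, d):
        index = (a & 1) | ((b & 1) << 1) | ((c & 1) << 2) | ((d & 1) << 3)
        return (config_bits >> index) & 1

    return lut


def config_for(func):
    """Build the 16-bit configuration word that makes a LUT compute func."""
    bits = 0
    for index in range(16):
        a, b, c, d = index & 1, (index >> 1) & 1, (index >> 2) & 1, (index >> 3) & 1
        if func(a, b, c, d):
            bits |= 1 << index
    return bits


# "Program" the same kind of cell as a 4-input AND, then as a 4-input XOR (parity).
and4 = make_lut4(config_for(lambda a, b, c, d: a & b & c & d))
xor4 = make_lut4(config_for(lambda a, b, c, d: a ^ b ^ c ^ d))

print(and4(1, 1, 1, 1), and4(1, 0, 1, 1))  # 1 0
print(xor4(1, 0, 0, 0), xor4(1, 1, 0, 0))  # 1 0
```

Changing the function only means loading a different 16-bit word, which is the sense in which the "program" of an FPGA is configuration data rather than instructions.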

FPGA with CPU Cores

» CPU on-board FPGA
  ¡ HW accelerate critical tasks in FPGA fabric
  ¡ Data pumps in FPGA
  ¡ Control in CPU
» Cool new possibilities
  ¡ Reconfigure FPGA online
  ¡ Adapt to workloads

Soft CPUs in FPGAs

» Processor in the FPGA fabric
  ¡ “Soft” processor
  ¡ Special design considerations
» Examples
  ¡ Nios
  ¡ Microblaze
  ¡ Research projects
    – Västerås ARM clone
    – Leon processor also prototyped

Examples

» Altera Apex 20kC
  – “Volume”
  – 30k to 1.5M gates
» Xilinx Virtex II:
  – “High-end”
  – 1-4 PPC405 cores (optional)
  – 10M gates
» ATMEL FPSLIC:
  – “Low-end”
  – AVR 8-bit CPU
  – 50k gates
» Altera
  – “Advanced”
  – 10 Mbit RAM
  – 28 DSP elements
  – 100000 LE
  – 1300 user I/O pins
  – Optimized for Nios
  – Price at about $1000

Case Study: ARM 1026EJ-S

Overview

The Basics: ARM1026EJ-S

» Not a stand-alone processor
» For integration in your own chips
» Processor package:
  ¡ CPU core
  ¡ Caches, configurable in size
  ¡ Tightly-coupled memories, configurable in size
  ¡ Bus interface
  ¡ MMU (supports WinCE, Symbian, etc.)

Business Model

» Sold as an IP Core
  ¡ IP = “Intellectual Property”
  ¡ Not a physical chip, just a design
  ¡ “Source code component”
  ¡ Similar in scope to classic processor
» For integration in ASICs
  ¡ ASIC = Application-specific integrated circuit

ASICs

» Fully custom chips
  ¡ Custom for your application
  ¡ As small or large as necessary
» Characteristics
  ¡ Expensive to develop
    – 10s of engineers, often 100s
  ¡ Large series necessary to pay off
    – At least 100 000 units necessary on average
    – Mostly for large companies
  ¡ To streamline: build from IP blocks

IP Blocks

» IP
  ¡ Hardware components
  ¡ Integrated on chip by customer
» Examples:
  ¡ CPU Cores
  ¡ Memory
  ¡ Buses
  ¡ Network interfaces
  ¡ Accelerator circuits

[Diagram: on-chip bus connecting DSP, data mem, Bluetooth, GSM radio, CPU, LCD driver and code memory]

CPU Cores

» The biggest “IP” business
» “Fabless” chip companies
» Biggest players:
  ¡ ARM (best-selling 32-bit architecture)
  ¡ MIPS (and its licensees)
» Crowded field
  ¡ New companies appear monthly
  ¡ Niched components can find a market

Component Styles

» Hard IP:
  ¡ Tied to a particular fab process
    – Like IBM 0.13u Cu, TSMC 0.18, etc.
  ¡ Black box to customer
» Synthesizable IP:
  ¡ Source code for compilation by customer
  ¡ Offers configuration options like cache sizes, TCMs
  ¡ MIPS 24k, ARM 9S, 1026S, 1136S
» Soft IP:
  ¡ Get full source code for the component
  ¡ Purpose is to customize heavily
  ¡ ARC ARCtangent 5, Tensilica Xtensa V

Synthesizable Vs Hard IP

» Synthesizable
  + Use any process
  + Use any fab
  + Customize details
  + Customize chips
  + Add instructions
  - Slower memories
  - Higher power
  - Lower performance
» Hard IP
  + Optimized layout
  + Small area
  + Low power
  + Best performance
  - No flexibility

For best results, cores need to be redesigned to be synthesizable

1026EJ-S Core

» 6-stage pipeline:
  ¡ Max clock, best case: 475 MHz
    – Depends on process, synthesis used
  ¡ Optimized for synthesis of core
  ¡ Integer-only
» Power:
  ¡ Depends on process & configuration
  ¡ Quoted numbers: 0.5 mW/MHz
    – With 16kB+16kB L1 caches
    – 130 nm process at TSMC
    – (: >35 mW/MHz)

ARM1026EJ-S Pipeline

[Pipeline diagram: Fetch, Issue, Decode, then Shift/ALU, Sat, Write; MAC1, MAC2; LS1, LS2, LS write]

» Static branch prediction (75% accurate): uses less power than dynamic
» Return stack (single entry). Simple but effective
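The power and clock figures quoted for the core can be combined into a rough estimate. This sketch assumes that power scales linearly with clock frequency and that the 0.5 mW/MHz figure applies at the best-case clock. Both are simplifying assumptions rather than vendor data, and `core_power_mw` is a hypothetical helper name.

```python
def core_power_mw(mw_per_mhz, clock_mhz):
    """Power estimate in mW, assuming power scales linearly with clock frequency."""
    return mw_per_mhz * clock_mhz


# Figures quoted on the slide: 0.5 mW/MHz and a best-case clock of 475 MHz.
print(core_power_mw(0.5, 475))  # 237.5

# A hypothetical lower operating point, for comparison.
print(core_power_mw(0.5, 100))  # 50.0
```

Under these assumptions the core stays below a quarter of a watt even at its best-case clock.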

ARM1026EJ-S Pipeline

[Pipeline diagram: Fetch, Issue, Decode, then Shift/ALU, Sat, Write; MAC1, MAC2; LS1, LS2, LS write]

» ARM/Thumb/Java decode
» Access to coprocessors
» Register read, initialize memory accesses
» Evaluate immediates

ARM1026EJ-S Pipeline

[Pipeline diagram: Fetch, Issue, Decode, then Shift/ALU, Sat, Write; MAC1, MAC2; LS1, LS2, LS write]

» Execution pipeline for most integer instructions
» Handle saturated arithmetic
» Execution pipeline for multiply-accumulate instructions

ARM1026EJ-S Pipeline

[Pipeline diagram: Fetch, Issue, Decode, then Shift/ALU, Sat, Write; MAC1, MAC2; LS1, LS2, LS write]

» 2 stage memory access to support slow synthesized memory
» Decoupled pipeline for loads and stores

Rounding Out

» Configurable caches
  ¡ Typically 16kB/16kB
» Optional TCMs
» Memory interface
  ¡ 2 x 64 bit AMBA AHB links
» Optional vector FP coprocessor
» Optional vector interrupt controller

ARM1026EJ-S System

[Diagram: ARM1026EJ-S core with I-TCM, D-TCM, I$, D$ and BIU; VIC10 interrupt coprocessor, VFP10 FP coprocessor, ETM10RV trace/debug with debug port connection; 64-bit AMBA/AHB data bus for I and 64-bit AMBA/AHB data bus for D, connected to FLASH and RAM]

TCM

» Tightly-Coupled Memories
» Alternative to caches
  ¡ As fast as caches
  ¡ Programmer-controlled
  ¡ No automatic management
  ¡ Cheaper to implement
  ¡ More predictable in behavior
» Programming:
  ¡ In memory map
  ¡ Tagged like caches

Instruction Sets for ARM

» Base: ARM v5
  ¡ 32-bit integer-only instruction set
» T: Thumb instruction set
  ¡ 16-bit, for smaller code size
» J: Jazelle extensions
  ¡ Java support in hardware
  ¡ Implements 140 out of 228 JVM byte codes
» E: DSP extensions
  ¡ Done in regular registers
  ¡ Saturation, some more MACs

The ARM Instruction Set

» Continuous evolution
  ¡ Add features required by market
  ¡ RISC? Not anymore, if ever
» Now at v6, in the ARM11 family
  ¡ v5, v5E in ARM9 and ARM10
  ¡ v4 in old ARM7
  ¡ Backwards compatibility!

T: Thumb

» Compressed instruction set
  ¡ 16-bit encoding of (parts of) 32-bit instruction set
  ¡ Limitations in ARM/Thumb:
    – Only access to 8 registers (16 in ARM mode)
    – No system operations
» Effect:
  ¡ More but smaller instructions
    – 30% more, at half size
  ¡ Usually some performance loss
    – (Perform better on narrow buses)

T: Thumb

» Thumb shrinks the code:

Code size per benchmark; the second row of each pair is the size relative to ARM.

            Thumb    ARM      386      8088     68020    SPARC
eqntott     10608    16768    17640    19106    20542    22256
            0.63     1.00     1.05     1.14     1.23     1.33
xlisp       26388    40768    28097    29401    46746    44648
            0.65     1.00     0.69     0.72     1.15     1.10
espresso    72596    109923   125686   137194   131854   142752
            0.66     1.00     1.14     1.25     1.20     1.30

Source: Microprocessor Report, March 1995
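The relative sizes in the code-size table can be recomputed from the raw numbers, and they line up with the "30% more, at half size" rule of thumb. The sketch below copies the sizes from the table; the dictionary layout and the helper name `relative_to_arm` are choices made for this example only.

```python
# Code sizes copied from the table (Microprocessor Report, March 1995).
CODE_SIZE = {
    "eqntott": {"Thumb": 10608, "ARM": 16768, "386": 17640,
                "8088": 19106, "68020": 20542, "SPARC": 22256},
    "xlisp": {"Thumb": 26388, "ARM": 40768, "386": 28097,
              "8088": 29401, "68020": 46746, "SPARC": 44648},
    "espresso": {"Thumb": 72596, "ARM": 109923, "386": 125686,
                 "8088": 137194, "68020": 131854, "SPARC": 142752},
}


def relative_to_arm(sizes):
    """Express every instruction set's code size as a fraction of the ARM size."""
    arm = sizes["ARM"]
    return {isa: size / arm for isa, size in sizes.items()}


for benchmark, sizes in CODE_SIZE.items():
    ratios = relative_to_arm(sizes)
    print(benchmark, " ".join(f"{isa}={ratio:.2f}" for isa, ratio in ratios.items()))

# Rule of thumb: 30% more instructions, each at half the size.
print(f"{1.30 * 0.5:.2f}")  # 0.65
```

The Thumb ratios come out between 0.63 and 0.66, close to the 0.65 that the rule of thumb predicts.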

T2: Doing a Better Thumb

» ARM Thumb: fixed 16-bit size
  ¡ Saves 28% space compared to 32-bit ARM
  ¡ Runs 20% slower than 32-bit ARM
» ARM Thumb 2: mixed 16/32
  ¡ Brand new, arrives with ARM1156
  ¡ Saves 26% space compared to 32-bit ARM
  ¡ Runs 2% slower than 32-bit ARM
  ¡ (Introduces some new instructions)
» Conclusion: mixed length good!

Source: Microprocessor Report, June 2003

Why T?

» Pushed by mobile phones
  ¡ More memory = more expensive
  ¡ More memory = bigger package
  ¡ More memory = higher power
» More features in same memory!
» Performance is not critical

T: Competitors

» Compressed instruction sets
  ¡ MIPS16e, shrunk MIPS32 ISA
  ¡ ARC
  ¡ Tensilica
» All-small instruction sets
  ¡ SH family
» Compressed code
  ¡ IBM PowerPC 405 GX
  ¡ Decompress when loaded into cache

J: Jazelle

» Hardware Java acceleration
  ¡ Pushed by mobile phones
» Why?
  ¡ To fix Java performance problems
» SW JVM problems:
  ¡ Minimal clock frequency = low interpreter performance
  ¡ JIT requires more memory

E: DSP Extensions

» A few new instructions
  ¡ Saturated arithmetic
    – Add, Sub
  ¡ Signed multiply, MAC
    – 2 16-bit values in one register
    – 16x16
    – 32x16
  ¡ Count leading zeroes
  ¡ Load/store pairs of registers
» Fairly typical “DSP” additions

Why E?

» Enhance DSP performance
» Of stand-alone ARM core
» Avoid multiprocessor solution
  ¡ Hard disk controllers, for example
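The operations listed for the E extensions (saturated add and subtract, 16x16 multiply-accumulate on halfwords packed in one register, count leading zeroes) can be modelled in a few lines. This is a behavioural sketch in plain Python, not a description of the actual instruction encodings or flag behaviour. All function names here are made up for the example.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def saturate32(value):
    """Clamp an integer to the signed 32-bit range instead of wrapping around."""
    return max(INT32_MIN, min(INT32_MAX, value))


def saturating_add(a, b):
    return saturate32(a + b)


def saturating_sub(a, b):
    return saturate32(a - b)


def wrapping_add(a, b):
    """Ordinary two's-complement add, for comparison: overflow wraps."""
    result = (a + b) & 0xFFFFFFFF
    return result - 2**32 if result >= 2**31 else result


def signed16(value):
    """Interpret the low 16 bits of value as a signed halfword."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def mac16(acc, packed_x, packed_y, x_top=False, y_top=False):
    """16x16 multiply-accumulate on halfwords packed two per 32-bit register."""
    x = signed16(packed_x >> 16 if x_top else packed_x)
    y = signed16(packed_y >> 16 if y_top else packed_y)
    return acc + x * y


def count_leading_zeroes(value):
    """Number of leading zero bits in a 32-bit value (32 for zero)."""
    return 32 - (value & 0xFFFFFFFF).bit_length()


print(wrapping_add(INT32_MAX, 1))     # -2147483648
print(saturating_add(INT32_MAX, 1))   # 2147483647
print(mac16(10, 0x0003FFFE, 0x00050004))                          # 2
print(mac16(10, 0x0003FFFE, 0x00050004, x_top=True, y_top=True))  # 25
print(count_leading_zeroes(1), count_leading_zeroes(0))           # 31 32
```

Saturation matters in signal processing because a clipped sample is a small error, while a wrapped sample flips sign and becomes a large one.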

E: Competition

» DSP-in-processor
  ¡ “MAC=DSP”
  ¡ Almost all embedded processors have it
  ¡ No revolution in performance
» DSP/processor hybrids
  ¡ Infineon Tricore
  ¡ Microchip DSPic
  ¡ Hard to get it right, not a big success so far
» SIMD extensions
  ¡ More extensive additions than v5E
  ¡ Requires new functional units
  ¡ Major performance gain possible

SIMD Extensions

» Heavy-weight addition
  ¡ New functional units, registers
  ¡ Small vector computers
» Examples:
  ¡ ARM SIMD extensions (in v6)
  ¡ Motorola Altivec
  ¡ MIPS
  ¡ MMX-SSE-SSE2-3Dnow!
  ¡ SPARC VIS

SIMD Extensions

» Target
  ¡ Motorola
  ¡ PPC 7455 (G4+)
  ¡ 1 GHz
» EEMBC
  ¡ Telemark suite
  ¡ Networking suite
» OOTB:
  ¡ Out-of-the-box
» OPT:
  ¡ Manually tuned to use Altivec
» Overall/Average:
  ¡ 3-4 times speed up can be expected

[Bar chart: OOTB versus OPT speedup, axis 0 to 10 with one bar off the scale at 35,1, for Autocorr 1, Bit alloc 1, Convolution 1, FFT 1, Viterbi 1, Route 1, OSPF 1 and Packet 512]

ARM vs DSP

» Despite “E” and “SIMD”...
» Standard solution:
  ¡ Dual-core setup
  ¡ ARM core
  ¡ DSP core
» Control vs data

Control vs Data

» Control plane:
  ¡ Standard processor tasks
  ¡ Decision-making
  ¡ “Integer applications”
  ¡ UI of a phone, packet routing, …
» Data plane:
  ¡ Move or process data
  ¡ Performance is key
  ¡ Signal processing, multimedia, …
  ¡ Floating/fixed point

ARM-DSP: TI OMAP 5910

» Target market
  ¡ Data-intense real-time
  ¡ Audio, biometrics, etc.
» Processing
  ¡ Dual-core chip
  ¡ ARM925T 150 MHz
  ¡ TI C55 DSP 150 MHz
» Power 230 mW
» Price 32 USD

[Block diagram: C55x DSP core with 24k I$, 96k instr SRAM, 64k data SRAM and DSP private devices; ARM925 CPU core with 16k I$, 8k D$, MMU and ARM private devices; DSP shared devices, ARM shared devices and system devices; Mem Ctrl, LCD Ctrl and 192k shared SRAM; 75 MHz]

Devices: USB 1.1, LCD controller, MMC/SD card intf, camera interface, keyboard interface, RTC, I2C, 8 serial ports, 3 UARTs, 14 GPIO pins

ARM Family: ARM Cores

[Chart: performance versus time; axis labels include the years 1994, 1998, 2000 and 2002 and a 550 MHz clock figure]

» ARM7: 3-stage pipe, unified cache, low power
» ARM9: 5-stage pipe, I/D caches
» ARM9E: Java, DSP
» ARM10: 6-stage pipe, static BP, 64-bit BIU, FP
» ARM11: 8-stage pipe, dynamic BP

ARM Family: Intel Chips

[Chart: performance versus time, with ARM7, ARM9, ARM9E, ARM10 and ARM11 shown for reference; axis labels include the years 1995, 2000 and 2001]

» StrongARM: legendary performer, 5-stage pipe, I/D caches
  ¡ Intel got this from Digital in 1998. A single variant, big in PDAs.
» XScale: 7-10-stage pipe, dynamic BP, OOO-completion, 800 MHz
  ¡ Intel makes chips based on the XScale; does not license the core to 3rd parties

Configurable Instruction Sets

Instruction Sets: Configure

» Configurable instruction sets
  ¡ Adapt to needs of application
  ¡ User can specialize the processor
  ¡ Less waste on generality
  ¡ Fast evolution of instruction sets
» Traditionally:
  ¡ Chip manufacturers determine instruction sets aimed at some niche
  ¡ Slow evolution of instruction sets

Instruction Sets: Configure

» Subsetting
  ¡ There is a limited and predefined set of instructions available
  ¡ Easy to compile for: restrict code gen
  ¡ Remove instructions to simplify core
» Addition
  ¡ Freedom to invent instructions
  ¡ Tool chain: assembly, C compilers
  ¡ Genuine development of ISAs

Configurable Instruction Sets

» Tight integration:
  ¡ Add to regular pipeline
  ¡ Additional functional units
  ¡ Adding fine-grained instructions
» Loose integration:
  ¡ Coprocessor interface
  ¡ Slower communication
  ¡ Offloading of macro-scale tasks
  ¡ Method to invoke accelerator circuits

Configurability Trend

» Pioneers
  ¡ Tensilica Xtensa
  ¡ Arc Arctangent
  ¡ Configurability as key selling point
» Added to general architectures
  ¡ MIPS: “CorExtend”
  ¡ PowerPC: “BookE ASU”
  ¡ Usually less tight integration

Benefit of Configurability

» Target
  ¡ Xtensa III
  ¡ 200 MHz
» EEMBC
  ¡ Telemark suite
  ¡ Networking suite
» OOTB:
  ¡ Out-of-the-box
  ¡ 25k gate core
» OPT:
  ¡ Tuned code
  ¡ 25k base core gates
  ¡ 18k extra instr gates
  ¡ 100k DSP coproc
  ¡ 37k config gates
» Speedups

Benchmark           OOTB    OPT
Telemark overall    1       37
Autocorr            1       9
Convolution         1       1249
Bit alloc           1       34
FFT                 1       24
Viterbi GSM         1       14


Configuration Tools

instruction set choices

Gate and memory size counters
