The ARM Architecture

1 Agenda . Introduction to ARM Ltd ARM Architecture/Programmers Model Data Path and Pipelines SoC Design Development Tools

2 ARM Ltd

. Founded in November 1990 . Spun out of Acorn . Initial funding from Apple, Acorn and VLSI

. Designs the ARM range of RISC cores . Licenses ARM core designs to semiconductor partners who fabricate and sell to their customers . ARM does not fabricate silicon itself

. Also develop technologies to assist with the design- in of the ARM architecture . Software tools, boards, debug hardware . . architectures . Peripherals, etc

3 ARM’s Activities

Connected Community Development Tools Software IP

Processors memorymemory System Level IP: Data Engines SoCSoC Fabric 3D Graphics

Physical IP

4 ARM Connected Community – 700+

5 5 Huge Range of Applications

IR Fire Detector Exercise Utility Intelligent Machines Energy Efficient Appliances Intelligent toys Meters Vending Tele-parking

Equipment Adopting 32-bit ARM

6 World’s Smallest ARM ?

Battery Solar Cells Wireless Sensor Network

Sensors, timers

Cortex-M0 +16KB RAM 65nm UWB antenna 10 kB Storage memory ~3fW/bit

12µAh Li-ion Battery

A B Processor, SRAM and PMU

Wirelessly networked into large scale sensor arrays

Cortex-M0; 65¢ University of Michigan

7 World’s Largest ARM Computer?

4200 ARM powered Neutrino Detectors

70 bore holes 2.5km deep

60 detectors per string starting 1.5km down

1km3 of active telescope

Work supported by the National Science Foundation and University of Wisconsin-Madison

8 From 1mm3 to 1km3

1mm3 1km3

10¢ $1000

Mobile Home Mobile Embedded Consumer Enterprise PC HPC

9 Agenda

Introduction to ARM Ltd . ARM Architecture/Programmers Model Data Path and Pipelines SoC Design Development Tools

10 ARM Cortex Processors (v7)

. ARM Cortex-A family (v7-A): . Applications processors for full OS and 3rd party applications x1-4 Cortex-A15 ...2.5GHz x1-4 Cortex-A9 . ARM Cortex-R family (v7-R): Cortex-A8 x1-4 . Embedded processors for real-time Cortex-A5 signal processing, control applications 1-2 R Heron Cortex-R4 . ARM Cortex-M family (v7-M): Cortex-M4 SC300™ . -oriented processors Cortex™-M3 for MCU and SoC applications Cortex-M1

Cortex-M0 12k gates...

11 Relative Performance*

2500

2000 ) z h M (

y 1500 c n e u q

e 1000 r F

x a M 500

0 Cortex- Cortex- Cortex-A9 ARM7 ARM926 ARM1026 ARM1136 ARM1176 Cortex-A8 M0 M3 Dual-core Max Freq (MHz) 50 150 184 470 540 610 750 1100 2000 Min Power (mW/MHz) 0.012 0.06 0.35 0.235 0.36 0.335 0.568 0.43 0.5 *Represents attainable speeds in 130, 90, 65, or 45nm processes

12 Cortex family

Cortex-A8 Cortex-R4 Cortex-M3 . Architecture v7A . Architecture v7R . Architecture v7M . MMU . MPU (optional) . MPU (optional) . AXI . AXI . AHB Lite & APB . VFP & NEON support . Dual Issue

13 Cortex-M0 DesignStart

ARM Cortex-M4

“32-bit/DSC” applications Efficient digital signal control

ARM Cortex-M3

“16/32-bit” applications Performance efficiency

ARM Cortex-M0

“8/16-bit” applications Low-cost & simplicity

14 Cortex-M0 DesignStart (2)

ARM Cortex-M0 processor Full product “M0_DS” features options implementation Zero jitter 32-bit RISC core   AMBA AHB-lite interface   M i n

ARMv6-M instruction set architecture i

  m u

NVIC controller   m

u s

Interrupt line configurations 1 to 32 16 only a b l Debug (SWD, JTAG) option  e Up to 4 breakpoints, 2 watchpoints  Low power optimisations (ACG)  Multiple power domain support with WIC  Fast multiplier (1 cycle) option  System timer   Area (gates) 12k – 25k 16K

15 ARM and Thumb Performance

30000

25000

20000 Dhrystone 2.1/sec @ 20MHz 15000 ARM Thumb 10000

5000

0 32-bit 16-bit 16-bit with 32-bit stack

Memory width (zero wait state)

16 The Thumb-2 instruction set

. Variable-length instructions . ARM instructions are a fixed length of 32 bits . Thumb instructions are a fixed length of 16 bits . Thumb-2 instructions can be either 16-bit or 32-bit

. Thumb-2 gives approximately 26% improvement in code density over ARM

. Thumb-2 gives approximately 25% improvement in performance over Thumb

17 Processor Modes

. The ARM has seven basic operating modes:

. User : unprivileged mode under which most tasks run

. FIQ : entered when a high priority (fast) interrupt is raised

. IRQ : entered when a low priority (normal) interrupt is raised

. Supervisor : entered on reset and when a Software Interrupt instruction is executed . Abort : used to handle memory access violations

. Undef : used to handle undefined instructions

. System : privileged mode using the same registers as user mode

18 The ARM Register Set

Current Visible Registers

r0 IRQFIQUndefUserSVCAbort Mode Mode Mode Mode Mode Mode r1 r2 r3 Banked out Registers r4 r5 r6 User FIQ IRQ SVC Undef Abort r7 r8 r8 r8 r9 r9 r9 r10 r10 r10 r11 r11 r11 r12 r12 r12 r13 (sp) r13 (sp) r13 (sp) r13 (sp) r13 (sp) r13 (sp) r13 (sp) r14 (lr) r14 (lr) r14 (lr) r14 (lr) r14 (lr) r14 (lr) r14 (lr) r15 (pc)

cpsr spsr spsr spsr spsr spsr spsr

19 Exception Handling

. When an exception occurs, the ARM: . Copies CPSR into SPSR_ . Sets appropriate CPSR bits . Change to ARM state 0x1C FIQ 0x18 IRQ . Change to exception mode 0x14 (Reserved) . Disable (if appropriate) 0x10 Data Abort . Stores the return address in LR_ 0x0C Prefetch Abort . Sets PC to vector address 0x08 Software Interrupt 0x04 Undefined Instruction . To return, exception handler needs to:0x00 Reset . Restore CPSR from SPSR_ Vector Table . Restore PC from LR_ Vector table can be at 0xFFFF0000 on ARM720T This can only be done in ARM state. and on ARM9/10 family devices

20 Cortex-M3 Programmer’s Model

Main

r0 . Fully programmable in C r1 Stack-based exception model r2 . r3 Only two processor modes r4 . r5 . Mode for User tasks r6 r7 . Handler Mode for OS tasks and exceptions r8 Vector table contains addresses r9 . r10 r11 r12 sp sp lr r15 (pc)

xPSR

21 Conditional Execution and Flags

. ARM instructions can be made to execute conditionally by postfixing them with the appropriate condition code field. . This improves code density and performance by reducing the number of forward branch instructions. CMP r3,#0 CMP r3,#0 BEQ skip ADDNE r0,r1,r2 ADD r0,r1,r2 skip

. By default, data processing instructions do not affect the condition code flags but the flags can be optionally set by using “S”. CMP does not need “S”. loop … SUBS r1,r1,#1 decrement r1 and set flags BNE loop if Z flag clear then branch

22 Data processing Instructions . Consist of : . Arithmetic: ADD ADC SUB SBC RSB RSC . Logical: AND ORR EOR BIC . Comparisons: CMP CMN TST TEQ . Data movement: MOV MVN

. These instructions only work on registers, NOT memory.

. Syntax:

{}{S} Rd, Rn, Operand2

. Comparisons set flags only - they do not specify Rd . Data movement does not specify Rn . Second operand is sent to the ALU via .

23 Using a Barrel Shifter:The 2nd Operand

Operand Operand Register, optionally with shift operation 1 2 . Shift value can be either be: . 5 bit unsigned integer . Specified in bottom of Barrel another register. Shifter . Used for multiplication by constant

Immediate value . 8 bit number, with a range of 0-255. Rotated right through even ALU . number of positions . Allows increased range of 32-bit constants to be loaded directly into Result registers

24 Agenda

Introduction to ARM Ltd ARM Architecture/Programmers Model . Data Path and Pipelines SoC Design Development Tools

25 The ARM7TDM Core

ABE A[31:0] Address Incrementer

Address Register Incrementer

P C BIGEND MCLK nWAIT Register Bank PC Update nRW Instruction A MAS[1:0] Decode Stage Decoder L A B ISYNC Instruction nIRQ U Decompression nFIQ Multiplier nRESET B B and ABORT nTRANS B u u Read Data nMREQ u s Barrel s Register SEQ Control LOCK s Shifter Logic nM[4:0] Write Data nOPC Register nCPI 32 Bit ALU CPA CPB

DBE D[31:0]

26 Cortex-M3

I_HRDATA Instruction Decode

Write Data D_HWDATA Address Register Incrementer D_HRDATA D_HADDR Read Data Address Register Register

B

Address Register Barrel Incrementer Mul/Div Bank Shifter I_HADDR ALU A ALU Address Register Writeback

INTADDR

27 Pipeline changes for ARM9TDMI

ARM7TDMI

ARM decode Instruction Reg Reg ThumbARM Shift ALU Fetch decompress Read Write Reg Select

FETCH DECODE EXECUTE

ARM9TDMI

ARM or Thumb Inst Decode Reg Instruction Shift + ALU Memory Fetch Reg Reg Access Write Decode Read FETCH DECODE EXECUTE MEMORY WRITE

28 Cortex-M3 Pipeline . Cortex-M3 has 3-stage fetch-decode-execute pipeline . Similar to ARM7 . Cortex-M3 does more in each stage to increase overall performance

1st Stage - Fetch 2nd Stage - Decode 3rd Stage - Execute

Address Data Phase AGU Phase & Write Load/Store & Back Branch Instruction Fetch Decode & Write (Prefetch) Multiply & Divide Register Read

Branch Shift ALU & Branch Branch forwarding & speculation

Execute stage branch (ALU branch & Load Store Branch)

29 ARM10 vs. ARM11 Pipelines

ARM10

Branch ARM or Memory Prediction Shift + ALU Thumb Reg Read Access Reg Instruction Write Instruction Multiply Decode Multiply Fetch Add FETCH ISSUE DECODE EXECUTE MEMORY WRITE

ARM11

Shift ALU Saturate

Fetch Fetch MAC MAC MAC Write Decode Issue 1 2 1 2 3 back

Data Data Address Cache 1 2

30 Full Cortex-A8 Pipeline Diagram

13-Stage Integer Pipeline 10-Stage NEON Pipeline A r c h i N t e E c O t u N r a r l e r e g g i s i t s e t e r r f i l f e i l e

31 Agenda

Introduction to ARM Ltd ARM Architecture/Programmers Model Data Path and Pipelines . SoC Design Development Tools

32 An Example AMBA System

High Performance APB ARM processor UART

High Bandwidth AHB Timer APB External Bridge Memory Keypad Interface

High-bandwidth DMA PIO on- RAM Bus Master

Low Power Non-pipelined High Performance Simple Interface Pipelined Burst Support Multiple Bus Masters

33 AHB Structure

Arbiter

HADDR HADDR HWDATA Slave Master HWDATA #1 #1 HRDATA HRDATA

Address/Control Slave #2 Master #2

Write Data Slave Read Data #3 Master #3

Slave #4 Decoder

34 AHB basic signal timing

Address Phase Data Phase A Data Phase B A Address Phase B HCLK

HADDR A B C

HWRITE A B C

HWDATA A B

HRDATA A B

HRESP OKAY A OKAY B

HREADY

35 Mali200 + GP2 SoC Integration

. Shipped as synthesizable Clock Reset IRQs Mali 200 Mali GP2 . Mali 200 + GP2 requires a IDLEs single instant in the SoC, with a small number of AXI Fabric connections to be made.

APB . IDLES can be used for gating the Mali200 and GP2 AXI MMU core clock

36 Typical GPU SoC Design

Mali 200 Mali GP2

ARM1176JZF Int PL390 nRst D I GIC Local AXI Interconnect CLCD Sys PL111 Mali L230 Ctrl SDRAMC DDR MMUM S M PL340 PHY

APB

PL301 High-performance matrix

APB Peripheral Sub-System

. Designed and optimised for AMBA: provides easier integration with ARM cores and fabric IP . Unified Memory Architecture

37 Physical IP*

. Classic (180nm to 90nm): Access to ARM Physical IP . Everything needed to implement a chip . High-quality libraries and memories

. DesignStart: Free access to ARM processor IP . ARM926EJ™ hardened from 180nm to 90nm for major foundry processes . Separate license needed to produce silicon . SoC designs can be done with these models

* Material is currently limited to research programs

38 ARM PIPD Logic Product Families

d e

e High Performance (Advantage-HS / SAGE-HS) p S (SC12)(SC12) h g P i o w H e r

™ High Density (Advantage ) M a

™ n E a High Density (SAGE-X ) (SC9 / SC10)(SC9 SC10) C g O e

m

(SC9 / SC10(SC9 SC10 tappedtapped)) K i e t s n t r

K e a i t

e ™ s w Ultra High Density (Metro ) r o A

P (SC7 / SC8)(SC7 SC8)

w w o o L KitsKitsPower L ECO Kits

Process 180nm 130nm 90nm 65nm 45nm 32nm 28nm Geometry

39 Agenda

Introduction to ARM Ltd ARM Architecture/Programmers Model Data Path and Pipelines SoC Design . Development Tools

40 ARM Debug Architecture

Ethernet

Debugger (+ optional trace tools)

JTAG port Trace Port . EmbeddedICE Logic . Provides breakpoints and processor/system access TAP . JTAG interface (ICE) controller . Converts commands to JTAG ETM signals . Embedded trace Macrocell (ETM) EmbeddedICE Logic . Compresses real-time instruction and data access trace . Contains ICE features (trigger & filter logic) . Trace port analyzer (TPA) ARM . Captures trace in a deep buffer core

41 Development Tools for ARM

. Includes ARM macro assembler, (ARM RealView C/C++ , Keil CARM Compiler, or GNU compiler), ARM linker, Keil uVision Debugger and Keil uVision IDE . Keil uVision Debugger accurately simulates on-chip peripherals (I2C, CAN, UART, SPI, Interrupts, I/O Ports, A/D and D/A converters, PWM, etc.) . Evaluation Limitations . 16K byte object code + 16K data limitation . Some linker restrictions such as base addresses for code/constants . GNU tools provided are not restricted in any way . http://www.keil.com/demo/

42 Keil Development Tools for ARM

43 University Resources

.http://www.arm.com/support/university/

[email protected]

44 Your Future at ARM…

. Graduate and Internship/Co-op Opportunities . Engineering: Memory, Validation, Performance, DFT, R&D, GPU and more! . Sales and Marketing: Corporate and Technical . Corporate: IT, Patents, Services (Training and Support), and Human Resources

. Incredible Culture and Comprehensive Benefit Package . Competitive Reward . Work/Life Balance . Personal Development . Brilliant Minds and Innovative Solutions

. Keep in Touch! . www.arm.com/about/careers

45 TI Panda Board

OMAP4430 Processor . 1 GHz Dual-core ARM Cortex-A9 (NEON+VFP) . C64x+ DSP . PowerVR SGX 3D GPU . 1080p Video Support

POP Memory . 1 GB LPDDR2 RAM

USB Powered . < 4W max consumption (OMAP small % of that) . Many adapter options (Car, wall, battery, solar, ..)

46 Project Ideas Using Panda . OS Projects . OS to ARM/Cortex (TI OMAP) . MythTV system . “Super-Panda” – stack of Pandas as compute engine and task distribution . applications

. NEON Optimization Projects . Codec optimization in (pick your favorite codec) . Voice and image recognition . Open-source Flash player optimizations (swfdec)

47 Fin

48 N95 Multimedia Computer

OMAP™ 2420 Applications Processor ARM1136™ processor-based SoC, developed using Magma ® Blast® family and winner of 2005 INSIGHT Award for ‘Most Innovative SoC’ OS™ v9.2 supporting ARM processor-based mobile devices, developed using ARM® RealView® Compilation Tools

S60™ 3rd Edition S60 Platform supporting ARM processor-based mobile devices Mobiclip™ Video Codec Software video codec for ARM processor-based mobile devices ST WLAN Solution Ultra-low power 802.11b/g WLAN chip with ARM9™ processor-based MAC Connect. Collaborate. Create.

49 Beagle Board

50 Targeting community development Wikis, blogs, $149 Personally promotion of > 1000 participants affordable community and growing activity Active & technical Freedom to community innovate Addressing open source Open access to Instant access to hardware community >10 million lines documentation needs of code Opportunity Free to tinker and software learn

51 Fast, low power, flexible expansion

OMAP3530 Processor Peripheral I/O . 600MHz Cortex-A8 . DVI-D video out . NEON+VFPv3 3” . 16KB/16KB L1$ . SD/MMC+ . 256KB L2$ . S-Video out . 430MHz C64x+ DSP . USB 2.0 HS OTG . 32K/32K L1$ . I2C, I2S, SPI, . 48K L1D MMC/SD . 32K L2 . JTAG . PowerVR SGX GPU . Stereo in/out . 64K on-chip RAM . Alternate power POP Memory . RS-232 serial . 128MB LPDDR RAM . 256MB NAND flash USB Powered . 2W maximum consumption . OMAP is small % of that . Many adapter options . Car, wall, battery, solar, …

52 And more… On-going collaboration at BeagleBoard.org . Live chat via IRC for 24/7 community support . Links to software projects to download

Other Features . 4 LEDs 3” . USR0 Peripheral I/O . USR1 . DVI-D video out . PMU_STAT . SD/MMC+ . PWR . 2 buttons . S-Video out . USER . USB HS OTG . RESET . I2C, I2S, SPI, . 4 boot sources MMC/SD . SD/MMC . JTAG . NAND flash . Stereo in/out . USB . Alternate power . Serial . RS-232 serial

53 Project Ideas Using Beagle . OS Projects . OS porting to ARM/Cortex (TI OMAP) . MythTV system . “Super-Beagle” – stack of Beagles as compute engine and task distribution . Linux applications

. NEON Optimization Projects . Codec optimization in ffmpeg (pick your favorite codec) . Voice and image recognition . Open-source Flash player optimizations (swfdec)

54