Quick viewing(Text Mode)

Lessons from the ARM Architecture

Lessons from the ARM Architecture

Lessons from the ARM Architecture

Richard Grisenthwaite Lead Architect and Fellow ARM

1 ARM Applications

2 Overview ƒ Introduction to the ARM architecture ƒ Definition of “Architecture” ƒ History & evolution ƒ Key points of the basic architecture ƒ Examples of ARM implementations ƒ Something to the micro-architects interested ƒ My Lessons on Architecture design ƒ …or what I wish I’d known 15 years ago

3 Definition of “Architecture” ƒ The Architecture is the contract between the Hardware and the Software ƒ Confers rights and responsibilities to both the Hardware and the Software ƒ MUCH more than just the instruction set

ƒ The architecture distinguishes between: ƒ Architected behaviors: ƒ Must be obeyed ƒ May be just the limits of behavior rather than specific behaviors ƒ Implementation specific behaviors – that expose the micro-architecture ƒ Certain areas are declared implementation specific. E.g.: ƒ Power-down ƒ and TLB Lockdown ƒ Details of the Performance Monitors ƒ Code obeying the architected behaviors is portable across implementations ƒ Reliance on implementation specific behaviors gives no such guarantee

ƒ Architecture is different from Micro-architecture ƒ What vs How

4 History ƒ ARM has quite a lot of history ƒ First ARM core (ARM1) ran code in April 1985… ƒ 3 stage pipeline very simple RISC-style processor ƒ Original processor was designed for the Acorn Microcomputer ƒ Replacing a 6502-based design ƒ ARM Ltd formed in 1990 as an “Intellectual Property” company ƒ Taking the 3 stage pipeline as the main building block ƒ This 3 stage pipeline evolved into the ARM7TDMI ƒ Still the mainstay of ARM’s volume ƒ Code compatibility with ARM7TDMI remains very important ƒ Especially at the applications level

ƒ The ARM architecture has features which derive from ARM1 ƒ Strong “applications level” compatibility focus in the ARM products

5 Evolution of the ARM Architecture ƒ Original ARM architecture: ƒ 32-bit RISC architecture focussed on core instruction set ƒ 16 Registers - 1 being the Program – generally accessible ƒ Conditional execution on all instructions ƒ Load/Store Multiple operations - Good for Code Density ƒ Shifts available on data processing and address generation ƒ Original architecture had 26-bit address space ƒ Augmented by a 32-bit address space early in the evolution ƒ Thumb instruction set was the next big step ƒ ARMv4T architecture (ARM7TDMI) ƒ Introduced a 16-bit instruction set alongside the 32-bit instruction set ƒ Different execution states for different instruction sets ƒ Switching ISA as part of a branch or exception ƒ Not a full instruction set – ARM still essential ƒ ARMv4 architecture was still focused on the Core instruction set only

6 Evolution of the Architecture (2) ƒ ARMv5TEJ (ARM926EJ-S) introduced: ƒ Better interworking between ARM and Thumb ƒ Bottom bit of the address used to determine the ISA ƒ DSP-focussed additional instructions ƒ -DBX for code interpretation in hardware ƒ Some architecting of the system ƒ ARMv6K (ARM1136JF-S) introduced: ƒ Media processing – SIMD within the integer ƒ Enhanced exception handling ƒ Overhaul of the memory system architecture to be fully architected ƒ Supported only 1 level of cache ƒ ARMv7 rolled in a number of substantive changes: ƒ Thumb-2* - variable length instruction set ƒ TrustZone* ƒ Jazelle-RCT ƒ Neon

7 Extensions to ARMv7 ƒ MPE – Extensions ƒ Added Cache and TLB Maintenance Broadcast for efficient MP ƒ VE - Virtualization Extensions ƒ Adds hardware support for virtualization: ƒ 2 stages of translation in the memory system ƒ New mode and privilege level for holding an ƒ With associated traps on many system relevant instructions ƒ Support for virtualization ƒ Combines with a System MMU ƒ LPAE – Large Physical Address Extensions ƒ Adds ability to address up to 40-bits of physical address space

8 VFP – ARM’s Floating-point solution ƒ VFP – “Vector Floating-point” ƒ Vector functionality has been deprecated in favour of Neon ƒ Described as a “” ƒ Originally a tightly-coupled coprocessor ƒ Executed instructions from ARM instruction stream via dedicated interface ƒ Now more tightly integrated into the CPU ƒ Single and Double precision floating-point ƒ Fully IEEE compliant ƒ Until VFPv3, implementations required support code for denorms ƒ Alternative Flush to Zero handling of denorms also supported ƒ Recent VFP versions: ƒ VFPv3 – adding more DP registers (32 DP registers) ƒ VFPv4 – adds Fused MAC and Half-precision support (IEEE754-2008)

9 ARM Architecture versions and products ƒ Key architecture revisions and products: ƒ ARMv1-ARMv3: largely lost in the mists of time ƒ ARMv4T: ARM7TDMI – first Thumb processor ƒ ARMv5TEJ(+VFPv2): ARM926EJ-S ƒ ARMv6K(+VFPv2): ARM1136JF-S, ARM1176JFZ-S, ARM11MPCore – first Multiprocessing Core ƒ ARMv7-A+VFPv3 Cortex-A8 ƒ ARMv7-A+MPE+VFPv3: Cortex-A5, Cortex-A9 ƒ ARMv7-A+MPE+VE+LPAE+VFPv4 Cortex-A15

ƒ ARMv7-R : Cortex-R4, Cortex-R5 ƒ ARMv6-M Cortex–M0 ƒ ARMv7-M: Cortex-M3, Cortex-M4

10 ARMv7TDMI

ƒ Simple 3 stage pipeline ƒ Fetch, Decode, Execute ƒ Multiple cycles in execute stage for Loads/Stores ƒ Simple core ƒ “Roll your own memory system”

11 ARM926EJ-S ƒ 5 stage pipeline single issue core ƒ Fetch, Decode, Execute, Memory, Writeback ƒ Most common instructions take 1 cycle in each pipeline stage ƒ Split Instruction/Data Level1 caches Virtually tagged ƒ MMU – hardware table walk based

Java Decode Java Decode Management Read

Thumb Decode Compute Sum/Accumulate Instruction Partial Products & Saturation Register Fetch Register Register Write Shift + ALU Memory Access Decode Read

Instruction ARM Decode Stream Register Register Decode Read

FETCH DECODE EXECUTE MEMORY WRITEBACK

12 ARM1176JZF-S ƒ 8 stage pipeline single issue ƒ Split Instruction/Data Level1 caches Physically tagged ƒ Two cycle memory latency ƒ MMU – hardware page table walk based ƒ Hardware branch prediction

PF1 PF2 DE ISS SH ALU SAT Decode Instr WB I-Cache Access + Issue ALU and MAC Pipeline + StaticB + ex Dynamic P Regist er MAC1 MAC2 MAC3 Branch Prediction RStack Read

LSU Pipeline LS WB DC1 DC2 add LS

13 Cortex-A8 ƒ Dual Issue, in-order ƒ 10 stage pipeline (+ Neon Engine) ƒ 2 levels of cache – L1 I/D split, L2 unified ƒ Aggressive Branch Prediction

F0 F1 F2 D0 D1 D2 D3 D4 E0 E1 E2 E3 E4 E5 M0 M1 M2 M3 N1N2 N3 N4 N5 N6

Branch mispredict penalty Instruction Execute and Load/Store NEON NEON register writeback Replay penalty Integer register writeback Integer ALU pipe ALU pipe 0 Integer MUL pipe MUL pipe 0 NEON Integer shift pipe Instruction ALU pipe 0 Instruction Decode Fetch Instruction Decode Non-IEEE FP ADD pipe LS pipe 0 or 1 Non-IEEE FP MUL pipe

L1 instruction cache miss IEEE FP engine L1 data cache miss L1 data Load and store data queue LS permute pipe L2 data

NEON store data BIU pipeline Embedded Trace Macrocell L1 L2 L3 L4 L5 L6L7 L8 L9 L2 Tag Array L2 Data Array T0T1T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13

External trace port L3 memory system

14 Cortex-A9 ƒ Dual Issue, out-of-order core ƒ MP capable – delivered as clusters of 1 to 4 CPUs ƒ MESI based coherency scheme across L1 data caches ƒ Shared L2 cache (PL310) ƒ Integrated interrupt controller

15 Cortex-A15 – Just Announced - Core Detail ƒ 2.5Ghz in 28 HP ƒ 12 stage in-order, 3-12 stage OoO pipeline ƒ 3.5 DMIPS/Mhz ~ 8750 DMIPS @ 2.5GHz ƒ ARMv7A with 40-bit PA ƒ Dynamic repartitioning Virtualization ƒ Fast state save and restore ƒ Move execution between cores/clusters ƒ 128-bit AMBA 4 ACE ƒ Supports system coherency 64-bit/128bit AMBA4 interface ƒ ECC on L1 and L2 caches

Eagle Pipeline Simple Cluster 0 Simple Cluster 1 Multiply Accumulate

Complex 3-12 Stage Fetch Fetch Decode Rename Decode Rename out-of-order Complex pipeline 12 Stage In-order pipeline Load & Store 0 (capable of 8 (Fetch, 3 Instruction issue) Decode, Rename) Load & Store 1

16 Just some basic lessons from experience ƒ Architecture is part art, part science ƒ There is no one right way to solve any particular problem ƒ ….Though there are plenty of wrong ways ƒ Decision involves balancing many different criteria ƒ Some quantitative, some more subjective ƒ Weighting those criteria is inherently subjective ƒ Inevitably architectures have fingerprints of their architects

ƒ Hennessy & Patterson quantitative approach is excellent ƒ But it is only a framework for analysis – a great toolkit

ƒ science – we experiment using benchmarks/apps ƒ …but the set of benchmarks/killer applications is not constant

ƒ Engineering is all about technical compromise, and balancing factors ƒ The art of Good Enough

17 First Lesson – It’s all about Compatibility

ƒ Customers absolutely expect compatibility ƒ Customers buy your roadmap, not just your products ƒ The software they write is just expected to work ƒ Trying to obsolete features is a long-term task

ƒ Issues from the real world: ƒ Nobody actually really knows what their code uses ƒ …and they’ve often lost the sources/knowledge ƒ People don’t use features as the architect intended ƒ Bright software engineers come up with clever solutions ƒ The clever solutions find inconvenient truths ƒ Compatibility is with what the hardware actually does ƒ Not with how you wanted it to be used ƒ There is a thing called “de facto architecture”

18 Second lesson – orthogonality ƒ Classic computer texts tell you orthogonality is good

ƒ Beware false orthogonalities ƒ ARM architecture R15 being the ƒ Orthogonality says you can do lots of wacky things using the PC ƒ On a simple implementation, the apparent orthogonality is cheap

ƒ ARM architecture has “shifts with all data processing” ƒ Orthogonality from original ARM1 pipeline ƒ But the behaviour has to be maintained into the future ƒ Not all useful control configurations come in powers of 2 ƒ Fear the words “It just falls out of the design” ƒ True for today’s – but what about the next 20 years? ƒ Try to only architect the functionality you think will actually be useful ƒ Avoid less useful functionality that emerges from the micro-architecture

19 Third Lesson – Microarchitecture led features

ƒ Few successful architectures started as architectures ƒ Code runs on implementations, not on architectures ƒ People buy implementations, not architectures ƒ …IP licensing notwithstanding

ƒ Most architectures have “micro-architecture led features” ƒ “it just fell out” ƒ Optimisations based on first target micro-architecture ƒ MIPS – delayed branch slots ƒ ARM – the PC offset of 2 instructions ƒ Made the first implementation cheaper/easier than the pure approach ƒ …but becomes more expensive on subsequent implementations ƒ Surprisingly difficult trade-off ƒ Short-term/long-term balance ƒ Meeting a real need sustainably vs overburdening early implementations

20 Fourth Lesson: New Features ƒ Successful architectures get pulled by market forces ƒ Success in a particular market adds features for that market ƒ Different points of success pull successively over time ƒ Solutions don’t necessarily fit together perfectly ƒ Unsuccessful architectures don’t get same pressures ƒ …which is probably why they appear so clean!

ƒ Be very careful adding new features: ƒ Easy to add, difficult to remove ƒ Especially for user code ƒ “Trap and Emulate” is an illusion of compatibility ƒ Performance differential is too great for most applications

21 Lessons on New Features

ƒ New features rarely stay in the application space you expect…. ƒ …Or want – architects are depressingly powerless ƒ Customers will exploit whatever they find ƒ So worry about the general applicability of that feature ƒ Assume it has to go everywhere in the roadmap ƒ If a feature requires a combination of hardware and specific software… ƒ …be afraid – development timescales are different ƒ Be very afraid of language specific features ƒ All new languages appear to be different… ƒ ….but very rarely are ƒ Avoid solving this years problem in next year’s architecture ƒ ... Next year’s problem may be very different ƒ Point solutions often become warts – all architectures have them ƒ If the feature has a shelf-life, plan for obsolence ƒ Example: Jazelle-DBX in the ARM architecture

22 Thoughts on instruction set design ƒ What is the difference between an instruction and a micro-op? ƒ RISC principles said they were the same ƒ Very few PURE RISC architectures exist today ƒ ARM doesn’t pretend to be “hard-core” RISC ƒ …I’ll claim some RISC credentials ƒ Choice of Micro-ops is micro-architecture dependent ƒ An architecture should be micro-architecturally independent ƒ Therefore mapping of instructions to micro-ops is inherently “risky” ƒ Splitting instructions easier than fusing instructions ƒ If an instruction can plausibly be done in one block, might be right to express it ƒ Even if some micro-architectures are forced to split the instruction ƒ But, remember legacy lasts for a very long time ƒ Imagine explaining the instructions in 5 years time ƒ Avoid having instructions that provide 2 ways to do much the same thing ƒ Everyone will ask you which is better - lots of times… ƒ If it feels a bit clunky when you first design it… ƒ …. it won’t improve over time

23 Final point – Architecture is not enough ƒ Not Enough to ensure perfect “write once, run everywhere”

ƒ Customers expectations of compatibility go beyond architected behaviour ƒ People don’t write always code to the architecture ƒ …and they certainly can’t easily test it to the architecture ƒ ARM is developing tools to help address this ƒ Architecture Envelope Models – a suite of badly behaved legal implementations ƒ The architecture defines whether they are software bugs or hardware incompatibilities ƒ …allows you to assign blame (and fix the problem consistently)

ƒ Beware significant performance anomalies between architectural compliant cores ƒ If I buy a faster core, I want it to go reliably faster…without recompiling

ƒ Multi-processing introduce huge scope for functional differences from timing ƒ Especially in badly synchronised code ƒ Concurrency errors

ƒ BUT THE ARCHITECTURE IS A STRATEGIC ASSET

24 History – the Passage of Time

25 Forum 1992

26 Count the Architectures (11)

ARM MIPS 29K N32x16 PA 88xxx

NVAX Alpha

SPARC x86 x86i960 x86

27 The Survivors – At Most 2

ARM MIPS

SPARC x86

28 What Changed? ƒ Some business/market effects ƒ Some simple scale of economy effects ƒ Some technology effects ƒ Adoption and Ecosystem effects

ƒ It wasn’t all technology – not all of the disappearing architectures were bad ƒ Not all survivors are good! ƒ Not all good science succeeds in the market!

ƒ Never forget “Good enough”

29 Thank you

Questions

30