Aspects of ISAs
• Begin with the Von Neumann model
  • Implicit structure of all modern ISAs
  • CPU + memory (data & insns)
  • Sequential instructions
• Format
  • Length and encoding
• Operand model
  • Where (other than memory) are operands stored?
• Datatypes and operations
• Control


The Sequential Model
• Implicit model of all modern ISAs
• Basic feature: the program counter (PC)
  • Defines total order on dynamic instructions
  • Next PC is PC++ (except for control insns)
• Order + named storage define computation
  • Value flows from insn X to insn Y via storage A iff:
    insn X names A as an output and insn Y names A as an input
• Logically executes the loop: Fetch[PC] → Decode → Read Inputs → Execute → Write Output → Next PC
• Instruction execution assumed atomic
  • Instruction X finishes before insn X+1 starts
• More parallel alternatives have been proposed

Instruction Length and Format
• Length
  • Fixed length
    • Most common is 32 bits
    + Simple implementation (next PC often just PC+4)
    – Code density: 32 bits to increment a register by 1
  • Variable length
    + Code density
    + Can do the increment in one 8-bit instruction
    – Complex fetch (where does the next instruction begin?)
  • Compromise: two lengths
    • E.g., MIPS16 or ARM's Thumb
• Encoding
  • A few simple encodings simplify the decoder
  • x86 decoder one nasty piece of logic
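The sequential model above is logically a loop: Fetch[PC], Decode, Read Inputs, Execute, Write Output, Next PC. A minimal C sketch of that loop for a made-up fixed-length toy ISA (the opcodes, field layout, and 32-entry register file are assumptions for illustration, not any real ISA):

    #include <stdint.h>

    #define MEM_WORDS 1024
    static uint32_t imem[MEM_WORDS];   /* instruction memory (toy) */
    static int32_t  regs[32];          /* named storage: register file */

    /* Hypothetical fixed 32-bit encoding: op in bits 31-26, regs below that */
    enum { OP_ADD = 0, OP_ADDI = 1, OP_BEQZ = 2, OP_HALT = 63 };

    void run(uint32_t pc) {
        for (;;) {
            uint32_t insn = imem[pc / 4];              /* Fetch[PC] */
            uint32_t op = insn >> 26;                  /* Decode */
            uint32_t rd = (insn >> 21) & 31, rs = (insn >> 16) & 31, rt = (insn >> 11) & 31;
            int32_t  imm = (int16_t)(insn & 0xFFFF);   /* sign-extended immediate */
            uint32_t next_pc = pc + 4;                 /* default: sequential order */

            switch (op) {                              /* Read inputs, Execute, Write output */
            case OP_ADD:  regs[rd] = regs[rs] + regs[rt]; break;
            case OP_ADDI: regs[rd] = regs[rs] + imm;      break;
            case OP_BEQZ: if (regs[rd] == 0) next_pc = pc + 4 + (imm << 2); break;
            case OP_HALT: return;
            }
            pc = next_pc;                              /* Next PC: insn boundary is atomic */
        }
    }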

Example Instruction Encodings
• MIPS
  • Fixed length
  • 32 bits, 3 formats, simple encoding
    R-type: Op(6) Rs(5) Rt(5) Rd(5) Sh(5) Func(6)
    I-type: Op(6) Rs(5) Rt(5) Immed(16)
    J-type: Op(6) Target(26)
• x86
  • Variable length encoding (1 to 16 bytes)
    Prefix*(1-4) Op OpExt* ModRM* SIB* Disp*(1-4) Imm*(1-4)

Operations and Datatypes
• Datatypes
  • S/W: attribute of data
  • H/W: attribute of operation; data is just 0s and 1s
• All processors support
  • 2's complement integer arithmetic/logic (8/16/32/64-bit)
  • IEEE 754 floating-point arithmetic (32/64-bit)
    • Intel has 80-bit floating-point
• Most processors now support
  • "Packed-integer" insns, e.g., MMX
  • "Packed-fp" insns, e.g., SSE/SSE2
  • For multimedia, more about these later
• Processors no longer (??) support
  • Decimal, other fixed-point arithmetic
  • Binary-coded decimal (BCD)

Where Does Data Live?
• Memory
  • Fundamental storage space
• Registers
  • Faster than memory, quite handy
  • Most processors have these too
• Immediates
  • Values spelled out as bits in instructions
  • Input only

How Much Memory? Address Size
• What does "64-bit" in a 64-bit ISA mean?
  • Support memory size of 2^64
  • Alternative (wrong) definition: width of calculation operations
• "Virtual" address size
  • Determines size of addressable (usable) memory
• x86 evolution:
  • 4-bit (4004), 8-bit (8008), 16-bit (8086), 24-bit (80286)
  • 32-bit + protected memory (80386)
  • 64-bit (AMD's Opteron & Intel's EM64T Pentium 4)
• Most ISAs moving to 64 bits (if not already there)
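Because the MIPS encoding above is fixed-length with regular field positions, a decoder can pull out every field with constant shifts and masks. A small C sketch of that field extraction (the struct name and sample encoding are just for illustration):

    #include <stdint.h>
    #include <stdio.h>

    /* MIPS fields: Op(6) Rs(5) Rt(5) Rd(5) Sh(5) Func(6) for R-type,
       Op(6) Rs(5) Rt(5) Immed(16) for I-type, Op(6) Target(26) for J-type. */
    typedef struct {
        uint32_t op, rs, rt, rd, sh, func;  /* R-type fields */
        int32_t  imm;                       /* I-type immediate, sign-extended */
        uint32_t target;                    /* J-type target */
    } MipsFields;

    MipsFields decode(uint32_t insn) {
        MipsFields f;
        f.op     = (insn >> 26) & 0x3F;
        f.rs     = (insn >> 21) & 0x1F;
        f.rt     = (insn >> 16) & 0x1F;
        f.rd     = (insn >> 11) & 0x1F;
        f.sh     = (insn >>  6) & 0x1F;
        f.func   =  insn        & 0x3F;
        f.imm    = (int16_t)(insn & 0xFFFF);
        f.target =  insn & 0x03FFFFFF;
        return f;
    }

    int main(void) {
        uint32_t add_r1_r2_r3 = 0x00430820;   /* add $1, $2, $3 (R-type) */
        MipsFields f = decode(add_r1_r2_r3);
        printf("op=%u rs=%u rt=%u rd=%u func=0x%x\n", f.op, f.rs, f.rt, f.rd, f.func);
        return 0;
    }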


How Many Registers?
• Registers faster than memory: have as many as possible?
• No
  • One reason registers are faster: there are fewer of them
    • Small is fast (hardware truism)
  • Another: they are directly addressed (no address calculation)
  – More of them means larger specifiers
  – Fewer registers per instruction, or indirect addressing
• Not everything can be put in registers
  • Structures, arrays, anything pointed-to
  – More registers → more saving/restoring
• Trend: more registers: 8 (x86) → 32 (MIPS) → 128 (IA-64)
  • 64-bit x86 has 16 64-bit integer and 16 128-bit FP registers

How Are Memory Locations Specified?
• Registers are specified directly
  • Register names are short, encoded in instructions
  • Some instructions implicitly read/write certain registers
• How are addresses specified?
  • Addresses are long (64-bit)
  • Addressing modes: how are insn bits converted to addresses?


Memory Addressing
• Addressing mode: way of specifying an address
  • Used in mem-mem or load/store instructions in a register ISA
• Examples
  • Register-indirect: R1=mem[R2]
  • Displacement: R1=mem[R2+immed]
  • Index-base: R1=mem[R2+R3]
  • Memory-indirect: R1=mem[mem[R2]]
  • Auto-increment: R1=mem[R2], R2=R2+1
  • Auto-indexing: R1=mem[R2+immed], R2=R2+immed
  • Scaled: R1=mem[R2+R3*immed1+immed2]
  • PC-relative: R1=mem[PC+imm]
• What high-level program idioms are these used for?
• What implementation impact? What impact on insn count?

Addressing Modes Examples
• MIPS
  • Displacement: R1+offset (16-bit)
  • Experiments showed this covered 80% of accesses on VAX
• x86 (MOV instructions)
  • Absolute: zero + offset (8/16/32-bit)
  • Register indirect: R1
  • Indexed: R1+R2
  • Displacement: R1+offset (8/16/32-bit)
  • Scaled: R1 + (R2*Scale) + offset (8/16/32-bit), Scale = 1, 2, 4, 8
• 2 more issues: alignment & …
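A C sketch of how several of the modes listed above compute an effective address; the 16-entry register file, the tiny byte-addressed memory, and the 8-byte step in the auto-increment case are assumptions for illustration:

    #include <stdint.h>
    #include <string.h>

    static int64_t reg[16];
    static uint8_t mem[1 << 16];

    static int64_t load64(uint64_t addr) { int64_t v; memcpy(&v, &mem[addr], sizeof v); return v; }

    /* Register-indirect:  R1 = mem[R2]                    */
    int64_t reg_indirect(int r2)                     { return load64((uint64_t)reg[r2]); }
    /* Displacement:       R1 = mem[R2 + immed]            */
    int64_t displacement(int r2, int64_t imm)        { return load64((uint64_t)(reg[r2] + imm)); }
    /* Index-base:         R1 = mem[R2 + R3]               */
    int64_t index_base(int r2, int r3)               { return load64((uint64_t)(reg[r2] + reg[r3])); }
    /* Memory-indirect:    R1 = mem[mem[R2]]               */
    int64_t mem_indirect(int r2)                     { return load64((uint64_t)load64((uint64_t)reg[r2])); }
    /* Auto-increment:     R1 = mem[R2], then bump R2 (by the 8-byte access size here) */
    int64_t auto_increment(int r2)                   { int64_t v = load64((uint64_t)reg[r2]); reg[r2] += 8; return v; }
    /* Scaled (x86-style): R1 = mem[R2 + R3*scale + disp], scale in {1,2,4,8} */
    int64_t scaled(int r2, int r3, int scale, int64_t disp) { return load64((uint64_t)(reg[r2] + reg[r3] * scale + disp)); }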

How Many Explicit Operands / ALU Insn?
• Operand model: how many explicit operands per ALU insn?
  • 3: general-purpose
    add R1,R2,R3 means [R1] = [R2] + [R3] (MIPS)
  • 2: multiple explicit accumulators (output also input)
    add R1,R2 means [R2] = [R2] + [R1] (x86)
  • 1: one implicit accumulator
    add R1 means ACC = ACC + [R1]
  • 0: hardware stack
    add means STK[TOS++] = STK[--TOS] + STK[--TOS]
  • 4+: useful only in special situations
• Examples show register operands, but operands can be memory addresses or mixed register/memory
• ISAs with register-only ALU insns = load-store architectures

Operand Model Pros and Cons
• Metric I: static code size
  • Want: many implicit operands (stack), high-level insns
• Metric II: data memory traffic
  • Want: many long-lived operands on-chip (load-store)
• Metric III: CPI
  • Want: short latencies, little variability (load-store)
• CPI and data memory traffic more important these days
• Trend: most new ISAs are load-store ISAs or hybrids
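To make the 0-operand (hardware stack) row above concrete, here is a minimal C sketch of a stack machine executing an add in the spirit of STK[TOS++] = STK[--TOS] + STK[--TOS]; the push/pop helpers and fixed stack depth are illustrative assumptions:

    #include <stdint.h>
    #include <stdio.h>

    static int64_t stk[64];
    static int     tos = 0;                 /* points one past the top of stack */

    static void    push(int64_t v) { stk[tos++] = v; }
    static int64_t pop(void)       { return stk[--tos]; }

    /* 0-operand add: both sources and the destination are implicit. */
    static void add(void) { int64_t b = pop(), a = pop(); push(a + b); }

    int main(void) {
        /* Computes 2 + 3 with no explicit operands on the "add":
           push 2; push 3; add */
        push(2);
        push(3);
        add();
        printf("%lld\n", (long long)pop());  /* prints 5 */
        return 0;
    }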


Control Transfers
• Default next-PC is PC + sizeof(current insn)
  • Note: the PC is called the IP (instruction pointer) in x86
• Branches and jumps can change that
  • Otherwise dynamic program == static program
    • Not useful
• Computing targets: where to jump to
  • For all branches and jumps
  • Absolute / PC-relative / indirect
• Testing conditions: whether to jump at all
  • For (conditional) branches only
  • Compare-and-branch / condition codes / condition registers

Control Transfers I: Targets
• The issues
  • How far (statically) do you need to jump? (within function vs. outside)
  • Do you need to jump to a different place each time?
  • How many bits do you need to encode the target?
• PC-relative
  • Position-independent within a procedure
  • Used for branches and jumps within a procedure
• Absolute
  • Position-independent outside a procedure
  • Used for procedure calls
• Indirect (target found in register)
  • Needed for jumping to dynamic targets
  • For returns, dynamic procedure calls, switch statements
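A C sketch of how the three target styles above turn instruction bits into a next PC; the 4-byte instruction size and field widths are assumptions for a generic fixed-length ISA (the absolute case follows a MIPS-like 26-bit-target splice):

    #include <stdint.h>

    /* PC-relative: target encoded as a signed word offset from the branch.
       Position-independent within a procedure. */
    uint64_t target_pc_relative(uint64_t pc, int32_t offset_words) {
        return pc + 4 + ((int64_t)offset_words << 2);
    }

    /* Absolute: target bits give (most of) the address directly.
       Usable across procedures, e.g., for calls. */
    uint64_t target_absolute(uint64_t pc, uint32_t target_field) {
        /* MIPS-like: keep the high bits of PC+4, splice in the 26-bit field */
        return ((pc + 4) & ~0x0FFFFFFFULL) | ((uint64_t)target_field << 2);
    }

    /* Indirect: target is whatever a register holds — needed for returns,
       dynamic procedure calls, and switch statements. */
    uint64_t target_indirect(const uint64_t *regs, int r) {
        return regs[r];
    }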


Control Transfers II: Testing Conditions ISAs Also Include Support For… • Compare and branch insns • Operating systems & memory protection branch-less-than R1,10,target • Privileged mode +Simple – Two ALUs (for condition & target address) • System call (TRAP) –Extra latency • Exceptions & interrupts • Implicit condition codes (x86) • Interacting with I/O devices cmp R1,10 // sets “negative” CC/flag branch-neg target + More room for target, condition codes set “for free” • Multiprocessor support + Branch insn simple and fast – Implicit dependence is tricky • “Atomic” operations for synchronization • Conditions in regs, separate branch (MIPS) set-less-than R2,R1,10 • Data-level parallelism branch-not-equal-zero R2,target – Additional insns • Pack many values into a wide register + one ALU per insn, explicit dependence • Intel’s SSE2: 4x32-bit float-point values in 128-bit register > 80% of branches are (in)equalities/comparisons to 0 • Define parallel operations (four “adds” in one cycle)

RISC and CISC
• RISC: reduced-instruction set computer
  • Coined by Patterson in the early 80's
  • Berkeley RISC-I (Patterson), Stanford MIPS (Hennessy), IBM 801 (Cocke), PowerPC, ARM, SPARC, Alpha, PA-RISC
• CISC: complex-instruction set computer
  • Term didn't exist before "RISC"
  • x86, VAX, Motorola 68000, etc.

The RISC vs. CISC Debate
• Philosophical war (one of several) started in the mid 1980's
• RISC "won" the technology battles
• CISC won the high-end commercial war (1990s to today)
  • Compatibility a stronger force than anyone (but Intel) thought
• RISC won the embedded computing war


The Setup
• Pre 1980
  • Bad compilers (so assembly written by hand)
  • Complex, high-level ISAs (easier to write assembly)
• Around 1982
  • Moore's Law makes fast single-chip processors possible…
    …but only for small, simple ISAs
  • Performance advantage of "integration" was compelling
  • Compilers had to get involved in a big way
• RISC manifesto: create ISAs that…
  • Simplify single-chip implementation
  • Facilitate optimizing compilation

The RISC Tenets
• Single-cycle execution
  • CISC: many multicycle operations
• Hardwired control
  • CISC: microcoded multi-cycle operations
• Load/store architecture
  • CISC: register-memory and memory-memory
• Few memory addressing modes
  • CISC: many modes
• Fixed-length instruction format
  • CISC: many formats and lengths
• Reliance on compiler optimizations
  • CISC: hand assemble to get good performance
• Many registers (compilers are better at using them)
  • CISC: few registers


CISCs and RISCs
• The CISCs: x86, VAX (Virtual Address eXtension to PDP-11)
  • Variable length instructions: 1-321 bytes!!!
  • 14 GPRs + PC + stack-pointer + condition codes
  • Data sizes: 8, 16, 32, 64, 128 bit, decimal, string
  • Memory-memory instructions for all data sizes
  • Special insns: crc, insque, polyf, and a cast of hundreds
  • x86: "Difficult to explain and impossible to love"
• The RISCs: MIPS, PA-RISC, SPARC, PowerPC, Alpha, ARM
  • 32-bit instructions
  • 32 integer registers, 32 floating-point registers, load-store
  • 64-bit virtual address space
  • Few addressing modes (Alpha has 1, SPARC/PowerPC more)
  • Why so many? Everyone wanted their own

The Debate
• RISC argument
  • CISC is fundamentally handicapped by complexity
  • For a given technology, RISC will be better (faster)
  • Current technology enables single-chip RISC
    • When it enables single-chip CISC, RISC will be pipelined
    • When it enables pipelined CISC, RISC will have caches
    • When it enables CISC with caches, RISC will have the next thing…
• CISC rebuttal
  • CISC flaws not fundamental, fixable with more transistors
  • Moore's Law will narrow the RISC/CISC gap (true)
    • Good pipeline: RISC = 100K transistors, CISC = 300K
    • By 1995: 2M+ transistors had evened the playing field
  • Software costs dominate, compatibility is paramount

Current Winner (Volume): RISC
• ARM (Acorn RISC Machine → Advanced RISC Machine)
  • First ARM chip in the mid-1980s (from Acorn Computer Ltd.)
  • 1.2 billion units sold in 2004 (>50% of all 32/64-bit CPUs)
  • Low-power and embedded devices (iPod, for example)
  • Significance of embedded? ISA compatibility a less powerful force
• 32-bit RISC ISA
  • 16 registers, PC is one of them
  • Many addressing modes, e.g., auto-increment
  • Condition codes, each instruction can be conditional
• Multiple implementations
  • XScale (design was DEC's, bought by Intel, sold to Marvell)
  • Others: Freescale (was Motorola), Texas Instruments, STMicroelectronics, Samsung, Sharp, Philips, etc.

Current Winner (Revenue): CISC
• x86 was the first 16-bit microprocessor by ~2 years
  • IBM put it into its PCs because there was no competing choice
  • Rest is historical inertia and "financial feedback"
• x86 is the most difficult ISA to implement and do it fast, but…
  • Because Intel sells the most non-embedded processors…
  • It has the most money…
  • Which it uses to hire more and better engineers…
  • Which it uses to maintain competitive performance…
  • And given competitive performance, compatibility wins…
  • So Intel sells the most non-embedded processors…
• AMD as a competitor keeps pressure on x86 performance
• Moore's Law has helped Intel in a big way
  • Most engineering problems can be solved with more transistors


Intel's Compatibility Trick: RISC Inside
• 1993: Intel wanted out-of-order execution in the Pentium Pro
  • Hard to do with a coarse-grain ISA like x86
• Solution? Translate x86 into RISC micro-ops in hardware
  push $eax
  becomes (we think, uops are proprietary)
  store $eax [$esp-4]
  addi $esp,$esp,-4
  + Processor maintains x86 ISA externally for compatibility
  + But executes RISC ISA internally for implementability
• Given the translator, x86 almost as easy to implement as RISC
  • Intel implemented out-of-order before any RISC company
  • Also, OoO benefits x86 more (because ISA limits the compiler)
• Idea co-opted by other x86 companies: AMD and Transmeta

Enter Micro-Ops (1)
• Most instructions are a single micro-op (uop)
  • Add, xor, compare, branch, etc.
  • Load example: mov -4(%rax), %ebx
  • Store example: mov %ebx, -4(%rax)
• Each operation on a memory location → micro-ops++
  • "addl -4(%rax), %ebx" = 2 uops (load, add)
  • "addl %ebx, -4(%rax)" = 3 uops (load, add, store)
• What about address generation?
  • Simple address generation: single micro-op
  • Complicated (scaled) addressing & sometimes store addresses: calculated separately
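A C sketch of the cracking described above: a memory-source add becomes load + add uops, and a memory-destination add becomes load + add + store. The uop struct, opcode names, and the internal temporary register are invented for illustration; real uops are proprietary, as the slides note:

    #include <stdint.h>

    typedef enum { UOP_LOAD, UOP_STORE, UOP_ADD } UopKind;

    typedef struct {
        UopKind kind;
        int     dst, src1, src2;   /* register numbers (or an internal temp) */
        int     base;              /* base register of the memory operand */
        int32_t disp;              /* displacement of the memory operand */
    } Uop;

    /* "addl -4(%rax), %ebx"  (memory source)      -> 2 uops: load, add */
    int crack_add_mem_src(int dst_reg, int base, int32_t disp, Uop out[]) {
        int tmp = 100;  /* hypothetical internal temporary register */
        out[0] = (Uop){ UOP_LOAD, tmp, 0, 0, base, disp };
        out[1] = (Uop){ UOP_ADD,  dst_reg, dst_reg, tmp, 0, 0 };
        return 2;
    }

    /* "addl %ebx, -4(%rax)"  (memory destination) -> 3 uops: load, add, store */
    int crack_add_mem_dst(int src_reg, int base, int32_t disp, Uop out[]) {
        int tmp = 100;
        out[0] = (Uop){ UOP_LOAD,  tmp, 0, 0, base, disp };
        out[1] = (Uop){ UOP_ADD,   tmp, tmp, src_reg, 0, 0 };
        out[2] = (Uop){ UOP_STORE, 0, tmp, 0, base, disp };
        return 3;
    }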


Enter Micro-Ops (2)
• Function call (CALL) – 4 uops
  • Get target address, store program counter to stack, adjust stack pointer, unconditional jump to function start
• Return from function (RET) – 3 uops
  • Adjust stack pointer, load return address from stack, jump to return address
• Other operations
  • String-manipulation instructions
  • For example, STOS is around six micro-ops, etc.
• Micro-ops: part of the implementation, not the architecture

Cracking Macro-Ops into Micro-Ops
• Two forms of μop "cracking"
  • Hard-coded logic: fast, but expensive (for insns cracked into few μops)
    • Simple decoder: 1 → 1
    • Complex decoder: 1 → 2-4
    • ~4x in size
  • Table lookup: slow, but "off to the side"
    • Doesn't complicate the rest of the machine
    • Handles really complicated instructions
[Figure: fetched instructions (add, stos, cmp, ret, sub) flow through one complex and several simple decoders; a complicated insn like stos is cracked via a big table off to the side]

Micro-Op Changes over Time
• x86 code is becoming more "RISC-like"
• IA32 → x86-64:
  1. Double the number of registers
  2. Better function-calling conventions
• Result? Fewer pushes, pops, and complicated instructions
  • ~1.6 μops / macro-op → ~1.1 μops / macro-op
• Fusion: Intel's newest processors fuse certain instruction pairs
  • Macro-op fusion: fuses "compare" and "branch" instructions
    • 2 macro-ops → 1 simple micro-op (uses simple decoder)
  • Micro-op fusion: fuses ld/add pairs, fuses store "addr" & "data"
    • 1 complex macro-op → 1 simple macro-op (uses simple decoder)

Ultimate Compatibility Trick
• Support an old ISA with…
  • …a simple processor for that ISA in the system
• How the first Itanium supported x86 code
  • x86 processor (comparable to a Pentium) on chip
• How the PlayStation 2 supported PlayStation games
  • Used the PlayStation processor for its I/O chip & emulation


Translation and Virtual ISAs
• New compatibility interface: ISA + translation software
  • Binary translation: transform the static image, run native
  • Emulation: unmodified image, interpret each dynamic insn, optimize on-the-fly
  • Examples: FX!32 (x86 on Alpha), Rosetta (PowerPC on x86)
• Virtual ISAs: designed for translation, not direct execution
  • Target for high-level compiler (one per language)
  • Source for low-level translator (one per ISA)
  • Examples: Java bytecodes, C# CLR (Common Language Runtime)
• Transmeta's Code Morphing: x86 translation in software
  • Only the "code morphing" translation software is written in the native ISA
  • Native ISA is invisible to applications and even the OS
  • Guess who owns this technology now?


RISC & CISC for Performance
• Recall the performance equation:
  seconds/program = (instructions/program) × (cycles/instruction) × (seconds/cycle)
• CISC (Complex Instruction Set Computing)
  • insns/program ↓, cycles/insn ↑, seconds/cycle ↑
  + Easy for assembly-level programmers
  + Good code density
• RISC (Reduced Instruction Set Computing)
  • insns/program ↑ (hopefully not too much), cycles/insn ↓, seconds/cycle ↓ (if designed aggressively)
  + Smart compilers can help with insns/program
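A worked plug-in of the equation, only to show how the arrows above trade off; the instruction counts, CPIs, and clock rates below are made-up illustrative numbers, not measurements:

    #include <stdio.h>

    /* seconds/program = (insns/program) * (cycles/insn) * (seconds/cycle) */
    static double exec_time(double insns, double cpi, double clock_hz) {
        return insns * cpi * (1.0 / clock_hz);
    }

    int main(void) {
        /* Hypothetical design points: the CISC-ish one runs fewer insns at a
           higher CPI; the RISC-ish one runs more insns at a lower CPI and,
           here, a faster clock. */
        double cisc = exec_time(1.0e9, 3.0, 2.0e9);   /* = 1.50 s */
        double risc = exec_time(1.3e9, 1.2, 2.5e9);   /* ~ 0.62 s */
        printf("CISC-ish: %.2f s   RISC-ish: %.2f s\n", cisc, risc);
        return 0;
    }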
