
Instruction Set Principles and Examples
Chalermek Intanagonwiwat
Slides courtesy of Graham Kirby, Mike Schulte, and Peiyi Tang

Hot Topics in System Architecture
• 1950s and 1960s:
  – Computer Arithmetic
• 1970s and 1980s:
  – Instruction Set Design
  – ISA Appropriate for Compilers
• 1990s:
  – Design of CPU
  – Design of memory system
  – Instruction Set Extensions

Hot Topics in Computer Architecture (cont.)
• 2000s:
  – Computer Arithmetic
  – Design of I/O system
  – Parallelism

Instruction Set Architecture (ISA)
• “Instruction set architecture is the structure of a computer that a machine language programmer must understand to write a correct (timing independent) program for that machine.”
  – Source: IBM in 1964 when introducing the IBM 360 architecture, which eliminated 7 different IBM instruction sets.

ISA (cont.)
• The instruction set architecture is also the machine description that a hardware designer must understand to design a correct implementation of the computer.

ISA (cont.)
• The instruction set architecture serves as the interface between software and hardware.
• It provides the mechanism by which the software tells the hardware what should be done.

ISA (cont.)
High level language code: C, C++, Java, Fortran
  ↓ compiler
Assembly language code: architecture specific statements
  ↓ assembler
Machine language code: architecture specific bit patterns

software / instruction set / hardware

ISA Metrics
• Orthogonality
  – No special registers, few special cases, all operand modes available with any data type or instruction type
• Completeness
  – Support for a wide range of operations and target applications

ISA Metrics (cont.)
• Regularity
  – No overloading for the meanings of instruction fields
• Streamlined
  – Resource needs easily determined
• Ease of compilation (or assembly language programming)
• Ease of implementation

Instruction Set Design Issues
• Instruction set design issues include:
  – Where are operands stored?
    • registers, memory, stack, accumulator
  – How many explicit operands are there?
    • 0, 1, 2, or 3
  – How is the operand location specified?
    • register, immediate, indirect, . . .

Instruction Set Design Issues (cont.)
  – What type & size of operands are supported?
    • byte, int, float, double, string, vector . . .
  – What operations are supported?
    • add, sub, mul, move, compare . . .

Evolution of Instruction Sets
Single Accumulator (EDSAC 1950)
  ↓
Accumulator + Index Registers (Manchester Mark I, IBM 700 series 1953)
  ↓
Separation of Programming Model from Implementation
  – High-level Language Based (B5000 1963)
  – Concept of a Family (IBM 360 1964)
  ↓
General Purpose Register Machines
  – Complex Instruction Sets (VAX, 1977-80)
  – Load/Store Architecture (CDC 6600, Cray 1 1963-76)
  ↓
RISC (MIPS, SPARC, 88000, IBM RS6000, . . . 1987+)

Classifying ISAs
Accumulator (before 1960):
  1 address    add A            acc ← acc + mem[A]
Stack (1960s to 1970s):
  0 address    add              tos ← tos + next
Memory-Memory (1970s to 1980s):
  2 address    add A, B         mem[A] ← mem[A] + mem[B]
  3 address    add A, B, C      mem[A] ← mem[B] + mem[C]

Classifying ISAs (cont.)
Register-Memory (1970s to present):
  2 address    add R1, A        R1 ← R1 + mem[A]
               load R1, A       R1 ← mem[A]
Register-Register (Load/Store) (1960s to present):
  3 address    add R1, R2, R3   R1 ← R2 + R3
               load R1, R2      R1 ← mem[R2]
               store R1, R2     mem[R1] ← R2

Classifying ISAs (cont.)
[Figure]

Accumulator Architectures
• Instruction set:
  add A, sub A, mul A, div A, load A, store A
• Example: A*B - (A+C*B)
  Instruction    Accumulator after
  load B         B
  mul C          B*C
  add A          A+B*C
  store D        A+B*C
  load A         A
  mul B          A*B
  sub D          result

Accumulators: Pros and Cons
• Pros
  – Very low hardware requirements
  – Easy to design and understand
• Cons
  – Accumulator becomes the bottleneck
  – Little ability for parallelism or pipelining
  – High memory traffic

Stack Architectures
• Instruction set:
  add, sub, mul, div, . . .
  push A, pop A
• Example: A*B - (A+C*B)
  Instruction    Stack after (top first)
  push A         A
  push B         B, A
  mul            A*B
  push A         A, A*B
  push C         C, A, A*B
  push B         B, C, A, A*B
  mul            B*C, A, A*B
  add            A+B*C, A*B
  sub            result
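The two instruction sequences for A*B - (A+C*B) can be checked with a pair of toy interpreters. This is a minimal sketch, not a model of any real machine: the function names, the tuple encoding of instructions, and the sample values of A, B, and C are all assumptions made for illustration. The stack `sub` is taken to compute next - top, which is what the example sequence needs.

```python
def run_accumulator(program, mem):
    """Run one-address instructions; every operation goes through acc."""
    acc = 0
    for op, addr in program:
        if op == "load":
            acc = mem[addr]
        elif op == "store":
            mem[addr] = acc
        elif op == "add":
            acc = acc + mem[addr]
        elif op == "sub":
            acc = acc - mem[addr]
        elif op == "mul":
            acc = acc * mem[addr]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return acc


def run_stack(program, mem):
    """Run zero-address instructions; binary ops compute (next OP top)."""
    binary = {
        "add": lambda nxt, top: nxt + top,
        "sub": lambda nxt, top: nxt - top,
        "mul": lambda nxt, top: nxt * top,
    }
    stack = []
    for instr in program:
        op = instr[0]
        if op == "push":
            stack.append(mem[instr[1]])
        elif op == "pop":
            mem[instr[1]] = stack.pop()
        elif op in binary:
            top = stack.pop()
            nxt = stack.pop()
            stack.append(binary[op](nxt, top))
        else:
            raise ValueError(f"unknown instruction: {op}")
    return stack[-1]


# The sequences from the slides, encoded as tuples (hypothetical encoding).
ACC_PROGRAM = [("load", "B"), ("mul", "C"), ("add", "A"), ("store", "D"),
               ("load", "A"), ("mul", "B"), ("sub", "D")]
STACK_PROGRAM = [("push", "A"), ("push", "B"), ("mul",), ("push", "A"),
                 ("push", "C"), ("push", "B"), ("mul",), ("add",), ("sub",)]

values = {"A": 2, "B": 3, "C": 4, "D": 0}   # sample data, chosen arbitrarily
acc_result = run_accumulator(ACC_PROGRAM, dict(values))
stack_result = run_stack(STACK_PROGRAM, dict(values))
expected = values["A"] * values["B"] - (values["A"] + values["C"] * values["B"])
```

Both interpreters touch memory on every operand reference, which is the "high memory traffic" drawback listed for accumulators.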

Stacks: Pros and Cons
• Pros
  – Good code density (implicit top of stack)
  – Low hardware requirements
  – Easy to write a simpler compiler for stack architectures

Stacks: Pros and Cons (cont.)
• Cons
  – Stack becomes the bottleneck
  – Little ability for parallelism or pipelining
  – Data is not always at the top of stack when needed
  – Difficult to write an optimizing compiler for stack architectures

Memory-Memory Architectures
• Instruction set:
  (3 operands)  add A, B, C    sub A, B, C    mul A, B, C
  (2 operands)  add A, B       sub A, B       mul A, B
• Example: A*B - (A+C*B)
  3 operands       2 operands
  mul D, A, B      mov D, A
  mul E, C, B      mul D, B
  add E, A, E      mov E, C
  sub E, D, E      mul E, B
                   add E, A
                   sub D, E    /* D ← A*B - (A+C*B) */

Memory-Memory: Pros and Cons
• Pros
  – Requires fewer instructions
  – Easy to write compilers for
• Cons
  – Very high memory traffic
  – Variable number of clocks per instruction
  – With two operands, more data movements are required

Register-Memory Architectures
• Instruction set:
  add R1, A     sub R1, A     mul R1, B
  load R1, A    store R1, A
• Example: A*B - (A+C*B)
  load R1, C
  mul R1, B      /* C*B */
  add R1, A      /* A + C*B */
  store R1, D
  load R2, A
  mul R2, B      /* A*B */
  sub R2, D      /* A*B - (A + C*B) */

Register-Memory: Pros and Cons
• Pros
  – Some data can be accessed without loading first
  – Instruction format easy to encode
  – Good code density
• Cons
  – Operands are not equivalent (poor orthogonality)
  – Variable number of clocks per instruction
  – May limit number of registers

Register-Register/Load-Store Architectures
• Instruction set:
  add R1, R2, R3    sub R1, R2, R3    mul R1, R2, R3
  load R1, A        store R1, A       move R1, R2
• Example: A*B - (A+C*B)
  load R1, A
  load R2, B
  load R3, C
  mul R7, R3, R2     /* C*B */
  add R8, R7, R1     /* A + C*B */
  mul R9, R1, R2     /* A*B */
  sub R10, R9, R8    /* A*B - (A+C*B) */

Register-Register/Load-Store: Pros and Cons
• Pros
  – Simple, fixed length instruction encodings
  – Instructions take similar number of cycles
  – Relatively easy to pipeline
• Cons
  – Higher instruction count
  – Dependent on good compiler
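To make the load-store trade-off concrete, the register-register example can be run on a toy interpreter that also counts data memory references. This is only a sketch under stated assumptions: `run_load_store`, the tuple encoding, the register count, and the sample values are hypothetical and do not correspond to any real ISA.

```python
def run_load_store(program, mem, num_regs=16):
    """Run load/store code; only load and store may touch data memory."""
    alu = {
        "add": lambda x, y: x + y,
        "sub": lambda x, y: x - y,
        "mul": lambda x, y: x * y,
    }
    regs = [0] * num_regs
    data_refs = 0                      # number of data memory accesses
    for op, *args in program:
        if op == "load":               # load Rd, A : Rd <- mem[A]
            rd, addr = args
            regs[rd] = mem[addr]
            data_refs += 1
        elif op == "store":            # store Rs, A : mem[A] <- Rs
            rs, addr = args
            mem[addr] = regs[rs]
            data_refs += 1
        elif op in alu:                # op Rd, Ra, Rb : Rd <- Ra op Rb
            rd, ra, rb = args
            regs[rd] = alu[op](regs[ra], regs[rb])
        else:
            raise ValueError(f"unknown instruction: {op}")
    return regs, data_refs


# The register-register sequence from the slide; registers are plain indices.
PROGRAM = [
    ("load", 1, "A"), ("load", 2, "B"), ("load", 3, "C"),
    ("mul", 7, 3, 2),      # C*B
    ("add", 8, 7, 1),      # A + C*B
    ("mul", 9, 1, 2),      # A*B
    ("sub", 10, 9, 8),     # A*B - (A + C*B)
]

values = {"A": 2, "B": 3, "C": 4}      # sample data, chosen arbitrarily
regs, data_refs = run_load_store(PROGRAM, dict(values))
result = regs[10]
```

The sequence needs only three data references, one per variable, whereas every one of the seven instructions in the register-memory version names a memory operand. The instruction count is the same here, but each load-store instruction is simpler to decode and pipeline.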

Registers: Advantages and Disadvantages
• Advantages
  – Faster than cache or main memory (no )
  – Deterministic (no misses)
  – Can replicate (multiple read ports)
  – Short identifier (typically 3 to 8 bits)
  – Reduce memory traffic

Registers: Advantages and Disadvantages (cont.)
• Disadvantages
  – Need to save and restore on procedure calls and context switch
  – Can’t take the address of a register (for pointers)
  – Fixed size (can’t store strings or structures efficiently)
  – Compiler must manage
  – Limited number

Current Trends: Computation Model
• Practically every modern design uses a load-store architecture
• For a new ISA design:
  – would expect to see load-store with plenty of general purpose registers

Byte Ordering
• Little Endian (e.g., in DEC, Intel)
  » low order byte stored at lowest address
  » byte0 byte1 byte2 byte3
• Big Endian (e.g., in IBM, Motorola, Sun, HP)
  » high order byte stored at lowest address
  » byte3 byte2 byte1 byte0
• Programmers/protocols should be careful when transferring binary data between Big Endian and Little Endian machines
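The byte-ordering warning can be illustrated with Python's standard `struct` module, where `<` selects little-endian and `>` selects big-endian layout. The 32-bit value below is an arbitrary example, not taken from the slides.

```python
import struct

value = 0x0A0B0C0D                      # arbitrary 32-bit example value

little = struct.pack("<I", value)       # low order byte at lowest address
big = struct.pack(">I", value)          # high order byte at lowest address

# Reading little-endian bytes as if they were big-endian silently yields a
# different number; this is the hazard when exchanging raw binary data.
misread = struct.unpack(">I", little)[0]

# Agreeing on one byte order for the exchange removes the ambiguity.
roundtrip = struct.unpack(">I", struct.pack(">I", value))[0]
```

Here `little` holds the bytes 0D 0C 0B 0A and `big` holds 0A 0B 0C 0D, so `misread` comes out as 0x0D0C0B0A rather than the original value.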

Operand Alignment
• An access to an operand of size s bytes at byte address A is said to be aligned if
  A mod s = 0
[Figure: a 4-byte operand D0 D1 D2 D3 shown at two positions over byte addresses 40 to 44]

Unrestricted Alignment
• If the architecture does not restrict memory accesses to be aligned then
  – Software is simple
  – Hardware must detect misalignment and make two memory accesses
  – Expensive logic to perform detection
  – Can slow down all references
  – Sometimes required for backwards compatibility

Restricted Alignment
• If the architecture restricts memory accesses to be aligned then
  – Software must guarantee alignment
  – Hardware detects misalignment access and traps
  – No extra time is spent when data is aligned
• Since we want to make the common case fast, having restricted alignment is often a better choice, unless compatibility is an issue.

Types of Addressing Modes (VAX)
   Addressing Mode       Example              Action
1. Register direct       Add R4, R3           R4 <- R4 + R3
2. Immediate             Add R4, #3           R4 <- R4 + 3
3. Displacement          Add R4, 100(R1)      R4 <- R4 + M[100 + R1]
4. Register indirect     Add R4, (R1)         R4 <- R4 + M[R1]
5. Indexed               Add R4, (R1 + R2)    R4 <- R4 + M[R1 + R2]
6. Direct                Add R4, (1000)       R4 <- R4 + M[1000]
7. Memory Indirect       Add R4, @(R3)        R4 <- R4 + M[M[R3]]

Types of Addressing Modes (cont.)
    Addressing Mode      Example                  Action
8.  Autoincrement        Add R4, (R2)+            R4 <- R4 + M[R2]
                                                  R2 <- R2 + d
9.  Autodecrement        Add R4, (R2)-            R4 <- R4 + M[R2]
                                                  R2 <- R2 - d
10. Scaled               Add R4, 100(R2)[R3]      R4 <- R4 + M[100 + R2 + R3*d]
• Studies by [Clark and Emer] indicate that modes 1-4 account for 93% of all operands on the VAX.

Use of Addressing Modes (VAX)
[Figure]
• Displacement dominates the memory addressing mode (32% - 55%).
• Immediate tails displacement (17% - 43%).

Displacement Value Distribution
[Figure]

Immediate Value Distribution
[Figure]
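The Action column of the addressing-mode table translates almost line for line into code. The sketch below fetches the memory or register operand for each of the ten modes and, as an extra assumption, traps on misaligned accesses in the way the Restricted Alignment slide describes (A mod s = 0). The names `fetch_operand`, `read_mem`, and `MisalignedAccess`, the dictionary-based machine state, and the sample contents are all hypothetical; `d` is the operand size in bytes.

```python
class MisalignedAccess(Exception):
    """Raised when an access violates A mod s = 0 (restricted alignment)."""


def read_mem(mem, address, size):
    """Read one operand, trapping if the address is not size-aligned."""
    if address % size != 0:
        raise MisalignedAccess(f"address {address} not aligned to {size} bytes")
    return mem[address]


def fetch_operand(mode, regs, mem, reg=None, index=None, const=0, d=4):
    """Return the source operand selected by an addressing mode."""
    if mode == "register_direct":
        return regs[reg]
    if mode == "immediate":
        return const
    if mode == "displacement":
        return read_mem(mem, const + regs[reg], d)
    if mode == "register_indirect":
        return read_mem(mem, regs[reg], d)
    if mode == "indexed":
        return read_mem(mem, regs[reg] + regs[index], d)
    if mode == "direct":
        return read_mem(mem, const, d)
    if mode == "memory_indirect":
        return read_mem(mem, read_mem(mem, regs[reg], d), d)
    if mode in ("autoincrement", "autodecrement"):
        # As in the table: use M[reg] first, then step the register by d.
        value = read_mem(mem, regs[reg], d)
        regs[reg] += d if mode == "autoincrement" else -d
        return value
    if mode == "scaled":
        return read_mem(mem, const + regs[reg] + regs[index] * d, d)
    raise ValueError(f"unknown addressing mode: {mode}")


# Sample machine state, chosen arbitrarily for the demonstration.
regs = {"R1": 100, "R2": 8, "R3": 200, "R5": 2}
mem = {8: 21, 100: 5, 108: 9, 116: 33, 200: 1000, 1000: 11}

displaced = fetch_operand("displacement", regs, mem, reg="R1", const=100)
indirect = fetch_operand("memory_indirect", regs, mem, reg="R3")
scaled = fetch_operand("scaled", regs, mem, reg="R2", index="R5", const=100)
```

With this state, the displacement access reads M[200], the memory-indirect access follows M[200] to M[1000], and the scaled access reads M[100 + 8 + 2*4].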

Current Trends: Memory Addressing
• For a new ISA design would expect to see:
  – support for displacement, immediate and register indirect addressing modes
    » 75% to 99% of SPEC measurements
  – size of address for displacement to be at least 12-16 bits
  – size of immediate field to be at least 8-16 bits
• As use of compilers becomes more dominant, emphasis is on simpler addressing modes

Types of Operations
• Arithmetic and Logic: AND, ADD
• Data Transfer: MOVE, LOAD, STORE
• Control: BRANCH, JUMP, CALL
• System: OS CALL
• Floating Point: ADDF, MULF, DIVF
• Decimal: ADDD, CONVERT
• String: MOVE, COMPARE
• Graphics: (DE)COMPRESS

Instruction Frequency (Intel 80x86 Integer)
[Figure]

Current Trends: Operations and Operands
• For a new ISA design would expect to see:
  – support for 8, 16, 32, (64) bit integers
  – support for 32 and 64 bit IEEE 754 floating point
  – emphasis on efficient implementation of the simple common operations

Control Flow Addressing Modes
• Need to specify destination address
  – usually explicitly
    » displacement relative to PC (why?)
  – sometimes not known statically
    » specify register containing address

Comparisons in Conditional Branches (Alpha)
[Figure]

Jump Distances (Alpha)
[Figure]
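One way to approach the "(why?)" about PC-relative displacements is to compute branch targets directly. The sketch below assumes a convention in which the displacement counts instructions from the instruction after the branch, with 4-byte instructions; real ISAs differ in these details, and both function names are hypothetical.

```python
def fits_signed(value, bits):
    """True if value is representable as a two's complement field of `bits` bits."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def branch_target(pc, displacement, instr_bytes=4):
    """Target of a PC-relative branch (assumed convention, see lead-in)."""
    return pc + instr_bytes + displacement * instr_bytes


# The same loop loaded at two different base addresses (arbitrary values).
# The branch sits 0x20 bytes into the code and jumps back 3 instructions.
offsets = []
for base in (0x1000, 0x8000):
    target = branch_target(base + 0x20, -3)
    offsets.append(target - base)       # where the target falls within the code

position_independent = offsets[0] == offsets[1]
short_field_ok = fits_signed(-3, 8)     # a nearby target fits in 8 bits
```

The encoded displacement is identical wherever the code is loaded, so the code is position independent, and because most targets are near the branch, the displacement fits in a short field instead of a full address.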

Procedure Call/Return
• Need to save return address somewhere (minimum)
  – special register or GPR
• CISC architectures tended to provide more complex support
  – e.g. saving whole batch of registers
  – now just get the compiler to generate code for this

Procedure Call/Return (cont.)
• Caller save
  – calling procedure saves registers it will need afterwards
• Callee save
  – called procedure saves registers it needs to use

CALLS Instruction (VAX)
• Callee save
1. Align stack if needed
2. Push argument count on stack
3. Save registers indicated by procedure call mask on stack
4. Push return address on stack, push top & base pointers

CALLS Instruction (cont.)
5. Clear condition codes
6. Push status word
7. Update stack pointers
8. Branch to first instruction

Current Trends: Control Flow
• For a new ISA design would expect to see:
  – conditional branches able to jump hundreds of instructions forwards or backwards
    » PC-relative branch displacement of at least 8 bits
  – jumps using PC-relative and register indirect addressing

Encoding Instruction Set
• Encoding of instructions affects:
  – size of compiled program
  – decoding work required by processor
• Depends on:
  – range of addressing modes
  – degree of independence between opcodes and modes

Three Competing Forces
• Desire to have as many registers and addressing modes as possible
• Desire to reduce the size of instructions and programs
• Desire to ease the decoding and implementation of instructions

Three Basic Variations
• Variable instruction length (VAX)
  – independence between addressing mode and opcode (operation)
  – use address specifier: (mode, reg)
• Fixed instruction length (MIPS)
  – small number of addressing modes
  – use opcode to imply the addressing mode
• Hybrid approach
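As an illustration of fixed-length encoding, the sketch below packs and unpacks the MIPS32 R-type format, whose 32-bit layout is opcode (6 bits), rs (5), rt (5), rd (5), shamt (5), funct (6). The field layout is the standard MIPS one but is not given in the slides; the helper names are hypothetical.

```python
def encode_r_type(rs, rt, rd, shamt, funct, opcode=0):
    """Pack R-type fields into one 32-bit instruction word."""
    return ((opcode << 26) | (rs << 21) | (rt << 16)
            | (rd << 11) | (shamt << 6) | funct)


def decode_r_type(word):
    """Unpack a 32-bit word; every field sits at a fixed bit position."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }


# add $t0, $t1, $t2 : rd = 8 ($t0), rs = 9 ($t1), rt = 10 ($t2), funct = 0x20
word = encode_r_type(rs=9, rt=10, rd=8, shamt=0, funct=0x20)
fields = decode_r_type(word)
```

Because the register fields are always in the same bits, the decoder can start reading registers before it has finished interpreting the opcode. A variable-length encoding with (mode, reg) address specifiers must instead parse operands one after another.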

Three Basic Variations (cont.)
[Figure]

Current Trends: Encoding Instruction Set
• For a new ISA design would expect to see:
  – fixed length instruction encoding, or hybrid
  – choice will be dictated by previous design decisions

Performance Ratio between MIPS and VAX
[Figure]

Why does MIPS beat VAX?
• IC_MIPS is about 2 × IC_VAX
• CPI_MIPS is about CPI_VAX / 6
• MIPS is about 3 times as fast as VAX given the same clock cycle times
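The factor of 3 follows from the CPU time equation, time = instruction count × CPI × clock cycle time. A minimal check, using normalized values that are assumptions chosen only to match the stated ratios:

```python
def cpu_time(instruction_count, cpi, cycle_time):
    """CPU time = instruction count x cycles per instruction x cycle time."""
    return instruction_count * cpi * cycle_time


# Normalized, hypothetical VAX figures; only the ratios matter.
ic_vax, cpi_vax, cycle_time = 1.0, 6.0, 1.0

ic_mips = 2 * ic_vax          # MIPS executes about twice as many instructions
cpi_mips = cpi_vax / 6        # but each takes about one sixth of the cycles

speedup = (cpu_time(ic_vax, cpi_vax, cycle_time)
           / cpu_time(ic_mips, cpi_mips, cycle_time))
```

The doubled instruction count costs a factor of 2 and the lower CPI gains a factor of 6, giving a net speedup of 6 / 2 = 3.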

The Role of Compiler
• Why is the compiler important to computer designers?
• Instruction set architecture is the target of the compiler.
  – Instruction set architecture should allow compilers to generate efficient code.

The Role of Compiler (cont.)
• Compiler optimization affects the performance of the code generated.
  – Instruction set architecture should help simplify compilers and allow efficient code.
  – Computer designers should know the limitation of compiler optimization.

Compiler Structure
[Figure]

Compiler Goals
• Correctness
• Speed of compiled code
• Speed of compiler
• Debugging support

Optimizations
• High-level
  – Procedure integration: procedure in-lining
• Local: within basic block (linear code fragment)
  – Common subexpression elimination (18%)
  – Constant propagation (22%)
  – Stack height reduction

Optimizations (cont.)
• Global: across branches, loop optimization
  – Global common subexpression elimination (13%)
  – Copy propagation (11%)
  – Code motion (16%)
  – Induction variable elimination (2%)
• Machine-Dependent
  – Strength reduction
  – Pipeline scheduling
  – Branch offset optimization

Effectiveness of Optimization
[Figure]

Limitation of Compiler Optimization
• Phase-ordering problem
  – too hard, and expensive, to undo transformation steps
  – high-level transformations carried out before form of resulting code is known
• Aliasing prevents allocating registers to variables
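As an illustration of a local optimization, here is a deliberately simplified common subexpression elimination pass over one basic block of three-address code. It is a sketch, not how any particular compiler implements it: the `(dest, op, a, b)` tuple format, the `"copy"` pseudo-operation, and the function name are assumptions, and operand order is compared literally, so commutativity is ignored.

```python
def local_cse(block):
    """Replace a repeated expression in a basic block with a copy.

    block is a list of (dest, op, a, b) tuples meaning dest = a op b.
    """
    available = {}                       # (op, a, b) -> variable holding it
    out = []
    for dest, op, a, b in block:
        key = (op, a, b)
        reused = available.get(key)
        if reused is not None:
            out.append((dest, "copy", reused, None))
        else:
            out.append((dest, op, a, b))
        # dest is overwritten: drop expressions that read it or live in it.
        available = {k: v for k, v in available.items()
                     if v != dest and dest not in k[1:]}
        if reused is None and dest not in (a, b):
            available[key] = dest
    return out


block = [
    ("t1", "+", "a", "b"),
    ("t2", "+", "a", "b"),      # same expression: becomes a copy of t1
    ("t3", "*", "t1", "t2"),
    ("a", "+", "t3", "1"),      # redefines a, so a + b is no longer available
    ("t4", "+", "a", "b"),      # must be recomputed
]
optimized = local_cse(block)
```

Only the second instruction is rewritten. The last one looks identical to the first but reads a new value of `a`, which is why the pass has to track redefinitions within the block.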

How the Architect Can Help the Compiler Writer
• Factors behind compiler complexity:
  – programs are locally simple but globally complex
  – structure of compilers means optimization decisions are made linearly
• Make the frequent cases fast and the rare case correct...

How the Architect Can Help the Compiler Writer (cont.)
• Graph coloring for register allocation is more efficient with more than 16 registers.
• Aliasing limits the use of a large number of registers

How the Architect Can Help the Compiler Writer (cont.)
• Desirable ISA properties
  – Regularity
    • Orthogonality among addressing mode, data types, and operations
  – Simplicity
    • Provide primitives, not solutions
  – Simplify tradeoffs
    • That the compiler has to make
  – Provide instructions that bind the quantities known at compile-time
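The remark about graph coloring can be made concrete with a small register allocator. The sketch below uses a simple greedy heuristic (highest-degree variable first) rather than the full graph-coloring algorithms used in production compilers; the interference graph, the variable names, and the function name are all hypothetical.

```python
def color_registers(interference, num_regs):
    """Greedily assign registers so that interfering variables differ.

    interference maps each variable to the set of variables live at the
    same time. Variables that cannot be colored are reported as spilled.
    """
    assignment = {}
    spilled = []
    order = sorted(interference, key=lambda v: len(interference[v]),
                   reverse=True)
    for var in order:
        used = {assignment[n] for n in interference[var] if n in assignment}
        free = [r for r in range(num_regs) if r not in used]
        if free:
            assignment[var] = free[0]
        else:
            spilled.append(var)          # would have to live in memory
    return assignment, spilled


# Hypothetical interference graph: a, b, c are mutually live; d overlaps c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
with_three, spills_three = color_registers(graph, num_regs=3)
with_two, spills_two = color_registers(graph, num_regs=2)
```

With three registers every variable gets one. With two, the mutually interfering a, b, and c cannot all be colored and one is spilled to memory, which is the cost that a larger register file helps avoid.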