0915 RISC-V 50 Years Computer Arch

Home , Control store, Cycles per instruction, Microcode

50 years of Computer Architecture

David Patterson [email protected] http://www.riscv.org May, 2017

1 IBM Compatibility Problem in early 1960s

By early 1960’s, IBM had 4 incompatible lines of computers! 701 ® 7094 650 ® 7074 702 ® 7080 1401 ® 7010

Each system had its own • Instruction set • I/O system and Secondary Storage: magnetic tapes, drums and disks • Assemblers, compilers, libraries,... • Market niche: business, scientific, real time, ...

Þ IBM System/360 – one ISA to rule them all 2 IBM 360: A Computer Family

Model 30 . . . Model 70 Storage 8K - 64 KB 256K - 512 KB Datapath 8-bit 64-bit Circuit Delay 30 nsec/level 5 nsec/level Registers Main Store Transistor Registers The IBM 360 is why bytes are 8-bits long today! IBM 360 instruction set architecture (ISA) completely hid the underlying technological differences between various models. Milestone: The first true ISA designed as portable hardware- software interface!

With minor modifications it still survives today! 3 IBM System/360 Reference Card (“Green card”)

OR (cl OC m' FLOA~NEP~INTFEATURE 'INSTRUCTIONS' Sy~t8m/~~ PaGk PAW F2 I%- Dinn4b#) RDD 86 AddN~allted,ExtenddJc.x) . , AXR 36 , RR I R~hrmcaData Add Normdized, Long (c] IBM SM-m MsDk (n) SPM 04 ADR 2A RR Srorlge Key 4e.d Fdd Normdid, LO~BIc) , AD 8A RX - SSK 08 I @ at %s@~nW fpi SSM 80 Add Norrndized, Short (01 AER 3A RR ' MACHINE IWTRUCTIONhS 1 Shift Left Dwble (c) SLDA 8F add Nonnalizad, Short Id AE 7A RX shift bft~oubfe ~owical SLDL 8D Add Unnormalized, Long (c) AWR 2E RR IUm MlWDlllG *;hm~%!#efc) SLA 88 Add Unnormaltzad, Long (c) AW 6E RX Add Unnomralf*ed, Short (c) AUR 3E RR Apkl (0) AR SMLM Single Logical SLL 89 Wbc3 A RWt Dmlbk bl SRDA 8E Add Unnomalited, WrtICY AU 7E KX Add Deeimd (e,d) AP Shift RwtDoubh Lagical SRDL 8C Compare, Low (d CDR 29 RR Akl Halfvvarel (cl AH ah R@ht $Me Id SRA 8A Cmpm Long (c) CD 69 RX Compare, Short (c) CER 39 RR Add Logical (d ALR Wcl Rwt Qry& Logical SRL 88 IfQ Y)Qd 510 9C Compare, Short Ic) CE 76 RX add CDBical (d AL DiviL, Long DDR 2D RR AND (c) NR Sore Sf 50 Dilde, Long DD 60 RX ' AND (c) N store C~Q~~=QC STC 42 AND (cl Ni St- hl&wd STH So Divide, Short DER 30 RR DM&, short DE 7D RX AND (4 Nc m 90 Bmmh and Link S£mMuSrir Oontrol (eip) STMC 60 Wve. Long HDR 24 RR BALR Halve, Shon HER 34 RR Branch md Link BAL fbbmlmU-- SR 16 Load and Teat, Long (c) LTDR RR Branch adStare (el WR SdNmft td S 68 a Load md Test, Short (c) LTER 32 RR Branch Md Srwe (el BAS 6ubfm&99dm@ lc,d SP FB Load Conplment, Long (c) LCDR 23 RR I BrMCGonCondftion rnR QbtFW Wdfword (6) SH 48 Branah on C;ondEth BIC 8k4wW t&!&d k) StR IF Load Complement Short (cl LCER RR BfZmhahCeunt BCTA wm@Loeieal (el St SF L-4 Long LDR 28 RR Bmnch on count 867 slwbuk%~m SVC OA Load, Long LD 68 RX on Idem H5gh 8XH mMdwl(c) TS 93 Load NNegative. Long (cl LNDR 21 RR BnncSronlhdexL~arEqusl BXLE 'fMIW4%~) TCH SF Load Negative, SfwR (c) LNER 31 RR CR %Hi€l& h1 TI0 9D Load Positive, Long (c) LPDR 20 RR Conrpors (4 Load Padtiva, Shart (c) LPER RR Campare C ~~.tr~!Wlvsk (cr) ' TM 91 30 rc) lQad Rwnded, Extended to Long (x) LRDR 26 RR fawxm Dschapd bJ) bP TR DC cMns1*rr,HsWwdnd(cr) ct.E nambbddT~ TPlT OD Load Rounded, Long to Short Lx) LRER 3S RR Laad, Short LER 38 RR Compnn hmtat CCR m UNPK F3 CL WRD 84 Load, Short LE 78 RX eongrr?oLe?iwW w@Em&? ~~~ Extended 1x1 MXR 26 RR '* (GRf) ZAP F8 Multiply. Wnmv bqhall I4 CLC ZmQ Multiply, tong MDR 2C RR ' CL4 . -mm Laded Id Multiply, Long , MD 6C RX CV& NQTW @OWPAWLS 1-3 Co-t(P~ code is load& Multiply, LongIExtended. (XI MXDR RR cwwM.ta kwmal CVD r ~mrmttafemm d. Decimel feamte n & Waa$oanW fentwm e. Model67 Privileged instruction MulrOpIy, LonaIExtended (x) mu 87 RX mwm- I(d Multiply, Short DhFkk m o. km&tifmctoFla it set n. New condition Extended pwcision * MER 3C RR Mi D floating point feature Multiply, Short ME 7C RX Mvlrs *al Idl w Store, Long STD 60 RX ~ditjt?,d)\ ED #i&CMl#E FORMATS ST€ 70 RX Rl,D2lX2,B2) EWC Editarralnkk 0 H&@WQFID 1 SECOND HRLFWORD 2 THIRD HALFWORD 3 E&qh 951 tc) XR 1 1 SDR 28 RR R1.W Em#wiv# OR dc) X SD 68 RX Rl.M(X2.62~ Rl.R2 Ewhsb OR Ce) ' SER 38 RR xi SE 78 RX Rl,D2tX2,02) I EfdwbmQA le) XC E&-Y EX llQ) tc&J HI0 fnaort Ghlmer IC KgY (w) 1SK Lwd LR --LBlu$ L LQadMdw8, LA eapdaml Twrt (4 LTR Lod lzawment (c) LCR Lordxdiliword LH mu^ I& Lod IrduWe Coharol (@a) LMC ' ~~w(C) LNR Loai aorith 4W LPR Losd PSW lFl(h,p) LPmY Losd Reid Ackkm (c,e$l LRA MVI MVC - WN !Eswmda ~uswithOfkDIst AII"VQ MamzoMa MWZ MulOb MR value of addms MMl*lv M Mu* Decbd tdb w AfiutalplVHsitmxd MH aR OR OR fc) 0 OR (d 01 4 Control versus Datapath

▪Processor designs can be split between datapath, where numbers are stored and arithmetic operations computed, and control, which sequences operations on datapath ▪Biggest challenge for early computer designers was getting control circuitry correct Control § Maurice Wilkes invented the Instruction Control Lines Condition? idea of microprogramming to design the control unit of a processor, 1958 - Logic expensive compared to PC

ALU ROM or RAM Datapath Registers Inst. Reg. - ROM cheaper than RAM - ROM much faster than RAM Busy? Address Data

Main Memory

5 Microprogramming in IBM 360

Model M30 M40 M50 M65 Datapath width 8 bits 16 bits 32 bits 64 bits Microcode size 4k x 50 4k x 52 2.75k x 85 2.75k x 87 Clock cycle time (ROM) 750 ns 625 ns 500 ns 200 ns Main memory cycle time 1500 ns 2500 ns 2000 ns 750 ns Annual rental fee (1964 $) $48,000 $54,000 $115,000 $270,000 Annual rental fee (2015 $) $570,000 $650,000 $1,400,000 $3,200,000

6 IC technology, Microcode, and CISC

▪ Logic, RAM, ROM all implemented using MOS transistors ▪ Semiconductor RAM ≈ same speed as ROM ▪ With Moore’s Law, memory for control store could grow ▪ Allowed more complicated instruction sets (CISC) ▪ Minicomputer (TTL server) Example: Digital Equipment Corp. (DEC) VAX ISA in 1978

7 Microprocessor Evolution

▪ Rapid progress in 1970s, fueled by advances in MOS technology, imitated minicomputers and mainframe ISAs ▪ Intel i432 - Most ambitious 1970s micro - started in 1975 - released 1981 - 32-bit capability-based object-oriented architecture - Instructions variable number of bits long - Heavily microcoded - Severe performance, complexity, and usability problems ▪ Intel 8086 (1978, 8MHz, 29,000 transistors) - “Stopgap” 16-bit processor, architected in 10 weeks - Extended accumulator architecture - Assembly-compatible with 8080 - 20-bit addressing through segmented addressing scheme ▪ IBM PC 1981 picks Intel 8088 for 8-bit bus (and Motorola 68000 was late) - Estimated sales of 250,000; 100,000,000s sold 8 8 Analyzing Microcoded Machines 1980s ▪ John Cocke and group at IBM - Working on a simple pipelined processor, 801 minicomputer (ECL server), and advanced compilers inside IBM - Ported experimental PL.8 compiler to IBM 370, only used simple register-register and load/store instructions similar to 801 - Code ran faster than other existing compilers that used all 370 instructions! - Up to 6 MIPS whereas 2 MIPS considered good before ▪ Emer and Clark at DEC in early 1980s - Found 20% of VAX instructions responsible for 60% of microcode, but only account for 0.2% of execution time! ▪ Patterson 1979 sabbatical at DEC - VAX microcode bugs ⇒ field repair, but field-repairable chips don’t make sense 9 From CISC to RISC

▪ Use fast RAM to build fast instruction cache of user- visible instructions, not fixed hardware microroutines - Contents of fast instruction memory change to fit what application needs right now ▪ Use simple ISA to enable hardwired pipelined implementation - Compiled code only used a few CISC instructions - Simpler encoding allowed pipelined implementations ▪ Further benefit with integration - In early ‘80s, could finally fit 32-bit datapath + small caches on a single chip - No chip crossings in common case allows faster operation

10 “Iron Law” of Processor Performance

Time = Instructions Clock cycles __Time___ Program Program * Instruction * Clock cycle

▪ Instructions per program depends on source code, compiler technology, and ISA ▪ Clock cycles per instructions (CPI) depends on ISA and underlying microarchitecture ▪ Time per clock cycle depends upon the microarchitecture and base technology ▪ RISC executes more instructions per program, but many fewer clock cycles per instruction (CPI) ⇒ RISC faster than CISC

11 Berkeley RISC Chips RISC-I (1982) Contains 44,420 transistors, fabbed in 5 µm NMOS, with a die area of 77 mm2, ran at 1 MHz

RISC-II (1983) contains 40,760 transistors, was fabbed in 3 µm NMOS, ran at 3 MHz, and the size is 60 mm2

12 Stanford RISC Project: MIPS

Stanford MIPS (1983) contains 25,000 transistors, was fabbed in 3 µm & 4 µm NMOS, ran at 4 MHz (3 µm ), and size is 50 mm2 (4 µm) (Microprocessor without Interlocked Pipeline Stages) 13 CISC vs. RISC Today

● PC Era ● PostPC Era: Client/Cloud ● Hardware translates ● IP in SoC vs. MPU x86 instructions into ● Value die area, energy as internal RISC much as performance instructions ● > 18B / year in 2016! ● Then use any RISC ● 98% RISC Processors: technique inside MPU 15.0B ARM (Advanced ● > 350M / year ! RISC Machine) ● x86 ISA eventually 2.0B Tensilica dominates servers as 1.5B ARC (Argonaut well as desktops RISC Core) 0.8B MIPS 14 VLIW: Very Long Instruction Word

Int Op 1 Int Op 2 Mem Op 1 Mem Op 2 FP Op 1 FP Op 2

Two Integer Units, Single Cycle Latency Two Load/Store Units, Three Cycle Latency Two Floating-Point Units, Four Cycle Latency ▪ Multiple operations packed into one instruction ▪ Each operation slot is for a fixed function ▪ Constant operation latencies are specified ▪ Architecture requires guarantee of: - Parallelism within an instruction => no cross-operation RAW check - No data use before data ready => no data interlocks

15 VLIW Compiler Responsibilities ▪ Schedule operations to maximize parallel execution

▪ Guarantees intra-instruction parallelism

▪ Schedule to avoid data hazards (no interlocks) - Typically separates operations with explicit NOPs

16 Loop Unrolling for (i=0; i

17 Scheduling Loop Unrolled Code Unroll 4 ways loop: fld f1, 0(x1) Int1 Int 2 M1 M2 FP+ FPx fld f2, 8(x1) loop fld f3, 16(x1) : fld f1 fld f4, 24(x1) fld f2 add x1, 32 fld f3 fadd f5, f0, f1 add x1 fld f4 fadd f5 fadd f6, f0, f2 Schedule fadd f6 fadd f7, f0, f3 fadd f7 fadd f8, f0, f4 fadd f8 fsd f5, 0(x2) fsd f5 fsd f6, 8(x2) fsd f6 fsd f7, 16(x2) fsd f7 fsd f8, 24(x2) add x2 bne fsd f8 add x2, 32 bne x1, x3, loop How many FLOPS/cycle? 4 fadds / 11 cycles = 0.36 18 Intel Itanium, EPIC IA-64 ▪ EPIC is the style of architecture - “Explicitly Parallel Instruction Computing” - A binary object-code-compatible VLIW - Developed jointly with HP ▪ IA-64 was Intel’s chosen 64b ISA successor to 32b x86 - IA-64 = Intel Architecture 64-bit - AMD wouldn’t be able to make, unlike x86 ▪ Intel Merced was first Itanium implementation - 1st customer shipment expected 1997 (actually 2001) - McKinley, 2nd implementation, 180 nm, shipped in 2002 - Poulson, most recent, 8 cores, 32 nm, shipped in 2012

19 VLIW Issues and an “EPIC Failure” ▪ Unpredictable branches ▪ Variable memory latency (unpredictable cache misses) ▪ Code size explosion ▪ Compiler complexity: “The Itanium approach...was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write.” - Donald Knuth, Stanford ▪ Columnist Ashlee Vance noted delays and under performance of Itanium “turned the product into a joke in the chip industry”

Itanimum => “Itanic” (like infamous ship Titanic) 20 Consensus on ISAs Today

▪ Not CISC: no new commercial CISC ISAs in 30+ years ▪ Not VLIW: Despite several attempts, VLIW has failed in general-purpose computing arena - Complex VLIW architectures close to in-order superscalar in complexity, no real advantage on large complex apps - Although VLIWs successful in embedded DSP market (Simpler VLIWs, more constrained, friendlier code) ▪ RISC! Widespread agreement (still) that RISC principles are best for general purpose ISA

21 2000s: How Should We Build Scalable Multiprocessors? 1. Shared Memory with "Non Uniform Memory Access" time (NUMA) using loads and stores - Distributed directory remembers sharing for coherency and consistency - DASH/FLASH projects at Stanford (1992-2000) 2. Message passing Cluster with separate address space per processor using RPC (or MPI) - Collection of independent computers connected by LAN switches to provide a common service - Network of Workstations project at Berkeley (1993-1998) 22 2000s: How Should We Build Scalable Multiprocessors?

Switch

CPU CPU CPU CPU Mem Mem … Mem Mem

23 SGI Origin 2800 NUMA vs. NOW-2

● Up to 128 CPUs ● 105 Sun workstations ● Scalable BW crucial (w. 2 disks) + Myrinet ● 12 Gbit/s Numalink2 switched system area ● Scientific computation network 1 Gbit/s per link ● ≈$10M ● Search + Scientific ● ≈$1M 24 NUMA Advantages

▪ Ease of programming when communication patterns are complex or vary dynamically during execution

▪ Ability to develop apps using familiar SMP model

▪ Lower communication overhead, better use of BW for small items due to implicit communication

▪ HW-controlled caching to reduce remote communication by caching of all data

25 Cluster Drawbacks

▪ Cost of administering a cluster of N machines ~ administering N independent machines vs. cost of administering a shared address space N processors multiprocessor ~ administering 1 big machine ▪ Clusters usually connected using I/O bus, whereas multiprocessors usually connected on memory bus ▪ Cluster of N machines has N independent memories and N copies of OS and code, but a shared address multi-processor allows 1 program to use almost all memory

26 Cluster Advantages

▪ Error isolation: separate address space limits contamination of error

▪ Repair: Easier to replace a machine without bringing down the system

▪ Scale: easier to expand the system

▪ Cost: Large scale machine has low volume ⇒ fewer machines to spread development costs vs. leverage high volume off-the-shelf switches and computers

▪ Inktomi first then Amazon, AOL, Google, Hotmail, WebTV, Yahoo … relied on clusters of PCs to provide services used by millions of people every day 27 28 A Golden Age in Microprocessor Design Proprietary + Confidential

• Stunning progress in microprocessor design 40 years ≈ 106x faster! • Three architectural innovations (~1000x) • Width: 8->16->32 ->64 bit (~8x) • Instruction level parallelism: • 4-10 clock cycles per instruction to 4+ instructions per clock cycle (~10-20x) • Multicore: 1 processor to 16 cores (~16x)

• Clock rate: 3 to 4000 MHz (~1000x thru technology & architecture)

• Made possible by IC technology: • Moore’s Law: growth in transistor count (2X every 1.5 years) • Dennard Scaling: power/transistor shrinks at same rate as transistors are added (constant per mm2 of silicon)

Source: John Hennessy, “The Future of Microprocessors,”Source: Lorem ipsum dolorStanford sit amet, University,consectetur adipiscing March elit. Duis 16, non 201729 erat sem Changes Converge to End the Golden Age Proprietary + Confidential

• Technology • End of Dennard scaling: power becomes the key constraint • Slowdown (retirement) of Moore’s Law: transistors cost

• Architectural • Limitation and inefficiencies in exploiting instruction level parallelism end the uniprocessor era in 2004 • Amdahl’s Law and its implications end “easy” multicore era • Products • PC/Server ⇒ Client/Cloud

Source: John Hennessy, “The Future of Microprocessors,”Source: Lorem ipsum Stanford dolor sit amet, University, consectetur adipiscing March elit. 16,Duis 2017non30 erat sem End of Growth of Performance?

End of Moore’s Law Am- ⇒ dahl’s 2X / Law End of ⇒ 20 yrs (3%/yr) Dennard 2X / Scaling 6 yrs ⇒ (12%/yr) Multicore 2X / 3.5 RISC yrs CISC 2X / 1.5 (23%/yr) 2X / 3.5 yrs yrs (22%/yr) (52%/yr)

Based on SPECintCPU. Source: John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e. 201831 What’s Left?

Since ▪ Transistors not getting much better ▪ Power budget not getting much higher ▪ Already switched from 1 inefficient processor/chip to N efficient processors/chip Only path left is Domain Specific Architectures ▪ Just do a few tasks, but extremely well ▪ E.g., Google Tensor Processing Unit (TPU) 30 – 80X performance/Watt vs GPU & CPU Want open, modular instruction set with good software stack and enough opcode space for many special purpose instructions 32 A Second Architecture Golden Age

▪ End of Dennard scaling and Moore’s Law => improving cost-energy-performance requires innovation in architecture ▪ Potential upside of domain-specific architectures and synergy with domain-specific programming languages - E.g., Google TensorFlow and Google TPU ▪ Innovators can demonstrate ideas with FPGAs and custom chips instead of software simulators - Reduced software non-recurring engineering costs with open instruction set (e.g., RISC-V) - Reduced hardware NRE with open IP blocks (FreeChips) - Better productivity via Agile hardware development & tools (e.g., Chisel) - Cheap test chips: 100 3mm2 TSMC chips for $US 33 30,000