RISC-V • 1980S: RISC • Case for Open Isas • 1990S: VLIW • Tour of RISC-V ISA • 2000S: NUMA Vs
Total Page:16
File Type:pdf, Size:1020Kb
Past and Future Trends in Architecture and Hardware David Paerson [email protected] SOSP History Day October 3, 2015 1 Outline Part I - Past Part II – Future 50 years of Computer HW Technology Architecture History: • End of Moore’s Law • 1960s: • Flash vs. Disks Computer Families / • Fast DRAM Microprogramming • Crosspoint NVRAM • 1970s: CISC Open ISA & RISC-V • 1980s: RISC • Case for Open ISAs • 1990s: VLIW • Tour of RISC-V ISA • 2000s: NUMA vs. • RISC-V Software Stack Clusters • RISC-V Chips 2 IBM Compatibility Problem in early 1960s By early 1960’s, IBM had 4 incompatible lines of computers! 701 → 7094 650 → 7074 702 → 7080 1401 → 7010 Each system had its own • Instruction set • I/O system and Secondary Storage: magnetic tapes, drums and disks • Assemblers, compilers, libraries,... • Market niche: business, scientific, real time, ... ⇒ IBM System/360 – one ISA to rule them all 3 IBM 360: A Computer Family Model 30 . Model 70 Storage 8K - 64 KB 256K - 512 KB Datapath 8-bit 64-bit Circuit Delay 30 nsec/level 5 nsec/level Registers Main Store Transistor Registers The IBM 360 is why bytes are 8-bits long today! IBM 360 instrucon set architecture (ISA) completely hid the underlying technological differences between various models. Milestone: The first true ISA designed as portable hardware- soKware interface! With minor modifica>ons it s>ll survives today! 4 IBM System/360 Reference Card (“Green card”) OR (cl OC m' FLOA~NEP~INTFEATURE 'INSTRUCTIONS' Sy~t8m/~~ PaGk PAW F2 I%- Dinn4b#) RDD 86 AddN~allted,ExtenddJc.x) . , AXR 36 , RR I R~hrmcaData Add Normdized, Long (c] IBM SM-m MsDk (n) SPM 04 ADR 2A RR Srorlge Key 4e.d Fdd Normdid, LO~BIc) , AD 8A RX - SSK 08 I @ at %s@~nW fpi SSM 80 Add Norrndized, Short (01 AER 3A RR ' MACHINE IWTRUCTIONhS 1 Shift Left Dwble (c) SLDA 8F add Nonnalizad, Short Id AE 7A RX shift bft~oubfe ~owical SLDL 8D Add Unnormalized, Long (c) AWR 2E RR IUm MlWDlllG *;hm~%!#efc) SLA 88 Add Unnormaltzad, Long (c) AW 6E RX Add Unnomralf*ed, Short (c) AUR 3E RR Apkl (0) AR SMLM Single Logical SLL 89 Wbc3 A RWt Dmlbk bl SRDA 8E Add Unnomalited, WrtICY AU 7E KX Add Deeimd (e,d) AP Shift RwtDoubh Lagical SRDL 8C Compare, Low (d CDR 29 RR Akl Halfvvarel (cl AH ah R@ht $Me Id SRA 8A Cmpm Long (c) CD 69 RX Compare, Short (c) CER 39 RR Add Logical (d ALR Wcl Rwt Qry& Logical SRL 88 IfQ Y)Qd 510 9C Compare, Short Ic) CE 76 RX add CDBical (d AL DiviL, Long DDR 2D RR AND (c) NR Sore Sf 50 Dilde, Long DD 60 RX ' AND (c) N store C~Q~~=QC STC 42 AND (cl Ni St- hl&wd STH So Divide, Short DER 30 RR DM&, short DE 7D RX AND (4 Nc m 90 Bmmh and Link S£mMuSrir Oontrol (eip) STMC 60 Wve. Long HDR 24 RR BALR Halve, Shon HER 34 RR Branch md Link BAL fbbmlmU-- SR 16 Load and Teat, Long (c) LTDR RR Branch adStare (el WR SdNmft td S 68 a Load md Test, Short (c) LTER 32 RR Branch Md Srwe (el BAS 6ubfm&99dm@ lc,d SP FB Load Conplment, Long (c) LCDR 23 RR I BrMCGonCondftion rnR QbtFW Wdfword (6) SH 48 Branah on C;ondEth BIC 8k4wW t&!&d k) StR IF Load Complement Short (cl LCER RR BfZmhahCeunt BCTA wm@Loeieal (el St SF L-4 Long LDR 28 RR Bmnch on count 867 slwbuk%~m SVC OA Load, Long LD 68 RX on Idem H5gh 8XH mMdwl(c) TS 93 Load NNegative. Long (cl LNDR 21 RR BnncSronlhdexL~arEqusl BXLE 'fMIW4%~) TCH SF Load Negative, SfwR (c) LNER 31 RR CR %Hi€l& h1 TI0 9D Load Positive, Long (c) LPDR 20 RR Conrpors (4 Load Padtiva, Shart (c) LPER RR Campare C ~~.tr~!Wlvsk (cr) ' TM 91 30 rc) lQad Rwnded, Extended to Long (x) LRDR 26 RR fawxm Dschapd bJ) bP TR DC cMns1*rr,HsWwdnd(cr) ct.E nambbddT~ TPlT OD Load Rounded, Long to Short Lx) LRER 3S RR Laad, Short LER 38 RR Compnn hmtat CCR m UNPK F3 CL WRD 84 Load, Short LE 78 RX eongrr?oLe?iwW w@Em&? ~~~ Extended 1x1 MXR 26 RR '* (GRf) ZAP F8 Multiply. Wnmv bqhall I4 CLC ZmQ Multiply, tong MDR 2C RR ' CL4 . -mm Laded Id Multiply, Long , MD 6C RX CV& NQTW @OWPAWLS 1-3 Co-t(P~ code is load& Multiply, LongIExtended. (XI MXDR RR cwwM.ta kwmal CVD r ~mrmttafemm d. Decimel feamte n & Waa$oanW fentwm e. Model67 Privileged instruction MulrOpIy, LonaIExtended (x) mu 87 RX mwm- I(d Multiply, Short DhFkk m o. km&tifmctoFla it set n. New condition Extended pwcision * MER 3C RR Mi D floating point feature Multiply, Short ME 7C RX Mvlrs *al Idl w Store, Long STD 60 RX ~ditjt?,d)\ ED #i&CMl#E FORMATS ST€ 70 RX Rl,D2lX2,B2) EWC Editarralnkk 0 H&@WQFID 1 SECOND HRLFWORD 2 THIRD HALFWORD 3 E&qh 951 tc) XR 1 1 SDR 28 RR R1.W Em#wiv# OR dc) X SD 68 RX Rl.M(X2.62~ Rl.R2 Ewhsb OR Ce) ' SER 38 RR xi SE 78 RX Rl,D2tX2,02) I EfdwbmQA le) XC E&-Y EX llQ) tc&J HI0 fnaort Ghlmer IC KgY (w) 1SK Lwd LR --LBlu$ L LQadMdw8, LA eapdaml Twrt (4 LTR Lod lzawment (c) LCR Lordxdiliword LH mu^ I& Lod IrduWe Coharol (@a) LMC ' ~~w(C) LNR Loai aorith 4W LPR Losd PSW lFl(h,p) LPmY Losd Reid Ackkm (c,e$l LRA MVI MVC - WN !Eswmda ~uswithOfkDIst AII"VQ MamzoMa MWZ MulOb MR value of addms MMl*lv M Mu* Decbd tdb w AfiutalplVHsitmxd MH aR OR OR fc) 0 OR (d 01 5 Control versus Datapath ▪ Processor designs can be split between datapath, where numbers are stored and arithmetic operations computed, and control, which sequences operations on datapath ▪ Biggest challenge for early computer designers was getting control circuitry correct Control ! Maurice Wilkes invented the InstrucUon Control Lines Condion? idea of microprogramming to design the control unit of a processor, 1958 - Logic expensive compared to PC ALU ROM or RAM Datapath Registers Inst. Reg. - ROM cheaper than RAM - ROM much faster than RAM Busy? Address Data Main Memory 6 Microprogramming in IBM 360 Model M30 M40 M50 M65 Datapath width 8 bits 16 bits 32 bits 64 bits Microcode size 4k x 50 4k x 52 2.75k x 85 2.75k x 87 Clock cycle Ume (ROM) 750 ns 625 ns 500 ns 200 ns Main memory cycle Ume 1500 ns 2500 ns 2000 ns 750 ns Annual rental fee (1964 $) $48,000 $54,000 $115,000 $270,000 Annual rental fee (2015 $) $570,000 $650,000 $1,400,000 $3,200,000 7 IC technology, Microcode, and CISC ▪ Logic, RAM, ROM all implemented using MOS transistors ▪ Semiconductor RAM ≈ same speed as ROM ▪ With Moore’s Law, memory for control store could grow ▪ Allowed more complicated instruction sets (CISC) ▪ Minicomputer (TTL server) Example: Digital Equipment VAX ISA in 1978 8 Microprocessor Evolution ▪ Rapid progress in 1970s, fueled by advances in MOS technology, imitated minicomputers and mainframe ISAs ▪ Intel i432 - Most ambitious 1970s micro - started in 1975 - released 1981 - 32-bit capability-based object-oriented architecture - Instructions variable number of bits long - Heavily microcoded - Severe performance, complexity, and usability problems ▪ Intel 8086 (1978, 8MHz, 29,000 transistors) - “Stopgap” 16-bit processor, architected in 10 weeks - Extended accumulator architecture - Assembly-compatible with 8080 - 20-bit addressing through segmented addressing scheme ▪ IBM PC uses Intel 8088 for 8-bit bus (and Motorola 68000 was late) - Estimated sales of 250,000; 100,000,000s sold 9 Analyzing Microcoded Machines 1980s ▪ John Cocke and group at IBM - Working on a simple pipelined processor, 801 minicomputer (ECL server), and advanced compilers inside IBM - Ported experimental PL.8 compiler to IBM 370, only used simple register-register and load/store instructions similar to 801 - Code ran faster than other existing compilers that used all 370 instructions! - Up to 6 MIPS whereas 2 MIPS considered good before ▪ Emer and Clark at DEC - Found 20% of VAX instructions responsible for 60% of microcode, but only account for 0.2% of execution time! ▪ Patterson 1979 sabbatical at DEC - VAX microcode bugs ⇒ field repair, but field-repairable chips don’t make sense 10 From CISC to RISC ▪ Use fast RAM to build fast instruction cache of user- visible instructions, not fixed hardware microroutines - Contents of fast instruction memory change to fit what application needs right now ▪ Simple ISA => hardwired pipelined implementation - Compiled code only used a few CISC instructions - Simpler encoding allowed pipelined implementations ▪ Further benefit with integration - In early ‘80s, could finally fit 32-bit datapath + small caches on a single chip - No chip crossings in common case allows faster operation 11 Berkeley RISC Chips RISC-I (1982) Contains 44,420 transistors, fabbed in 5 µm NMOS, with a die area of 77 mm2, ran at 1 MHz. RISC-II (1983) contains 40,760 transistors, was fabbed in 3 µm NMOS, ran at 3 MHz, and the size is 60 mm2 Stanford built some too… 12 13 CISC vs. RISC Today ● PC Era ● PostPC Era: Client/Cloud ● Hardware translates ● IP in SoC vs. MPU x86 instructions into ● Value die area, energy as internal RISC much as performance instructions ● > 16B / year in 2014! ● Then use any RISC ● 98% RISC Processors: technique inside MPU 12.0B ARM (Advanced ● > 350M / year ! RISC Machine) ● x86 ISA eventually 2.0B Tensilica dominates servers as 1.5B ARC (Argonaut well as desktops RISC Core) 0.8B MIPS 14 VLIW: Very Long Instrucon Word Int Op 1 Int Op 2 Mem Op 1 Mem Op 2 FP Op 1 FP Op 2 Two Integer Units, Single Cycle Latency Two Load/Store Units, Three Cycle Latency Two Floa>ng-Point Units, Four Cycle Latency ▪ MulUple operaons packed into one instrucUon ▪ Each operaon slot is for a fixed funcUon ▪ Constant operaon latencies are specified ▪ Architecture requires guarantee of: - Parallelism within an instrucUon => no cross-operaon RAW check - No data use before data ready => no data interlocks 15 VLIW Compiler ResponsibiliQes ▪ Schedule operaons to maximize parallel execuUon ▪ Guarantees intra-instrucUon parallelism ▪ Schedule to avoid data hazards (no interlocks) - Typically separates operaons with explicit NOPs