
turing lecture

DOI:10.1145/3282307

Innovations like domain-specific hardware, enhanced security, open instruction sets, and agile chip development will lead the way.

BY JOHN L. HENNESSY AND DAVID A. PATTERSON

A New Golden Age for Computer Architecture

WE BEGAN OUR Turing Lecture June 4, 2018,11 with a review of computer architecture since the 1960s. In addition to that review, here, we highlight current challenges and identify future opportunities, projecting another golden age for the field of computer architecture in the next decade, much like the 1980s when we did the research that led to our award, delivering gains in cost, energy, and security, as well as performance.

"Those who cannot remember the past are condemned to repeat it." —George Santayana, 1905

key insights
˽˽ Software advances can inspire architecture innovation.
˽˽ Elevating the hardware/software interface creates opportunities for architecture innovation.
˽˽ The marketplace ultimately settles architecture debates.

Software talks to hardware through a vocabulary called an instruction set architecture (ISA). By the early 1960s, IBM had four incompatible lines of computers, each with its own ISA, software stack, I/O system, and market niche—targeting small business, large business, scientific, and real time, respectively. IBM engineers, including ACM A.M. Turing Award laureate Fred Brooks, Jr., thought they could create a single ISA that would efficiently unify all four of these ISA bases. They needed a technical solution for how computers as inexpensive as those with 8-bit data paths and as fast as those with 64-bit data paths could share a single ISA.

48 COMMUNICATIONS OF THE ACM | FEBRUARY 2019 | VOL. 62 | NO. 2

The data paths are the "brawn" of the processor in that they perform the arithmetic but are relatively easy to "widen" or "narrow." The greatest challenge for computer designers then and now is the "brains" of the processor—the control hardware. Inspired by software programming, computing pioneer and Turing laureate Maurice Wilkes proposed how to simplify control. Control was specified as a two-dimensional array he called a "control store." Each column of the array corresponded to one control line, each row was a microinstruction, and writing microinstructions was called microprogramming.39 A control store contains an ISA interpreter written using microinstructions, so execution of a conventional instruction takes several microinstructions. The control store was implemented through memory, which was much less costly than logic gates.

The table here lists four models of the new System/360 ISA IBM announced April 7, 1964. The data paths vary by a factor of 8, memory capacity by a factor of 16, clock rate by nearly 4, performance by 50, and cost by nearly 6. The most expensive computers had the widest control stores because more complicated data paths used more control lines. The least-costly computers had narrower control stores due to simpler hardware but needed more microinstructions since they took more clock cycles to execute a System/360 instruction.

Facilitated by microprogramming, IBM bet the future of the company that the new ISA would revolutionize the computing industry and won the bet. IBM dominated its markets, and IBM mainframe descendants of the

computer family announced 55 years


ago still bring in $10 billion in revenue per year.

As seen repeatedly, although the marketplace is an imperfect judge of technological issues, given the close ties between architecture and commercial computers, it eventually determines the success of architecture innovations that often require significant engineering investment.

Integrated circuits, CISC, 432, 8086, IBM PC. When computers began using integrated circuits, Moore's Law meant control stores could become much larger. Larger memories in turn allowed much more complicated ISAs. Consider that the control store of the VAX-11/780 from Digital Equipment Corp. in 1977 was 5,120 words × 96 bits, while its predecessor used only 256 words × 56 bits.

Some manufacturers chose to make microprogramming available by letting select customers add custom features they called "writable control store" (WCS). The most famous WCS computer was the Alto36 that Turing laureates Chuck Thacker and Butler Lampson, together with their colleagues, created for the Xerox Palo Alto Research Center in 1973. It was indeed the first personal computer, sporting the first bit-mapped display and first Ethernet local-area network. The device controllers for the novel display and network were microprograms stored in a 4,096-word × 32-bit WCS.

Microprocessors were still in the 8-bit era in the 1970s (such as the Intel 8080) and programmed primarily in assembly language. Rival designers would add novel instructions to outdo one another, showing their advantages through assembly language examples.

Gordon Moore believed Intel's next ISA would last the lifetime of Intel, so he hired many clever computer science Ph.D.'s and sent them to a new facility in Portland to invent the next great ISA. The 8800, as Intel originally named it, was an ambitious computer architecture project for any era, certainly the most aggressive of the 1980s. It had 32-bit capability-based addressing, object-oriented architecture, variable-bit-length instructions, and its own operating system written in the then-new programming language Ada.

This ambitious project was alas several years late, forcing Intel to start an emergency replacement effort in Santa Clara to deliver a 16-bit microprocessor in 1979. Intel gave the new team 52 weeks to develop the new "8086" ISA and design and build the chip. Given the tight schedule, designing the ISA took only 10 person-weeks over three regular calendar weeks, essentially by extending the 8-bit registers and instruction set of the 8080 to 16 bits. The team completed the 8086 on schedule but to little fanfare when announced.

To Intel's great fortune, IBM was developing a personal computer to compete with the Apple II and needed a 16-bit microprocessor. IBM was interested in the Motorola 68000, which had an ISA similar to the IBM 360, but it was behind IBM's aggressive schedule. IBM switched instead to an 8-bit bus version of the 8086. When IBM announced the PC on August 12, 1981, the hope was to sell 250,000 PCs by 1986. The company instead sold 100 million worldwide, bestowing a very bright future on the emergency replacement Intel ISA.

Intel's original 8800 project was renamed iAPX-432 and finally announced in 1981, but it required several chips and had severe performance problems. It was discontinued in 1986, the year after Intel extended the 16-bit 8086 ISA in the 80386 by expanding its registers from 16 bits to 32 bits. Moore's prediction was thus correct that the next ISA would last as long as Intel did, but the marketplace chose the emergency replacement 8086 rather than the anointed 432. As the architects of the Motorola 68000 and iAPX-432 both learned, the marketplace is rarely patient.

Features of four models of the IBM System/360 family; IPS is instructions per second.

Model                         M30               M40               M50               M65
Datapath width                8 bits            16 bits           32 bits           64 bits
Control store size            4k x 50           4k x 52           2.75k x 85        2.75k x 87
Clock rate (ROM cycle time)   1.3 MHz (750 ns)  1.6 MHz (625 ns)  2 MHz (500 ns)    5 MHz (200 ns)
Memory capacity               8–64 KiB          16–256 KiB        64–512 KiB        128–1,024 KiB
Performance (commercial)      29,000 IPS        75,000 IPS        169,000 IPS       567,000 IPS
Performance (scientific)      10,200 IPS        40,000 IPS        133,000 IPS       563,000 IPS
Price (1964 $)                $192,000          $216,000          $460,000          $1,080,000
Price (2018 $)                $1,560,000        $1,760,000        $3,720,000        $8,720,000

Figure 1. University of California, Berkeley, RISC-I and Stanford University MIPS microprocessors.

From complex to reduced instruction set computers. The early 1980s saw several investigations into complex instruction set computers (CISC) enabled by the big microprograms in the larger control stores. With Unix demonstrating that even operating systems could use high-level languages, the critical question became: "What instructions would compilers generate?" instead of "What assembly language would programmers use?" Significantly raising the hardware/software interface created an opportunity for architecture innovation.

Turing laureate John Cocke and his colleagues developed simpler ISAs and compilers for minicomputers. As an experiment, they retargeted their research compilers to use only the simple register-register operations and load-store data transfers of the IBM 360 ISA, avoiding the more complicated instructions. They found that programs ran up to three times faster using the simple subset. Emer and Clark6 found 20% of the VAX instructions needed 60% of the microcode and represented only 0.2% of the execution time. One author (Patterson) spent a sabbatical at DEC to help reduce bugs in VAX microcode. If microprocessor manufacturers were going to follow the CISC ISA designs of the larger computers, he thought they would need a way to repair the microcode bugs. He wrote such a paper,31 but the journal Computer rejected it. Reviewers opined that it was a terrible idea to build microprocessors with ISAs so complicated that they needed to be repaired in the field. That rejection called into question the value of CISC ISAs for microprocessors. Ironically, modern CISC microprocessors do indeed include microcode repair mechanisms, but the main result of his paper rejection was to inspire him to work on less-complex ISAs for microprocessors—reduced instruction set computers (RISC).

These observations and the shift to high-level languages led to the opportunity to switch from CISC to RISC. First, the RISC instructions were simplified so there was no need for a microcoded interpreter. The RISC instructions were typically as simple as microinstructions and could be executed directly by the hardware. Second, the fast memory, formerly used for the microcode interpreter of a CISC ISA, was repurposed to be a cache of RISC instructions. (A cache is a small, fast memory that buffers recently executed instructions, as such instructions are likely to be reused soon.) Third, register allocators based on Gregory Chaitin's graph-coloring scheme made it much easier for compilers to efficiently use registers, which benefited these register-register ISAs.3 Finally, Moore's Law meant there were enough transistors in the 1980s to include a full 32-bit datapath, along with instruction and data caches, in a single chip.

For example, Figure 1 shows the RISC-I8 and MIPS12 microprocessors developed at the University of California, Berkeley, and Stanford University in 1982 and 1983, respectively, that demonstrated the benefits of RISC. These chips were eventually presented at the leading circuit conference, the IEEE International Solid-State Circuits Conference, in 1984.33,35 It was a remarkable moment when a few graduate students at Berkeley and Stanford could build microprocessors that were arguably superior to what industry could build.

These academic chips inspired many companies to build RISC microprocessors, which were the fastest for the next 15 years. The explanation is due to the following formula for processor performance:

Time/Program = Instructions/Program × Clock cycles/Instruction × Time/Clock cycle

DEC engineers later showed2 that the more complicated CISC ISA executed about 75% of the number of instructions per program as RISC (the first term), but in a similar technology CISC executed about five to six more clock cycles per instruction (the second term), making RISC microprocessors approximately 4× faster (0.75 × 5.5 ≈ 4).

Such formulas were not part of computer architecture books in the 1980s, leading us to write Computer Architecture: A Quantitative Approach13 in 1989. The subtitle suggested the theme of the book: Use measurements and benchmarks to evaluate trade-offs quantitatively instead of relying more on the architect's intuition and experience, as in the past. The quantitative approach we used was also inspired by what Turing laureate Donald Knuth's book had done for algorithms.20
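The formula above can be checked directly. Here is a minimal sketch in Python; the instruction count and clock period are made-up illustrative values, and a RISC CPI of 1 is assumed. Only the two ratios (about 0.75x the instructions, but about 5.5x the cycles per instruction for CISC) come from the DEC study cited in the text:

```python
# Iron law of processor performance, as in the text:
#   Time/Program = Instructions/Program x Clock cycles/Instruction x Time/Clock cycle
def time_per_program(instructions, cycles_per_instruction, clock_period_ns):
    """Execution time in nanoseconds for one run of a program."""
    return instructions * cycles_per_instruction * clock_period_ns

# Illustrative numbers only; just the ratios matter for the comparison.
risc_time = time_per_program(1_000_000, 1.0, 10)   # assumed CPI of 1
cisc_time = time_per_program(750_000, 5.5, 10)     # 0.75x instructions, 5.5x CPI

print(cisc_time / risc_time)  # ~4: RISC is roughly four times faster
```

Because the instruction count and clock period cancel in the ratio, the roughly 4x result depends only on the two measured factors.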
VLIW, EPIC, Itanium. The next ISA innovation was supposed to succeed both RISC and CISC. Very long instruction word (VLIW)7 and its cousin, the explicitly parallel instruction computer (EPIC), the name Intel and Hewlett Packard gave to the approach, used wide instructions with multiple independent operations bundled together in each instruction. VLIW and EPIC advocates at the time believed if a single instruction could specify, say, six independent


Figure 2. Transistors per chip of Intel microprocessors vs. Moore's Law (1975 version).

Figure 3. Transistors per chip and power per mm².

operations—two data transfers, two integer operations, and two floating point operations—and compiler technology could efficiently assign operations into the six instruction slots, the hardware could be made simpler. Like the RISC approach, VLIW and EPIC shifted work from the hardware to the compiler.

Working together, Intel and Hewlett Packard designed a 64-bit processor based on EPIC ideas to replace the 32-bit x86. High expectations were set for the first EPIC processor, called Itanium by Intel and Hewlett Packard, but the reality did not match its developers' early claims. Although the EPIC approach worked well for highly structured floating-point programs, it struggled to achieve high performance for integer programs that had less-predictable cache misses or less-predictable branches. As Donald Knuth later noted:21 "The Itanium approach ... was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write." Pundits noted delays and underperformance of Itanium and rechristened it "Itanic" after the ill-fated Titanic passenger ship. The marketplace again eventually ran out of patience, leading to a 64-bit version of the x86 as the successor to the 32-bit x86, and not Itanium. The good news is VLIW still matches narrower applications with small programs and simpler branches that omit caches, including digital-signal processing.

RISC vs. CISC in the PC and Post-PC Eras
AMD and Intel used 500-person design teams and superior semiconductor technology to close the performance gap between x86 and RISC. Again inspired by the performance advantages of pipelining simple vs. complex instructions, the instruction decoder translated the complex x86 instructions into internal RISC-like microinstructions on the fly. AMD and Intel then pipelined the execution of the RISC microinstructions. Any ideas RISC designers were using for performance—separate instruction and data caches, second-level caches on chip, deep pipelines, and fetching and executing several instructions simultaneously—could then be incorporated into the x86. AMD and Intel shipped roughly 350 million x86 microprocessors annually at the peak of the PC era in 2011. The high volumes and low margins of the PC industry also meant lower prices than RISC computers.

Given the hundreds of millions of PCs sold worldwide each year, PC software became a giant market. Whereas software providers for the Unix marketplace would offer different software versions for the different commercial RISC ISAs—Alpha, HP-PA, MIPS, Power, and SPARC—the PC market enjoyed a single ISA, so software developers shipped "shrink wrap" software that was binary compatible with only the x86 ISA. A much larger software base, similar performance, and lower prices led the x86 to dominate both desktop computers and small-server markets by 2000.

Apple helped launch the post-PC era with the iPhone in 2007. Instead of buying microprocessors, smartphone companies built their own systems on a chip (SoC) using designs from other companies, including RISC processors from ARM. Mobile-device designers valued die area and energy efficiency as much as performance, disadvantaging CISC ISAs. Moreover, arrival of the Internet of Things vastly increased both the number of processors and the required trade-offs in die size, power, cost, and performance. This trend increased the importance of design time and cost, further disadvantaging CISC processors. In today's post-PC era, x86 shipments have fallen almost 10% per year since the peak in 2011, while chips with RISC processors have skyrocketed to 20 billion. Today, 99% of 32-bit and 64-bit processors are RISC.

Concluding this historical review, we can say the marketplace settled the RISC-CISC debate; CISC won the later stages of the PC era, but RISC is winning the post-PC era. There have been no new CISC ISAs in decades. To our surprise, the consensus on the best


ISA principles for general-purpose processors today is still RISC, 35 years after their introduction.

Current Challenges for Processor Architecture
"If a problem has no solution, it may not be a problem, but a fact—not to be solved, but to be coped with over time." —Shimon Peres

While the previous section focused on the design of the instruction set architecture (ISA), most computer architects do not design new ISAs but implement existing ISAs in the prevailing implementation technology. Since the late 1970s, the technology of choice has been metal oxide semiconductor (MOS)-based integrated circuits, first n-type metal-oxide semiconductor (nMOS) and then complementary metal-oxide semiconductor (CMOS). The stunning rate of improvement in MOS technology—captured in Gordon Moore's predictions—has been the driving factor enabling architects to design more-aggressive methods for achieving performance for a given ISA. Moore's original prediction in 196526 called for a doubling in transistor density yearly; in 1975, he revised it, projecting a doubling every two years.28 It eventually became called Moore's Law. Because transistor density grows quadratically while speed grows linearly, architects used more transistors to improve performance.

End of Moore's Law and Dennard Scaling
Although Moore's Law held for many decades (see Figure 2), it began to slow sometime around 2000 and by 2018 showed a roughly 15-fold gap between Moore's prediction and current capability, an observation Moore made in 2003 that was inevitable.27 The current expectation is that the gap will continue to grow as CMOS technology approaches fundamental limits.

Accompanying Moore's Law was a projection made by Robert Dennard called "Dennard scaling," stating that as transistor density increased, power consumption per transistor would drop, so the power per mm² of silicon would be near constant. Since the computational capability of a mm² of silicon was increasing with each new generation of technology, computers would become more energy efficient. Dennard scaling began to slow significantly in 2007 and faded to almost nothing by 2012 (see Figure 3).

Between 1986 and about 2002, the exploitation of instruction level parallelism (ILP) was the primary architectural method for gaining performance and, along with improvements in speed of transistors, led to an annual performance increase of approximately 50%. The end of Dennard scaling meant architects had to find more efficient ways to exploit parallelism.

To understand why increasing ILP caused greater inefficiency, consider a modern processor core like those from ARM, Intel, and AMD. Assume it has a 15-stage pipeline and can issue four instructions every clock cycle. It thus has up to 60 instructions in the pipeline at any moment in time, including approximately 15 branches, as they represent approximately 25% of executed instructions. To keep the pipeline full, branches are predicted and code is speculatively placed into the pipeline for execution. The use of speculation is both the source of ILP performance and of inefficiency. When branch prediction is perfect, speculation improves performance yet involves little added energy cost—it can even save energy—but when it "mispredicts" branches, the processor must throw away the incorrectly speculated instructions, and their computational work and energy are wasted. The internal state of the processor must also be restored to the state that existed before the mispredicted branch, expending additional time and energy.

Figure 4. Wasted instructions as a percentage of all instructions completed on an Intel Core i7 for a variety of SPEC integer benchmarks.

Figure 5. Effect of Amdahl's Law on speedup as a fraction of clock cycle time in serial mode.

To see how challenging such a design is, consider the difficulty of correctly predicting the outcome of 15 branches.
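The compounding behind these numbers can be sketched in a few lines of Python. This is illustrative only: it assumes the 15 in-flight branches above and treats their predictions as independent, so the probabilities simply multiply:

```python
# With a 15-stage, 4-issue pipeline, roughly 15 branches are in flight at
# once; the chance that all of them are predicted correctly compounds
# multiplicatively (assuming independent predictions).
branches_in_flight = 15
per_branch_accuracy = 0.993

all_correct = per_branch_accuracy ** branches_in_flight
print(round(all_correct, 2))  # 0.9, i.e., ~10% of the work is wasted

# Inverting: the per-branch accuracy required to cap wasted work at 10%.
required_accuracy = 0.90 ** (1 / branches_in_flight)
print(round(required_accuracy, 3))  # 0.993
```

The exponentiation is what makes deep speculation so demanding: even excellent per-branch predictors leave a sizable chance that some branch among the 15 is wrong.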


If a processor architect wants to limit wasted work to only 10% of the time, the processor must predict each branch correctly 99.3% of the time. Few general-purpose programs have branches that can be predicted so accurately.

To appreciate how this wasted work adds up, consider the data in Figure 4, showing the fraction of instructions that are effectively executed but turn out to be wasted because the processor speculated incorrectly. On average, 19% of the instructions are wasted for these benchmarks on an Intel Core i7. The amount of wasted energy is greater, however, since the processor must use additional energy to restore the state when it speculates incorrectly. Measurements like these led many to conclude architects needed a different approach to achieve performance improvements. The multicore era was thus born.

Multicore shifted responsibility for identifying parallelism and deciding how to exploit it to the programmer and to the language system. Multicore does not resolve the challenge of energy-efficient computation that was exacerbated by the end of Dennard scaling. Each active core burns power whether or not it contributes effectively to the computation. A primary hurdle is an old observation, called Amdahl's Law, stating that the speedup from a parallel computer is limited by the portion of a computation that is sequential. To appreciate the importance of this observation, consider Figure 5, showing how much faster an application runs with up to 64 cores compared to a single core, assuming different portions of serial execution, where only one processor is active. For example, when only 1% of the time is serial, the speedup for a 64-processor configuration is about 35. Unfortunately, the power needed is proportional to 64 processors, so approximately 45% of the energy is wasted.

Real programs have more complex structures of course, with portions that allow varying numbers of processors to be used at any given moment in time. Nonetheless, the need to communicate and synchronize periodically means most applications have some portions that can effectively use only a fraction of the processors.
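As a sketch, the textbook form of Amdahl's Law and the energy accounting above can be computed directly. Note that the idealized formula gives a slightly larger speedup (about 39) than the roughly 35 the text reads off Figure 5; the 45% wasted-energy figure follows from the value of 35:

```python
# Amdahl's Law in its textbook form: speedup on n processors when a
# fraction s of the work is strictly serial.
def amdahl_speedup(n, s):
    return 1.0 / (s + (1.0 - s) / n)

ideal = amdahl_speedup(64, 0.01)
print(round(ideal, 1))  # about 39 for the idealized formula

# Energy accounting from the text: all 64 cores burn power, so with a
# speedup of about 35 (as read from Figure 5), only 35/64 of the energy
# does useful work.
wasted_fraction = 1 - 35 / 64
print(round(wasted_fraction, 2))  # 0.45, i.e., ~45% wasted
```

Raising the processor count far above 1/s buys little extra speedup while the energy bill keeps growing linearly, which is the law's sting in the multicore era.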
Although state when it speculates incorrectly. observation, consider Figure 5, show- Amdahl’s Law is more than 50 years Measurements like these led many to ing how much faster an application old, it remains a difficult hurdle. conclude architects needed a differ- runs with up to 64 cores compared to With the end of Dennard scaling, increasing the number of cores on a Figure 6. Growth of computer performance using integer programs (SPECintCPU). chip meant power is also increasing at nearly the same rate. Unfortunately, End of the Line ⇒ 2X/20 years (3%/yr) the power that goes into a processor Amdahl’s Law ⇒ 2X/6 years (12%/year) must also be removed as heat. Mul- End of Dennard Scaling ⇒ Multicore 2X/3.5 years (23%/year) ticore processors are thus limited by CISC 2X/2.5 years RISC 2X/1.5 years the thermal dissipation power (TDP), (22%/year) (52%/year) 100,000 or average amount of power the pack- age and cooling system can remove. Although some high-end data centers 10,000 may use more advanced packages and cooling technology, no computer us- 1,000 ers would want to put a small heat exchanger on their desks or wear a ra- 100 diator on their backs to cool their cell- phones. The limit of TDP led directly 10 to the era of “dark silicon,” whereby Performance vs. VAX11-780 Performance processors would slow on the clock 1 rate and turn off idle cores to prevent 1980 1985 1990 1995 2000 2005 2010 2015 overheating. Another way to view this approach is that some chips can real- locate their precious power from the Figure 7. Potential speedup of matrix multiply in Python for four optimizations. idle cores to the active ones. An era without Dennard scaling, Matrix Multiply Speedup Over Native Python along with reduced Moore’s Law and 62,806 Amdahl’s Law in full effect means 100,000 inefficiency limits improvement in 6,727 performance to only a few percent 10,000 per year (see Figure 6). Achieving 366 higher rates of performance improve- 1,000 ment—as was seen in the 1980s and

Speedup 1990s—will require new architec- 100 47 tural approaches that use the inte-

10 grated-circuit capability much more efficiently. We will return to what ap- 1 1 proaches might work after discussing Python C + parallel + memory + SIMD another major shortcoming of mod- loops optimization instructions ern computers—their support, or lack thereof, for computer security.


Overlooked Security
In the 1970s, processor architects focused significant attention on enhancing computer security with concepts ranging from protection rings to capabilities. It was well understood by these architects that most bugs would be in software, but they believed architectural support could help. These features were largely unused by operating systems that were deliberately focused on supposedly benign environments (such as personal computers), and the features involved significant overhead then, so were eliminated. In the software community, many thought formal verification and techniques like microkernels would provide effective mechanisms for building highly secure software. Unfortunately, the scale of our collective software systems and the drive for performance meant such techniques could not keep up with processor performance. The result is large software systems continue to have many security flaws, with the effect amplified due to the vast and increasing amount of personal information online and the use of cloud-based computing, which shares physical hardware among potential adversaries.

Although computer architects and others were perhaps slow to realize the growing importance of security, they began to include hardware support for virtual machines and encryption. Unfortunately, speculation introduced an unknown but significant security flaw into many processors. In particular, the Meltdown and Spectre security flaws led to new vulnerabilities that exploit flaws in the microarchitecture, allowing leakage of protected information at a high rate.14 Both Meltdown and Spectre use so-called side-channel attacks whereby information is leaked by observing the time taken for a task and converting information invisible at the ISA level into a timing-visible attribute. In 2018, researchers showed how to exploit one of the Spectre variants to leak information over a network without the attacker loading code onto the target processor.34 Although this attack, called NetSpectre, leaks information slowly, the fact that it allows any machine on the same local-area network (or within the same cluster in a cloud) to be attacked creates many new vulnerabilities. Two more vulnerabilities in the virtual-machine architecture were subsequently reported.37,38 One of them, called Foreshadow, allows penetration of the Intel SGX security mechanisms designed to protect the highest-risk data (such as encryption keys). New vulnerabilities are being discovered monthly.

Side-channel attacks are not new, but in most earlier cases, a software flaw allowed the attack to succeed. In the Meltdown, Spectre, and other attacks, it is a flaw in the hardware implementation that exposes protected information. There is a fundamental difficulty in how processor architects define what is a correct implementation of an ISA, because the standard definition says nothing about the performance effects of executing an instruction sequence, only about the ISA-visible architectural state of the execution. Architects need to rethink their definition of a correct implementation of an ISA to prevent such security flaws. At the same time, they should be rethinking the attention they pay computer security and how architects can work with software designers to implement more-secure systems. Architects (and everyone else) depend too much on information systems to willingly allow security to be treated as anything less than a first-class design concern.

Future Opportunities in Computer Architecture
"What we have before us are some breathtaking opportunities disguised as insoluble problems." —John Gardner, 1965

Inherent inefficiencies in general-purpose processors, whether from ILP techniques or multicore, combined with the end of Dennard scaling and Moore's Law, make it highly unlikely, in our view, that processor architects and designers can sustain significant rates of performance improvements in general-purpose processors. Given the importance of improving performance to enable new software capabilities, we must ask: What other approaches might be promising?

There are two clear opportunities, as well as a third created by combining the two. First, existing techniques for building software make extensive use of high-level languages with dynamic typing and storage management.


Figure 8. Functional organization of Google Tensor Processing Unit (TPU v1). The block diagram shows a systolic 256×256×8-bit Matrix Multiply Unit (64K MACs, 29% of the chip), a 24-MiB Unified Buffer for activations (29% of the chip), 4-MiB Accumulators, a weight FIFO fed from off-chip DDR3 DRAM, an activation pipeline, and a PCIe host interface.

Unfortunately, such languages are typically interpreted and execute very inefficiently. Leiserson et al.24 used a small example—performing matrix multiply—to illustrate this inefficiency. As in Figure 7, simply rewriting the code in C from Python—a typical high-level, dynamically typed language—increases performance 47-fold. Using parallel loops running on many cores yields a factor of approximately 7. Optimizing the memory layout to exploit caches yields a factor of 20, and a final factor of 9 comes from using the hardware extensions for doing single instruction multiple data (SIMD) parallelism operations that are able to perform 16 32-bit operations per instruction. All told, the final, highly optimized version runs more than 62,000× faster on a multicore Intel processor compared to the original Python version. This is of course a small example, one for which one might expect programmers to use an optimized library. Although it exaggerates the usual performance gap, there are likely many programs for which factors of 100 to 1,000 could be achieved.

An interesting research direction concerns whether some of the performance gap can be closed with new compiler technology, possibly assisted by architectural enhancements. Although the challenges in efficiently translating and implementing high-level scripting languages like Python are difficult, the potential gain is enormous. Achieving even 25% of the potential gain could result in Python programs running tens to hundreds of times faster. This simple example illustrates how great the gap is between modern languages emphasizing programmer productivity and traditional approaches emphasizing performance.
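The per-step factors quoted above can be recovered from the cumulative speedups reported in Figure 7; the exact ratios work out to 47, 7.8, 18.4, and 9.3, matching the rounded factors in the text. A small sketch:

```python
# Cumulative speedups over naive Python for matrix multiply, as reported
# in Figure 7 (Leiserson et al.); each optimization builds on the last.
cumulative = [
    ("Python", 1),
    ("C", 47),
    ("+ parallel loops", 366),
    ("+ memory optimization", 6_727),
    ("+ SIMD instructions", 62_806),
]

# The factor each individual step contributes is the ratio of
# successive cumulative speedups.
step_factors = {}
for (_, prev), (name, cum) in zip(cumulative, cumulative[1:]):
    step_factors[name] = cum / prev
    print(f"{name}: {cum / prev:.1f}x over the previous step, {cum:,}x over Python")
```

Multiplying the per-step factors reproduces the overall 62,806x figure, which is why the individual optimizations compound so dramatically.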
A for four main reasons: 16 32-bit operations per instruction. more hardware-centric approach is to First and most important, DSAs All told, the final, highly optimized ver- design architectures tailored to a spe- exploit a more efficient form of par- sion runs more than 62,000× faster on cific problem domain and offer signif- allelism for the specific domain. For a multicore Intel processor compared icant performance (and efficiency) example, single-instruction multiple to the original Python version. This is of gains for that domain, hence, the data parallelism (SIMD), is more ef- course a small example, one might ex- name “domain-specific architectures” ficient than multiple instruction mul- pect programmers to use an optimized (DSAs), a class of processors tailored tiple data (MIMD) because it needs to library for. Although it exaggerates the for a specific domain—programmable fetch only one instruction stream and usual performance gap, there are likely and often Turing-complete but tai- processing units operate in lockstep.9 many programs for which factors of 100 lored to a specific class of applica- Although SIMD is less flexible than to 1,000 could be achieved. tions. In this sense, they differ from MIMD, it is a good match for many


DSAs. DSAs may also use VLIW approaches to ILP rather than speculative out-of-order mechanisms. As mentioned earlier, VLIW processors are a poor match for general-purpose code15 but for limited domains can be much more efficient, since the control mechanisms are simpler. In particular, most high-end general-purpose processors are out-of-order superscalars that require complex control logic for both instruction initiation and instruction completion. In contrast, VLIWs perform the necessary analysis and scheduling at compile time, which can work well for an explicitly parallel program.

Second, DSAs can make more effective use of the memory hierarchy. Memory accesses have become much more costly than arithmetic computations, as noted by Horowitz.16 For example, accessing a block in a 32-kilobyte cache involves an energy cost approximately 200× higher than a 32-bit integer add. This enormous differential makes optimizing memory accesses critical to achieving high energy efficiency. General-purpose processors run code in which memory accesses typically exhibit spatial and temporal locality but are otherwise not very predictable at compile time. CPUs thus use multilevel caches to increase bandwidth and hide the latency in relatively slow, off-chip DRAMs. These multilevel caches often consume approximately half the energy of the processor but avoid almost all accesses to the off-chip DRAMs that require approximately 10× the energy of a last-level cache access. Caches have two notable disadvantages:

When datasets are very large. Caches simply do not work well when datasets are very large and also have low temporal or spatial locality; and

When caches work well. When caches work well, the locality is very high, meaning, by definition, most of the cache is idle most of the time.

In applications where the memory-access patterns are well defined and discoverable at compile time, which is true of typical DSLs, programmers and compilers can optimize the use of the memory better than can dynamically allocated caches. DSAs thus usually use a hierarchy of memories with movement controlled explicitly by the software, similar to how vector processors operate. For suitable applications, user-controlled memories can use much less energy than caches.

Third, DSAs can use less precision when it is adequate. General-purpose CPUs usually support 32- and 64-bit integer and floating-point (FP) data. For many applications in machine learning and graphics, this is more accuracy than is needed. For example, in deep neural networks (DNNs), inference regularly uses 4-, 8-, or 16-bit integers, improving both data and computational throughput. Likewise, for DNN training applications, FP is useful, but 32 bits is enough and 16 bits often works.

Finally, DSAs benefit from targeting programs written in domain-specific languages (DSLs) that expose more parallelism, improve the structure and representation of memory access, and make it easier to map the application efficiently to a domain-specific processor.

Domain-Specific Languages

DSAs require targeting of high-level operations to the architecture, but trying to extract such structure and information from a general-purpose language like Python, Java, C, or Fortran is simply too difficult. Domain-specific languages (DSLs) enable this process and make it possible to program DSAs efficiently. For example, DSLs can make vector, dense matrix, and sparse matrix operations explicit, enabling the DSL compiler to map the operations to the processor efficiently. Examples of DSLs include Matlab, a language for operating on matrices; TensorFlow, a dataflow language used for programming DNNs; P4, a language for programming SDNs; and Halide, a language for image processing specifying high-level transformations.

The challenge when using DSLs is how to retain enough architecture independence that software written in a DSL can be ported to different architectures while also achieving high efficiency in mapping the software to the underlying DSA. For example, the XLA system translates TensorFlow to heterogeneous processors that use Nvidia GPUs or Tensor Processing Units (TPUs).40 Balancing portability among DSAs along with efficiency is an interesting research challenge for language designers, compiler creators, and DSA architects.

Example DSA: TPU v1. As an example DSA, consider the Google TPU v1, which was designed to accelerate neural net inference.17,18 The TPU has been in production since 2015 and powers applications ranging from search queries to language translation to image recognition to AlphaGo and AlphaZero, the DeepMind programs for playing Go and Chess. The goal was to improve the performance and energy efficiency of deep neural net inference by a factor of 10.

Figure 9. Agile hardware development methodology. [Diagram: nested prototyping levels, innermost to outermost—C++ software simulator, FPGA, ASIC flow, tape-in, tape-out, and big-chip tape-out.]

As shown in Figure 8, the TPU organization is radically different from a


general-purpose processor. The main computational unit is a matrix unit, a systolic array22 structure that provides 256 × 256 multiply-accumulates every clock cycle. The combination of 8-bit precision, a highly efficient systolic structure, SIMD control, and dedication of significant chip area to this function means the number of multiply-accumulates per clock cycle is approximately 100× what a general-purpose single-core CPU can sustain. Rather than caches, the TPU uses a local memory of 24 megabytes, approximately double that of a 2015 general-purpose CPU with the same power dissipation. Finally, both the activation memory and the weight memory (including a FIFO structure that holds weights) are linked through user-controlled high-bandwidth memory channels. Using a weighted arithmetic mean based on six common inference problems in Google data centers, the TPU is 29× faster than a general-purpose CPU. Since the TPU requires less than half the power, it has an energy efficiency for this workload that is more than 80× better than a general-purpose CPU.

Summary

We have considered two different approaches to improve program performance by improving efficiency in the use of hardware technology: first, by improving the performance of modern high-level languages that are typically interpreted; and second, by building domain-specific architectures that greatly improve performance and efficiency compared to general-purpose CPUs. DSLs are another example of how to improve the hardware/software interface that enables architecture innovations like DSAs. Achieving significant gains through such approaches will require a vertically integrated design team that understands applications, domain-specific languages and related compiler technology, computer architecture and organization, and the underlying implementation technology. The need to vertically integrate and make design decisions across levels of abstraction was characteristic of much of the early work in computing before the industry became horizontally structured. In this new era, vertical integration has become more important, and teams that can examine and make complex trade-offs and optimizations will be advantaged.

This opportunity has already led to a surge of architecture innovation, attracting many competing architectural philosophies:

GPUs. Nvidia GPUs use many cores, each with large register files, many hardware threads, and caches;4

TPUs. Google TPUs rely on large two-dimensional systolic multipliers and software-controlled on-chip memories;17

FPGAs. Microsoft deploys field-programmable gate arrays (FPGAs) in its data centers that it tailors to neural network applications;10 and

CPUs. Intel offers CPUs with many cores enhanced by large multilevel caches and one-dimensional SIMD instructions, the kind of FPGAs used by Microsoft, and a new neural network processor that is closer to a TPU than to a CPU.19

In addition to these large players, dozens of startups are pursuing their own proposals.25 To meet growing demand, architects are interconnecting hundreds to thousands of such chips to form neural-network supercomputers.

This avalanche of DNN architectures makes for interesting times in computer architecture. It is difficult to predict in 2019 which (or even if any) of these many directions will win, but the marketplace will surely settle the competition just as it settled the architectural debates of the past.

Open Architectures

Inspired by the success of open source software, the second opportunity in computer architecture is open ISAs. To create a "Linux for processors," the field needs industry-standard open ISAs so the community can create open source cores, in addition to individual companies owning proprietary ones. If many organizations design processors using the same ISA, the greater competition may drive even quicker innovation. The goal is to provide processors for chips that cost from a few cents to $100.

The first example is RISC-V (called "RISC Five"), the fifth RISC architecture developed at the University of California, Berkeley.32 RISC-V has a community that maintains the architecture under the stewardship of the RISC-V Foundation (http://riscv.org/). Being open allows the ISA evolution to occur in public, with hardware and software experts collaborating before decisions are finalized. An added benefit of an open foundation is that the ISA is unlikely to expand primarily for marketing reasons, sometimes the only explanation for extensions of proprietary instruction sets.

RISC-V is a modular instruction set. A small base of instructions runs the full open source software stack, followed by optional standard extensions designers can include or omit depending on their needs. This base includes 32-bit-address and 64-bit-address versions. RISC-V can grow only through optional extensions; the software stack still runs fine even if architects do not embrace new extensions. Proprietary architectures generally require upward binary compatibility, meaning when a processor company adds a new feature, all future processors must also include it. Not so for RISC-V, whereby all enhancements are optional and can be deleted if not needed by an application. Here are the standard extensions so far, using initials that stand for their full names:

M. Integer multiply/divide;
A. Atomic memory operations;
F/D. Single/double-precision floating-point; and
C. Compressed instructions.

A third distinguishing feature of RISC-V is the simplicity of the ISA. While not readily quantifiable, here are two comparisons to the ARMv8 architecture, as developed by the ARM company contemporaneously:

Fewer instructions. RISC-V has many fewer instructions. There are 50 in the base that are surprisingly similar in number and nature to the original RISC-I.30 The remaining standard extensions—M, A, F, and D—add 53 instructions, plus C adds another 34, totaling 137. ARMv8 has more than 500; and

Fewer instruction formats. RISC-V has many fewer instruction formats, six, while ARMv8 has at least 14.

Simplicity reduces the effort to both design processors and verify hardware correctness. As the RISC-V targets range from data-center chips to IoT devices, design verification can be a significant part of the cost of development.

Fourth, RISC-V is a clean-slate design, starting 25 years later, letting its

architects learn from the mistakes of its predecessors. Unlike first-generation RISC architectures, it avoids microarchitecture- or technology-dependent features (such as delayed branches and delayed loads) or innovations (such as register windows) that were superseded by advances in compiler technology.

Finally, RISC-V supports DSAs by reserving a vast opcode space for custom accelerators.

Beyond RISC-V, Nvidia also announced (in 2017) a free and open architecture29 it calls Nvidia Deep Learning Accelerator (NVDLA), a scalable, configurable DSA for machine-learning inference. Configuration options include data type (int8, int16, or fp16) and the size of the two-dimensional multiply matrix. Die size scales from 0.5 mm² to 3 mm² and power from 20 milliwatts to 300 milliwatts. The ISA, software stack, and implementation are all open.

Open simple architectures are synergistic with security. First, security experts do not believe in security through obscurity, so open implementations are attractive, and open implementations require an open architecture. Equally important is increasing the number of people and organizations who can innovate around secure architectures. Proprietary architectures limit participation to employees, but open architectures allow all the best minds in academia and industry to help with security. Finally, the simplicity of RISC-V makes its implementations easier to check. Moreover, the open architectures, implementations, and software stacks, plus the plasticity of FPGAs, mean architects can deploy and evaluate novel solutions online and iterate them weekly instead of annually. While FPGAs are 10× slower than custom chips, such performance is still fast enough to support online users and thus subject security innovations to real attackers. We expect open architectures to become the exemplar for hardware/software co-design by architects and security experts.

Agile Hardware Development

The Manifesto for Agile Software Development (2001) by Beck et al.1 revolutionized software development, overcoming the frequent failure of the traditional elaborate planning and documentation in waterfall development. Small programming teams quickly developed working-but-incomplete prototypes and got customer feedback before starting the next iteration. The scrum version of agile development assembles teams of five to 10 programmers doing sprints of two to four weeks per iteration.

Once again inspired by a software success, the third opportunity is agile hardware development. The good news for architects is that modern electronic computer-aided design (ECAD) tools raise the level of abstraction, enabling agile development, and this higher level of abstraction increases reuse across designs.

It seems implausible to claim that sprints of four weeks apply to hardware, given the months between when a design is "taped out" and a chip is returned. Figure 9 outlines how an agile development method can work by changing the prototype at the appropriate level.23 The innermost level is a software simulator, the easiest and quickest place to make changes if a simulator could satisfy an iteration. The next level is FPGAs that can run hundreds of times faster than a detailed software simulator. FPGAs can run operating systems and full benchmarks like those from the Standard Performance Evaluation Corporation (SPEC), allowing much more precise evaluation of prototypes. Amazon Web Services offers FPGAs in the cloud, so architects can use FPGAs without needing to first buy hardware and set up a lab. To have documented numbers for die area and power, the next outer level uses the ECAD tools to generate a chip's layout. Even after the tools are run, some manual steps are required to refine the results before a new processor is ready to be manufactured. Processor designers call this next level a "tape in." These first four levels all support four-week sprints.

For research purposes, we could stop at tape in, as area, energy, and performance estimates are highly accurate. However, it would be like running a long race and stopping 100 yards before the finish line because the runner can accurately predict the final time. Despite all the hard work in race preparation, the runner would miss the thrill and satisfaction of actually crossing the finish line. One advantage hardware engineers have over software engineers is they build physical things. Getting chips


back to measure, run real programs, and show to their friends and family is a great joy of hardware design.

Many researchers assume they must stop short because fabricating chips is unaffordable. When designs are small, they are surprisingly inexpensive. Architects can order 100 1-mm² chips for only $14,000. In 28 nm, 1 mm² holds millions of transistors, enough area for both a RISC-V processor and an NVDLA accelerator. The outermost level is expensive if the designer aims to build a large chip, but an architect can demonstrate many novel ideas with small chips.

Conclusion

"The darkest hour is just before the dawn." —Thomas Fuller, 1650

To benefit from the lessons of history, architects must appreciate that software innovations can also inspire architects, that raising the abstraction level of the hardware/software interface yields opportunities for innovation, and that the marketplace ultimately settles computer architecture debates. The iAPX-432 and Itanium illustrate how architecture investment can exceed returns, while the S/360, 8086, and ARM deliver high annual returns lasting decades with no end in sight.

The end of Dennard scaling and Moore's Law and the deceleration of performance gains for standard microprocessors are not problems that must be solved but facts that, recognized, offer breathtaking opportunities. High-level, domain-specific languages and architectures, freeing architects from the chains of proprietary instruction sets, along with demand from the public for improved security, will usher in a new golden age for computer architects. Aided by open source ecosystems, agilely developed chips will convincingly demonstrate advances and thereby accelerate commercial adoption. The ISA philosophy of the general-purpose processors in these chips will likely be RISC, which has stood the test of time. Expect the same rapid improvement as in the last golden age, but this time in terms of cost, energy, and security, as well as in performance.

The next decade will see a Cambrian explosion of novel computer architectures, meaning exciting times for computer architects in academia and in industry.

References
1. Beck, K., Beedle, M., Van Bennekum, A., Cockburn, A., Cunningham, W., Fowler, M., ... and Kern, J. Manifesto for Agile Software Development, 2001; https://agilemanifesto.org/
2. Bhandarkar, D. and Clark, D.W. Performance from architecture: Comparing a RISC and a CISC with similar hardware organization. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (Santa Clara, CA, Apr. 8–11). ACM Press, New York, 1991, 310–319.
3. Chaitin, G. et al. Register allocation via coloring. Computer Languages 6, 1 (Jan. 1981), 47–57.
4. Dally, W. et al. Hardware-enabled artificial intelligence. In Proceedings of the Symposia on VLSI Technology and Circuits (Honolulu, HI, June 18–22). IEEE Press, 2018, 3–6.
5. Dennard, R. et al. Design of ion-implanted MOSFETs with very small physical dimensions. IEEE Journal of Solid State Circuits 9, 5 (Oct. 1974), 256–268.
6. Emer, J. and Clark, D. A characterization of processor performance in the VAX-11/780. In Proceedings of the 11th International Symposium on Computer Architecture (Ann Arbor, MI, June). ACM Press, New York, 1984, 301–310.
7. Fisher, J. The VLIW machine: A multiprocessor for compiling scientific code. Computer 17, 7 (July 1984), 45–53.
8. Fitzpatrick, D.T., Foderaro, J.K., Katevenis, M.G., Landman, H.A., Patterson, D.A., Peek, J.B., Peshkess, Z., Séquin, C.H., Sherburne, R.W., and Van Dyke, K.S. A RISCy approach to VLSI. ACM SIGARCH Computer Architecture News 10, 1 (Jan. 1982), 28–32.
9. Flynn, M. Some computer organizations and their effectiveness. IEEE Transactions on Computers 21, 9 (Sept. 1972), 948–960.
10. Fowers, J. et al. A configurable cloud-scale DNN processor for real-time AI. In Proceedings of the 45th ACM/IEEE Annual International Symposium on Computer Architecture (Los Angeles, CA, June 2–6). IEEE, 2018, 1–14.
11. Hennessy, J. and Patterson, D. A New Golden Age for Computer Architecture. Turing Lecture delivered at the 45th ACM/IEEE Annual International Symposium on Computer Architecture (Los Angeles, CA, June 4, 2018); http://iscaconf.org/isca2018/turing_lecture.html; https://www.youtube.com/watch?v=3LVeEjsn8Ts
12. Hennessy, J., Jouppi, N., Przybylski, S., Rowen, C., Gross, T., Baskett, F., and Gill, J. MIPS: A microprocessor architecture. ACM SIGMICRO Newsletter 13, 4 (Oct. 5, 1982), 17–22.
13. Hennessy, J. and Patterson, D. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Francisco, CA, 1989.
14. Hill, M. A primer on the Meltdown and Spectre hardware security design flaws and their important implications. Computer Architecture Today blog (Feb. 15, 2018); https://www.sigarch.org/a-primer-on-the-meltdown-spectre-hardware-security-design-flaws-and-their-important-implications/
15. Hopkins, M. A critical look at IA-64: Massive resources, massive ILP, but can it deliver? Microprocessor Report 14, 2 (Feb. 7, 2000), 1–5.
16. Horowitz, M. Computing's energy problem (and what we can do about it). In Proceedings of the IEEE International Solid-State Circuits Conference Digest of Technical Papers (San Francisco, CA, Feb. 9–13). IEEE Press, 2014, 10–14.
17. Jouppi, N., Young, C., Patil, N., and Patterson, D. A domain-specific architecture for deep neural networks. Commun. ACM 61, 9 (Sept. 2018), 50–58.
18. Jouppi, N.P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., and Boyle, R. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th ACM/IEEE Annual International Symposium on Computer Architecture (Toronto, ON, Canada, June 24–28). IEEE Computer Society, 2017, 1–12.
19. Kloss, C. Nervana Engine delivers deep learning at ludicrous speed. Intel blog, May 18, 2016; https://ai.intel.com/nervana-engine-delivers-deep-learning-at-ludicrous-speed/
20. Knuth, D. The Art of Computer Programming: Fundamental Algorithms, First Edition. Addison-Wesley, Reading, MA, 1968.
21. Knuth, D. and Binstock, A. Interview with Donald Knuth. InformIT, Hoboken, NJ, 2010; http://www.informit.com/articles/article.aspx
22. Kung, H. and Leiserson, C. Systolic arrays (for VLSI). Chapter in Sparse Matrix Proceedings Vol. 1. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1979, 256–282.
23. Lee, Y., Waterman, A., Cook, H., Zimmer, B., Keller, B., Puggelli, A., ... and Chiu, P. An agile approach to building RISC-V microprocessors. IEEE Micro 36, 2 (Feb. 2016), 8–20.
24. Leiserson, C. et al. There's plenty of room at the top. To appear.
25. Metz, C. Big bets on A.I. open a new frontier for chip start-ups, too. The New York Times (Jan. 14, 2018).
26. Moore, G. Cramming more components onto integrated circuits. Electronics 38, 8 (Apr. 19, 1965), 56–59.
27. Moore, G. No exponential is forever: But 'forever' can be delayed! [semiconductor industry]. In Proceedings of the IEEE International Solid-State Circuits Conference Digest of Technical Papers (San Francisco, CA, Feb. 13). IEEE, 2003, 20–23.
28. Moore, G. Progress in digital integrated electronics. In Proceedings of the International Electron Devices Meeting (Washington, D.C., Dec.). IEEE, New York, 1975, 11–13.
29. Nvidia. Nvidia Deep Learning Accelerator (NVDLA), 2017; http://nvdla.org/
30. Patterson, D. How close is RISC-V to RISC-I? ASPIRE blog, June 19, 2017; https://aspire.eecs.berkeley.edu/2017/06/how-close-is-risc-v-to-risc-i/
31. Patterson, D. RISCy history. Computer Architecture Today blog, May 30, 2018; https://www.sigarch.org/riscy-history/
32. Patterson, D. and Waterman, A. The RISC-V Reader: An Open Architecture Atlas. Strawberry Canyon LLC, San Francisco, CA, 2017.
33. Rowen, C., Przbylski, S., Jouppi, N., Gross, T., Shott, J., and Hennessy, J. A pipelined 32b NMOS microprocessor. In Proceedings of the IEEE International Solid-State Circuits Conference Digest of Technical Papers (San Francisco, CA, Feb. 22–24). IEEE, 1984, 180–181.
34. Schwarz, M., Schwarzl, M., Lipp, M., and Gruss, D. NetSpectre: Read arbitrary memory over network. arXiv preprint, 2018; https://arxiv.org/pdf/1807.10535.pdf
35. Sherburne, R., Katevenis, M., Patterson, D., and Séquin, C. A 32b NMOS microprocessor with a large register file. In Proceedings of the IEEE International Solid-State Circuits Conference (San Francisco, CA, Feb. 22–24). IEEE Press, 1984, 168–169.
36. Thacker, C., McCreight, E., and Lampson, B. Alto: A Personal Computer. CSL-79-11, Xerox Palo Alto Research Center, Palo Alto, CA, Aug. 7, 1979; http://people.scs.carleton.ca/~soma/distos/fall2008/alto.pdf
37. Turner, P., Parseghian, P., and Linton, M. Protecting against the new 'L1TF' speculative vulnerabilities. Google blog, Aug. 14, 2018; https://cloud.google.com/blog/products/gcp/protectingagainst-the-new-l1tf-speculative-vulnerabilities
38. Van Bulck, J. et al. Foreshadow: Extracting the keys to the Intel SGX kingdom with transient out-of-order execution. In Proceedings of the 27th USENIX Security Symposium (Baltimore, MD, Aug. 15–17). USENIX Association, Berkeley, CA, 2018.
39. Wilkes, M. and Stringer, J. Micro-programming and the design of the control circuits in an electronic digital computer. Mathematical Proceedings of the Cambridge Philosophical Society 49, 2 (Apr. 1953), 230–238.
40. XLA Team. XLA – TensorFlow. Mar. 6, 2017; https://developers.googleblog.com/2017/03/xlatensorflow-compiled.html

John L. Hennessy ([email protected]) is Past-President of Stanford University, Stanford, CA, USA, and is Chairman of Alphabet Inc., Mountain View, CA, USA.

David A. Patterson ([email protected]) is the Pardee Professor of Computer Science, Emeritus at the University of California, Berkeley, CA, USA, and a Distinguished Engineer at Google, Mountain View, CA, USA.

© 2019 ACM 0001-0782/19/2 $15.00

To watch Hennessy and Patterson's full Turing Lecture, see https://www.acm.org/hennessy-patterson-turing-lecture
