<<

...... AN AGILE APPROACH TO BUILDING RISC-V MICROPROCESSORS

...... THE AUTHORS ADOPTED AN AGILE HARDWARE DEVELOPMENT METHODOLOGY FOR 11 Yunsup Lee RISC-V MICROPROCESSOR TAPE-OUTS ON 28-NM AND 45-NM CMOS PROCESSES.THIS Andrew Waterman ENABLED SMALL TEAMS TO QUICKLY BUILD ENERGY-EFFICIENT, COST-EFFECTIVE, AND Henry Cook COMPETITIVE HIGH-PERFORMANCE MICROPROCESSORS.THE AUTHORS PRESENT A CASE Brian Zimmer STUDY OF ONE PROTOTYPE FEATURING A RISC-V VECTOR MICROPROCESSOR INTEGRATED Ben Keller WITH SWITCHED-CAPACITOR DC–DC CONVERTERS AND AN ADAPTIVE CLOCK GENERATOR Alberto Puggelli IN A 28-NM, FULLY DEPLETED SILICON-ON-INSULATOR PROCESS. Jaehwa Kwak ...... The era of rapid transistor per- in software development, demarcated by the Ruzica Jevtic´ formance improvement is coming to an end, publication of The Agile Manifesto in 2001.1 increasing pressure on architects and circuit The philosophy of agile software development Stevo Bailey designers to improve energy efficiency. Mod- emphasizes individuals and interactions over ern systems on chip (SoCs) incorporate a large processes and tools, working software over Milovan Blagojevic´ and growing number of specialized hardware comprehensive documentation, customer col- units to execute specific tasks efficiently, often laboration over contract negotiation, and Pi-Feng Chiu organized into multiple dynamic voltage and responding to change over following a plan. clock frequency domains to further save In practice, the agile approach leads to small Rimas Avizienis energy under varying computational loads. teams iteratively refining a set of working-but- This growing complexity has led to a design incomplete prototypes until the end result is Brian Richards productivity crisis, stimulating development acceptable. of new tools and methodologies to enable the Inspired by the positive disruption of The Jonathan Bachrach completion of complex chip designs on sched- Agile Manifesto on software development, we ule and within budget. propose a set of principles to guide a new David Patterson We propose leveraging lessons learned agile hardware development methodology. from the software world by applying aspects Our proposal is informed by our experiences Elad Alon of the software agile development model to as a small group of researchers designing and hardware. Traditionally, software was devel- fabricating 11 processor chips in five years. Borivoje Nikolic´ oped via the waterfall development model, a Lacking the massive resources of industrial sequential process that consists of distinct design teams, we were forced to abandon Krste Asanovic´ phases that rigidly follow one another, just as standard industry practices and explore differ- is done for hardware design today. Over- ent approaches to hardware design. We detail budget, late, and abandoned software projects this approach’s benefits with a case study of were commonplace, motivating a revolution one of these chips, Raven-3, which achieved ......

8 Published by the IEEE Computer Society 0272-1732/16/$33.00 2016 IEEE groundbreaking energy efficiency by integrat- transfer level (RTL) freezes, and CPU centu- ing a novel RISC-V vector architecture with ries of simulations in an attempt to achieve a efficient on-chip DC–DC converters. Taken single viable design point. In our agile hard- together, our experiences developing these aca- ware methodology, we first push a trivial pro- demic prototype chips make a powerful argu- totype with a minimal working feature set all ment for the promise of agile hardware design the way through the toolflow to a point where in the post-Moore’s law era. it could be taped out for fabrication. We refer to this tape-out-ready design as a tape-in. An Agile Hardware Manifesto Then we begin adding features iteratively, moving from one fabricatable tape-in to the Borrowing heavily from the agile software next as we add features. The agile hardware development manifesto, we propose the fol- methodology will always have an available lowing principles for hardware design: tape-outcandidatewithsomesubsetoffea- Incomplete, fabricatable prototypes tures for any tape-out deadline, whereas the over fully featured models. conventional waterfall approach is in danger Collaborative, flexible teams over rigid of grossly overshooting any tape-out deadline silos. because of issues at any stage of the process. Improvement of tools and generators Although the agile methodology provides over improvement of the instance. some benefits in terms of verification effort, it Response to change over following a particularly shines at validation; agile designs plan. aremorelikelytomeetenergyandperform- ance expectations using the iterative process. Here, we explain the importance of these Unlike conventional approaches that use principles for building hardware and contrast high-level architectural models of the whole them with the original principles of agile soft- design for validation with the intent to later ware development. refine these models into an implementation, we emphasize validation using a full RTL Incomplete, Fabricatable Prototypes over Fully implementation of each incomplete prototype Featured Models with the intent to add features if there is time. We believe the primary benefit of adopting Unlike high-level architectural models, the the agile model for hardware design is that it actual RTL is guaranteed to be cycle accurate. lets us drastically reduce the cost of verifica- Furthermore, the actual RTL design can be tion and validation by iterating through pushed through the entire very large-scale many working prototypes of our designs. In integration (VLSI) toolflow to give early feed- this article, we use the standard definition of back on back-end physical design issues that verification as testing that determines whether will affect the quality of results (QoR), includ- each component and the assembly of compo- ing cycle time, area, and power. For example, nents correctly meets its specification (“Are Figure 1b shows a feature F1 being reworked we building the thing right?”). In hardware into feature F10 after validation showed that it design, the term validation is usually nar- would not meet performance or QoR goals, rowly applied to postsilicon testing to ensure whereas feature F3 was dropped after valida- that manufactured parts operate within tion showed that it exceeded physical design design parameters. We use the broader and, budgets or did not add sufficient value to the to our minds, more useful definition of vali- end system. These decisions on F1 and F3 can dation from the software world, in which val- be informed by the results of the attempted idation ensures that the product serves its physical design before completing the final intended purposes (“Are we building the feature F4. right thing?”). Figure 1c illustrates our iterative approach Figure 1 contrasts the agile hardware de- to validation of a single prototype. Each velopment model with the waterfall model. circle’s circumference represents the relative The waterfall model, shown in Figure 1a, time and effort it takes to begin validating a relies on Gantt charts, high-level architecture design using a particular methodology. models, rigid processes such as register- Although the latency and cost of these ...... MARCH/APRIL 2016 9 ...... HOT CHIPS

F1 F1 F2 F3 F4 F2 Specification F1F2 F1’ F3 F4 F3 Design F1F2 F1’ F3 F4 F4 Implementation F1F2 F1’ F3 F4 Design Tape-out Validation Verification Specification Verification F1F2 F1’ F3 F4 Implementation Tape-in F1/F2 F1’/F2/F3 F1’/F2/F4

Validation F1/F2 F1’/F2/F3 F1’/F2/F4

Tape-out F1’/F2/F4 (a) (b)

Big tape-out

Small tape-out Tape-in ASIC flow FPGA C++

Specification + design + implementation (c)

Figure 1. Contrasting the agile and waterfall models of hardware design. The labels FN represent various desired features of the design. (a) The waterfall model steps all features through each activity sequentially, producing a tape-out candidate only when all features are complete. (b) The agile model adds features incrementally, resulting in incremental tape-out candidates as individual features are completed, reworked, or abandoned. (c) As validation progresses, designs are subjected to lengthier and more accurate evaluation methodologies.

prototype-modeling efforts increase toward Collaborative, Flexible Teams over Rigid Silos the outer circles, our confidence in the valida- Traditional hardware design teams usually tion results produced increases in tandem. are organized around the waterfall model, Using fabricatable prototypes increases valida- with engineers assigned to particular func- tion bandwidth, because a complete RTL tions, such as architecture specification, design can be mapped to field-programmable microarchitecture design, RTL design, veri- gate arrays (FPGAs) to run end-application fication, physical design, and validation. software stacks orders of magnitude faster This specialization of skills is more extreme than with software simulators. In agile hard- than for pure software development, as ware development, the FPGA models of itera- reflected in the specificity of designers’ job tive prototype RTL designs (together with titles (architect, hardware engineer, or veri- accompanying QoR numbers from the VLSI fication engineer). Sometimes, entire func- toolflow) fulfill the same function as working tions, such as back-end physical design, are prototypes in agile software development, pro- even outsourced to different companies. As viding a way for customers to give early and the design progresses through these func- frequent feedback to validate design choices. tional stages, extensive effort is required This enables customer collaboration as envi- to communicate design intent for each sioned in the original agile software manifesto. block, and consequently to understand the ...... 10 IEEE MICRO documentation received from the preceding tools hinder development of truly reusable stage. Misunderstandings can lead to subtle hardware components and lead to a focus on errors that require extensive verification to building a single instance instead of design- prevent, and QoR suffers because no engi- ing components that can be easily reused in neer views the whole flow. End-to-end future designs. design iterations are too expensive to con- An agile hardware methodology relies on template. Considerable effort is required better hardware description languages (HDLs) to push high-level changes down through to enable reuse. Instead of constructing an the implementation hierarchy, particularly optimized instance, engineers develop highly across company boundaries. This leads to parameterized generators that can be used to innovation-stunting practices such as the automatically explore a design space or to RTL freeze, in which only show-stopping support future use in a different design. bugs are allowed to unfreeze a design. In Better reuse is also the primary way in which addition, balancing the workload across the an agile hardware methodology can reduce different stages is difficult because engineers verification costs, because the verification focused in one functional area often lack infrastructure of these generators is also port- the training and experience to be effective able across designs. in other roles. An agile hardware methodology also strives In the agile hardware model, engineers to eliminate manual intervention in the chip’s work in collaborative, flexible teams that can build process. Replacing or fixing the many drive a feature through all implementation commercial CAD tools required to complete stages, avoiding most of the documentation a chip design is not practical, but most of overhead required to map a feature to the a chip’s flow can be effectively automated hardware. Each team can iterate on the high- with sufficient effort. Perhaps the most level architectural design with full visibility labor-intensive part of a modern chip flow is into low-level physical implementation and dealing with physical design issues for a new can validate system-level software impacts on technology node. However, most new chip an intermediate prototype. Figure 1b shows designs are in older technologies, because how multiple teams can work on different only a few products can justify the cost of features in a pipelined fashion, each iterating using an advanced node. Also, as technology through multiple levels of abstraction (as scaling continues to slow, back-end flows shown in Figure 1c). A given tape-in will and CAD tools will have time to mature accumulate some small number of feature and should become more automatable, even updates and integrate these for whole-system for the most advanced nodes. The end of validation. Engineers naturally develop a scaling will also coincide with a greater complete vertical skill set, and the focus of need for a variety of specialized designs, project management is on prioritizing fea- where the agile model should show greater tures for implementation, not communica- benefit. tion between silos. Our emphasis on tools and generators is not at odds with the agile software develop- Improvement of Tools and Generators over ment manifesto, which emphasizes collabora- Improvement of the Instance tion over tools. Our proposed methodology Modern software systems could not be built also values collaboration more than specific economically without the extreme reuse tools, but we feel that modern hardware enabled by extensive software libraries writ- design is far less automated and exhibits far ten in modern programming languages, or less reuse than modern software development without functioning compilers and an auto- (or even software development in 2001). mated build process. In contrast, most com- Accordingly, there is a need to emphasize the mercial hardware design flows rely on rather importance of reuse and automation enabled primitive languages to describe hardware, by new tools in the hardware design sphere, and frequent manual intervention on tool because few modern hardware design teams outputs to work around tool bugs and long emphasize these universally accepted software tool runtimes. The poor languages and buggy principles...... MARCH/APRIL 2016 11 ...... HOT CHIPS

Response to Change over Following a Plan ment. RISC-V is a modular architecture, with This principle mirrors the original manifesto. variants covering 32-, 64-, and 128-bit address Unlike software, a chip design will ultimately spaces. The base integer instruction set is lean, be frozen into a mask set and mass-manufac- requiring fewer than 50 user-level hardware tured, with no scope for frequent updates. instructions to support a full modern software However, even over the timeframe of a single stack, which enables microprocessor designers chip’s development, requirements will change toquicklybringupfullyfunctional prototypes owing to market shifts, customer feature and add additional features incrementally. In additions, or standards-body deliberations. addition to simplifying the implementation of In agile design, change can result from valida- new microarchitectures, the RISC-V design tion testing (as opposed to the waterfall provides an ideal base for custom accelerators. model, in which designs will only change in Not only can accelerators reuse common con- response to verification problems). trol-processor implementations, they can share An agile hardware development process a single software stack, including the compiler embraces continual change in chip specifica- toolchain and operating system binaries. This tions as a means to enhance competitiveness. dramatically reduces the cost of designing and Every incremental fabricatable prototype, not bringing up custom accelerators. only the most recent, becomes a starting Further information on RISC-V is avail- point for the changed set of design require- able from the nonprofit RISC-V Foundation, ments, with any change treated as a reprioriti- which was incorporated in August 2015 to zation of features to implement, drop, or help coordinate, standardize, protect, and pro- 3 rework. By first implementing features that mote the RISC-Vecosystem. Afreeandopen are unlikely to change, designers minimize ISA is a critical ingredient in an agile hardware the effort lost to late changes. methodology, because having an active and diverse community of open-source contribu- Implementing the Agile Hardware tors amortizes the overhead of maintaining and updating the software ecosystem, which Methodology makes it possible for smaller teams to focus on Here, we describe the steps we took toward developing custom hardware.4 AfreeISAis implementing an agile hardware design proc- also a prerequisite for shared open-source pro- ess. All of our processors, both general pur- cessor implementations, which can act as a pose and specialized, were built on the free base for further customization.5 and open RISC-V instruction architecture (ISA). We developed our RTL source code Chisel using Chisel (Constructing Hardware in a To facilitate agile hardware development by Scala-Embedded Language), an open-source improving designer productivity, we devel- hardware construction language, and ex- oped Chisel6 as a domain-specific extension to pressed the code as libraries of highly parame- the Scala . Chisel is terized hardware generators. We developed a not a high-level synthesis tool in which hard- highly automated back-end design flow to ware is inferred from Scala code. Rather, reduce manual effort in pushing designs Chisel is intended to be a substrate that pro- through to layout. Finally, we present the vides a Scala abstraction of primitive hardware resulting chips from our agile hardware design components, such as registers, muxes, and process. wires. Any Scala program whose execution generates a graph of such hardware compo- RISC-V nents provides a blueprint to fabricate hard- RISC-V is a free, open, and extensible ISA ware designs; the Chisel compiler translates a that forms the basis of all of our program- graph of hardware components into a fast, mable processor designs.2 Unlike many earlier cycle-accurate, bit-accurate Cþþ software efforts that designed open processor cores, simulator, or low-level synthesizable Verilog RISC-V is an ISA specification, intended to that maps to standard FPGA or ASIC flows. allow many different hardware implementa- Because Chisel is embedded in Scala, hard- tions to leverage common software develop- ware developers can tap into Scala’s modern ...... 12 IEEE MICRO masters(0) masters(1) AHB-Lite crossbar masters(0,1) master AHB-Lite bus AHB-Lite bus AHB-Lite bus

AHB-Lite crossbar slaves(0,1,2)

ins(0,1) slaves(0,1,2) Arbiter Arbiter Arbiter AHB-Lite slave mux Arbiter slave mux slave mux slave mux

out slaves(0) slaves(1) slaves(2) (a)

A class AHBXbar(nMasters: Int, nSlaves: Int) extends Module B { C val io = new Bundle { D val masters = Vec(new AHBMasterIO, nMasters).flip E val slaves = Vec(new AHBSlaveIO, nSlaves).flip F } G H val buses = List.fill(nMasters){Module(new AHBBus(nSlaves))} I val muxes = List.fill(nSlaves){Module(new AHBSlaveMux(nMasters))} J K (buses.map(b => b.io.master) zip io.masters) foreach { case (b, m) => b <> m } L (muxes.map(m => m.io.out) zip io.slaves) foreach { case (x, s) => x <> s } M for (m <- 0 until nMasters; s <- 0 until nSlaves) yield { N buses(m).io.slaves(s) <> muxes(s).io.ins(m) O } P } (b)

Figure 2. Using functional programming to write a parameterized AHB-Lite crossbar switch in Chisel. (a) Block diagrams. (b) Chisel source code. The code is short and easy to verify by inspection, but can support arbitrary numbers of masters and slave ports. programming language features—such as terized hardware generators rather than dis- object-oriented programming, functional pro- crete instances of individual blocks. This in gramming, parameterized types, abstract data turn improves code reuse within a given design types, operator overloading, and type infer- and across generations of design iterations. ence—to improve designer productivity by To see how these language features inter- raising the abstraction level and increasing act in real designs, we describe a crossbar that code reuse. Furthermore, metaprogramming, connects two AHB-Lite master ports to three code generation, and hardware design tasks are AHB-Lite slave ports in Figure 2a. In the fig- all expressed in the same source language, ure, the high-level block diagram is on the which encourages developers to write parame- left, the two building blocks (AHB-Lite buses ...... MARCH/APRIL 2016 13 ...... HOT CHIPS

idiom result A (1,2,3) map { n => n + 1 } (2,3,4) B (1,2,3) zip (a,b,c) ((1,a),(2,b),(3,c)) C ((1,a),(2,b),(3,c)) map { case (left,right) => left } (1,2,3) D (1,2,3) foreach { n => print(n) } 123 E for (x <- 1 to 3; y <- 1 to 3) yield (x,y) (1,1),(1,2),(1,3),(2,1),(2,2),(2,3), (3,1),(3,2),(3,3)

Figure 3. Generic functional programming idioms to help understand the crossbar example.

and AHB-Lite slave muxes) are in the mid- VHDL, would be difficult to generalize into a dle, and the final design is shown on the parameterizable crossbar generator. This is just right. To help understand the Chisel code, asmallexampleofChisel’spowertobring Figure 3 shows some generic functional pro- modern programming language constructs to gramming idioms. Idiom A maps a list of hardware development, increasing code main- numbers to an expression that adds 1 to tainability, facilitating reuse, and enhancing them. Idiom B zips two lists together. Idiom designer productivity. C iterates over a list of tuples, uses Scala’s case statement to provide names of elements, and Rocket Chip Generator then operates on those names. Idiom D iter- Chisel’s first-class support for object orienta- ates over a list when the output values are not tion and metaprogramming lets hardware collected. Idiom E is a for-comprehension with designers write generators, rather than indi- a yield statement. vidual instances of designs, which in turn Figure 2b shows the Chisel code that encourages the development of families of builds the crossbar. Line A declares the customizable designs. By encoding micro- AHBXbar module with two parameters: architectural and domain knowledge in these number of master ports (nMasters)and generators, we can quickly create different slave ports (nSlaves). Lines C through F chip instances customized for particular declare the module’s I/O ports. Lines H and I design goals and constraints.7 Construction instantiate collections of AHB-Lite buses and of hardware generators requires support for AHB-Lite slave muxes. Lines K through N reconfiguring individual components based connect the building blocks together. Note on the context in which they are deployed; that the Chisel operator <> connects mod- thus, a particular focus with Chisel is to ule I/O ports together. Line K extracts the improve on the limited module parameter- master I/O port of each bus, zips it to the cor- ization facilities of traditional HDLs. responding master I/O port of the crossbar The Rocket chip generator is written in itself, and then connects them together. Line Chisel and constructs a RISC-V-based plat- L does the same thing for each slave mux out- form. The generator consists of a collection put and each slave I/O port of the crossbar. of parameterized chip-building libraries that Lines M and N use for-comprehension to we can use to generate different SoC variants. generate the cross-product of all port combi- By standardizing the interfaces used to con- nations, and use each pair to connect a spe- nect different libraries’ generators to one cific slave I/O port of a specific bus to the another, we have created a plug-and-play matching input of the corresponding mux. environment in which it is trivial to swap out This code’s brevity ensures that it is easily substantial components of the design simply verified by inspection. Although the example by changing configuration files, leaving the shows two buses and three slave muxes, these hardware source code untouched. We can 10 lines of code would be identical for any also test the output of individual generators number of buses and slave muxes. In contrast, and perform integration tests on the whole implementations of this crossbar module writ- design, where the tests are also parameterized, ten in traditional HDLs, such as Verilog or so as to exercise the entire design under test...... 14 IEEE MICRO Figure 4 presents the collection of library generators and their interfaces within the RocketTile RocketTile Rocket chip generator. These generators can RoCC RoCC be parameterized and composed into many L1I$ L1I$ Rocket accel. Rocket accel. various SoC designs. Their current capabil- FPU FPU ities include the following: L1D$ L1D$ Core: the Rocket scalar core generator

with an optional floating-point unit, A Rocket configurable pipelining of functional L1 network units, and customizable branch- B Cache prediction structures. Caches: a family of cache and transla- C RoCC

tion look-aside buffer generators with L2$ bank D Tile configurable sizes, associativities, and replacement policies. E TileLink Rocket Custom Coprocessor: the RoCC interface, a template for application- L2 network F Periph. specific coprocessors that can expose their own parameters. TileLink/AXI4 Tile: a tile-generator template for bridge cache-coherent tiles. The number and type of cores and accelerators are AXI4 crossbar configurable, as is the organization of private caches. High- DRAM AHB and APB speed TileLink: a generator for networks of controller peripherals cache-coherent agents and the associ- I/O device ated cache controllers. Configuration options include the number of tiles, Figure 4. The Rocket chip generator comprises the following coherence policy, presence of shared subcomponents: (a) a core generator, (b) cache generator, (c) Rocket backing storage, and implementation Custom Coprocessor-compatible coprocessor generator, (d) tile generator, of underlying physical networks. (e) TileLink generator, and (f) peripherals. These generators can be Peripherals: generators for AXI4-/ parameterized and composed into many various system-on-chip designs. AHB-Lite-/APB-compatible buses and various converters and controllers. time. The deployment process for hardware Physical Design Flow is significantly more time-consuming than We frequently translate Chisel’s Verilog output for software. Using an initial hardware into a physical design through an ASIC CAD description to generate a GDS that is electri- toolflow.Wehaveautomatedthisflowandopti- cally and logically correct can require many mized it to generate industry-standard GDS- iterations. Because each iteration can take format mask files without designer intervention. many hours to run to completion, the process Automatic nightly regressions provide the could require weeks or months for larger designer with regular feedback on cycle time, designs. To avoid the productivity loss of area, and energy consumption, and they also these long-latency iterations, we initially detect physical design problems early in the focus on smaller designs, from which GDS design cycle. Automatic verification and report- can be generated more rapidly. Iterating on ing on the quality of results allow more points in these smaller-scale tape-ins has proven valua- the design space to be evaluated, and reduce the ble, because major problems are uncovered barriertoimplementingmoresignificantchanges much earlier in the design process, and even atlaterstagesinthedevelopmentprocess. more so when we target a new process tech- One major hindrance to agile methodol- nology. Additionally, the parameterizable ogy for hardware design is CAD tool run- nature of hardware generators reduces the ...... MARCH/APRIL 2016 15 ...... HOT CHIPS

Raven-3 Raven-1 Raven-2

Raven-4 Swerve May Apr. Aug. Feb. JulySept. Mar. Nov. Mar. Apr. 20112012 2013 2014 2015

EOS22 EOS24

EOS14 EOS16 EOS18 EOS20 Figure 5. Recent UC Berkeley tape-outs using RISC-V processors for three lines of research. The Raven series investigated low-voltage reliable operation with on-chip DC–DC converters, the EOS series developed monolithically integrated silicon photonic links, and Swerve experimented with dynamic redundancy techniques to handle low-voltage faults.

effort to scale up to larger designs later in the bine a 64-bit RISC-V vector microprocessor design process. Regardless of design size, we with on-chip switched-capacitor DC–DC ensure that each tape-in passes static timing converters and adaptive clocking. The 45-nm analysis, is functionally equivalent to the chips integrate a 64-bit dual-core RISC-V RTL model, and has only a small number of vector processor with monolithically inte- design-rule violations. Together, these prop- grated silicon photonic links.8,9 In total, we erties ensure that the design is indeed tape- taped out four Raven chips on STMicroelec- out ready. tronics’ 28-nm fully depleted silicon-on- As a particular tape-out deadline ap- insulator (FD-SOI) process, six photonics proaches, we evaluate whether to tape out the chips on IBM’s 45-nm SOI process, and one most recently taped-in design, usually on the Swerve chip on TSMC’s 28-nm process. The basis of whether the set of new features vali- Rocket chip generator forms the shared code dated by the tape-in is sufficiently large. If we base for these 11 distinct systems; we incor- decide to tape out, we clean up the remaining porated the best ideas from each design back design-rule violations and run final checks on into the code base, ensuring maximal reuse the resulting GDS. The full cycle of taping even though the three distinct families of out the GDS and testing the resulting chip is chips were specialized differently to test out more involved than a tape-in, but qualita- distinct research ideas. tively it is just another, longer, iteration in the agile methodology. Thus, we focus on pro- ducing lineages of increasingly complicated Case Study: RISC-V Vector Microprocessor functioning chips, incorporating feedback with On-chip DC–DC Converters from the previous chip into the next in the We applied our agile design methodology to series. We believe that this rapid iterative Raven-3, a 28-nm microprocessor that process is essential to improve the productiv- integrates switched-capacitor DC–DC con- ity of hardware designers and enable iterative verters, adaptive clocking circuitry, and low- design-space exploration of alternative system voltage static RAM macros with a RISC-V microarchitectures. vector processor instantiated by the Rocket chip generator.10 The successful development Silicon Results of the Raven-3 prototype demonstrates how Figure 5 presents a timeline of UC Berkeley architecture and circuit research was sup- RISC-V microprocessor tape-outs that we ported by the combination of the freely created using earlier versions of the Rocket extensible RISC-V ISA, the parameterizable chip generator. The 28-nm Raven chips com- Chisel-based hardware implementation, and ...... 16 IEEE MICRO To off-chip oscilloscope

2 RISC-V scalar core CORE (1.19 mm ) 1.8 V ... Scalar Vector accelerator V out RF FPU 1.0 V Vector issue unit int ... (16-Kbye vector RF uses eight custom 8T SRAM macros) 24 switched-capacitor Shared ... DC–DC unit cells functional units int int int int int 64-bit integer multiplier Crossbar

toggle Single-precision FMA Double-precision FMA Vector memory unit

+ Vref FSM – 16-Kbyte scalar 32-Kbyte shared 8-Kbyte vector I-cache D-cache I-cache DC–DC controller (Custom 8T (Custom 8T (Custom 8T SRAM macros) SRAM macros) SRAM macros) 2-Ghz ref clk Arbiter

Adaptive clock Asynchronous FIFO 1.0 V generator between domains

Wire-bonded chip-on-board Uncore

To/from off-chip FPGA FSB and DRAM

Figure 6. Raven-3 system-level block diagram. A RISC-V processor core with a vector coprocessor runs with varying voltage and frequency generated by on-chip switched-capacitor DC–DC converters and an adaptive clock generator.

fast feedback via iteration and automation as while requiring no off-chip passives, which part of the agile methodology. greatly simplifies package design.10 The clock Figure 6 shows the Raven-3 system-level generator selects clock edges from a tunable block diagram. The Raven-3 microprocessor replica circuit powered by the rippling core instantiates the left RocketTile in Figure 4 voltage that mimics the processor’s critical with the Hwacha vector unit implemented as path delay, and it adapts the clock frequency a RoCC accelerator. We did not include an to the instantaneous voltage.12 L2 cache in this design iteration, leaving this The design of Raven-3 benefitted greatly feature for implementation in future tape-ins from the agile hardware design methodology. and tape-outs. The chip consists of two volt- By aspiring to build an incomplete prototype age domains: the varying core domain and (rather than, for instance, a many-core design the fixed uncore domain. The core domain implementing these technologies), we dra- connects to the output of fully integrated matically improved the odds of project suc- noninterleaved switched-capacitor DC–DC cess. The unique integration of the power converters11 and powers the processor and its delivery, clocking, and processor systems was L1 caches. The fixed 1-V domain powers the only possible because the design team uncore. The DC–DC converters generate broadly shared expertise about all aspects of four voltage modes between 0.5 and 1 V the project. The processor RTL could largely from 1 and 1.8 V supplies; they achieve bet- be reused from previous tape-outs, with only ter than 80 percent conversion efficiency, incremental improvements and configuration ...... MARCH/APRIL 2016 17 ...... HOT CHIPS

Vector Vector RF VI$ RF VI$

SC DC-DC SC DC-DC Scalar core + Scalar core + vector accelerator vector accelerator

D$ I$ D$ I$

BIST BIST SC-DCDC Controller SC-DCDC Controller Unit cell Unit cell Adaptive clock Adaptive clock

(a) (b) (c) (d)

Figure 7. Raven-3 implementation: the (a) floorplan, (b) micrograph of the resulting chip, (c) wire-bonded chip on the package board, and (d) test setup.

changes required to validate it for the particu- than 20 ns. Raven-3 runs at a maximum lar feature set desired. As physical design on clock frequency of 961 MHz at 1 V, consum- partial feature sets gave incremental results, ing 173 mW, and at a minimum voltage of we were able to adjust design features to 0.45 V at 93 MHz, consuming 8 mW. improve feasibility and QoR. Raven-3 achieves a maximum energy effi- The agile hardware design methodology ciency of 27 and 34 Gflops per watt while was especially important in this design due to running a double-precision matrix-multipli- the complexity and risk introduced by the cation kernel on the vector accelerator with custom voltage conversion and clocking, as and without the switched-capacitor DC–DC illustrated by the following example. Tape- converters, respectively. These results validate ins of a simple digital system with a single the agile approach to hardware design. DC–DC unit cell were small enough to sim- ur agile hardware development meth- ulate at the transistor level, and these simula- odology played an important role in tions found a critical bug in the DC–DC O successfully taping out 11 RISC-V micro- controller. These transistor-level simulations processors in five years at UC Berkeley. With showed that a large current event could cause a combination of our agile methodology, the output voltage to remain below the Chisel, RISC-V, and the Rocket chip genera- lower-bound reference voltage even after the tor, a small team of graduate students at UC converter switched. A bug in the controller Berkeley was able to design multiple micro- would cause the converter to stop switching processors that are competitive with indus- and drop the processor voltage to zero for trial designs. Although there is much work this case. We believe this bug, which was left to do to show that this approach can scale fixed early in the design process by the addi- up to the largest SoCs, we hope these projects tion of a simple state machine in the lower- serve as an opportunity to shed some light on bound controller, would have been found inefficiencies in the current design process, to very late (if at all) with waterfall design rethink how modern SoCs are built, and to approaches. revitalize hardware design. MICRO Figure 7 shows the floorplan and micro- graph of the resulting chip, the wire-bonded chip on the package board, and the test setup. Acknowledgments The resulting 2:37-mm2 Raven chip is fully We thank Tom Burd, James Dunn, Olivier functional and boots and executes Thomas, and Andrei Vladimirescu for their compiled scalar and vector application code contributions and STMicroelectronics for with the adaptive clock generator at all oper- the fabrication donation. This work was ating modes, while the DC–DC converter funded in part by BWRC, ASPIRE, DARPA can transition between voltage modes in less PERFECT Award number HR0011-12-2- ...... 18 IEEE MICRO 0016, STARnet C-FAR, Intel ARO, AMD, Yunsup Lee is a PhD candidate in the SRC/TxACE, Marie Curie FP7, NSF Department of Electrical Engineering and GRFP,and an graduate fellowship. Computer Sciences at the University of California, Berkeley. He received an MS in ...... computer science from the University of References California, Berkeley. Contact him at 1. K. Beck et al., The Agile Manifesto, tech. [email protected]. report, Agile Alliance, 2001. 2. A. Waterman et al., The RISC-V Instruction Andrew Waterman is a PhD candidate in Set Manual, Volume I: User-Level ISA, Version the Department of Electrical Engineering 2.0, tech. report UCB/EECS-2014-54, EECS and Computer Sciences at the University of Dept., Univ. California, Berkeley, May 2014. California, Berkeley. He received an MS in computer science from the University of 3. R. Merritt, “, HP, Oracle Join RISC- California, Berkeley. Contact him at V,” EE Times, 28 Dec. 2015. [email protected]. 4. V. Patil et al., “Out of Order Floating Point Coprocessor for RISC-V ISA,” Proc. 19th Henry Cook is a PhD candidate in the Int’l Symp. VLSI Design and Test, 2015; Department of Electrical Engineering and doi:10.1109/ISVDAT.2015.7208116. Computer Sciences at the University of 5. K. Asanovic´ and D. Patterson, “The Case for California, Berkeley. He received an MS in Open Instruction Sets,” Microprocessor computer science from the University of Report, Aug. 2014. California, Berkeley. Contact him at [email protected]. 6. J. Bachrach et al., “Chisel: Constructing Hardware in a Scala Embedded Language,” Brian Zimmer is a research scientist in the Proc. Design Automation Conf., 2012, pp. Circuits Research Group at Nvidia. He 1216–1225. received a PhD in electrical engineering and 7. O. Shacham et al., “Rethinking Digital Design: computer science from the University of Why Design Must Change,” IEEE Micro,vol. California, Berkeley, where he performed 30, no. 6, 2010, pp. 9–24. the research for this article. Contact him at 8. Y. Lee et al., “A 45nm 1.3GHz 16.7 Double- brianzimmer@.com. Precision GFLOPS/W RISC-V Processor with Vector Accelerators,” Proc. 40th Euro- Ben Keller is a PhD student in the pean Solid-State Circuits Conf., 2014, pp. Department of Electrical Engineering and 199–202. Computer Sciences at the University of 9. C. Sun et al., “Single-Chip Microprocessor California, Berkeley. He received an MS in that Communicates Directly Using Light,” electrical engineering from the University of Nature, vol. 528, 2015, pp. 534–538. California, Berkeley. Contact him at [email protected]. 10. B. Zimmer et al., “A RISC-V Vector Pro- cessor with Tightly-Integrated Switched- Alberto Puggelli is the director of technol- Capacitor DC–DC Converters in 28nm ogy at Lion Semiconductor. He received a FDSOI,” Proc. IEEE Symp. VLSI Circuits, PhD in electrical engineering from the Uni- 2015, pp. C316–C317. versity of California, Berkeley, where he 11. H.-P. Le, S. Sanders, and E. Alon, “Design performed the research for this article. Techniques for Fully Integrated Switched- Contact him at alberto.puggelli@gmail. Capacitor DC–DC Converters,” IEEE J. com. Solid-State Circuits, vol. 46, no. 9, 2011, pp. 2120–2131. Jaehwa Kwak is a PhD candidate in the 12. J. Kwak and B. Nikolic´, “A 550-2260MHz Self- Department of Electrical Engineering and Adjustable Clock Generator in 28nm FDSOI,” Computer Science at the University of Proc. IEEE Asian Solid-State Circuits Conf., California, Berkeley. He received an MS in 2015; doi:10.1109/ASSCC.2015.7387471. electrical engineering and computer sciences ...... MARCH/APRIL 2016 19 ...... HOT CHIPS

from Seoul National University. Contact science from the University of California, him at [email protected]. Berkeley. Contact him at richards@eecs. berkeley.edu. Ruzica Jevtic´ is an assistant professor at the Universidad de Antonio de Nebrija in Jonathan Bachrach is an adjunct assistant Madrid. She received a PhD in electrical professor in the Department of Electrical engineering from Technical University of Engineering and Computer Sciences at the Madrid and was a postdoctoral fellow at the University of California, Berkeley. He University of California, Berkeley, where received a PhD in computer science from she performed the research for this article. the University of Massachusetts, Amherst. Contact her at [email protected]. Contact him at [email protected].

Stevo Bailey is a PhD student in the David Patterson is the Pardee Professor of Department of Electrical Engineering and Computer Science at the University of Cali- Computer Science at the University of fornia, Berkeley. He received a PhD in com- California, Berkeley. He received an MS from puter science from the University of Califor- the University of California, Berkeley. Con- nia, Los Angeles. Contact him at pattrsn@ tact him at [email protected]. eecs.berkeley.edu.

Milovan Blagojevic´ is a PhD student in a Elad Alon is an associate professor in the CIFRE program at the Berkeley Wireless Department of Electrical Engineering and Research Center, STMicroelectronics, and Computer Sciences at the University of Institut Superieur d’Electronique de Paris. California, Berkeley, and a codirector of the He received an MSc in electrical engineering Berkeley Wireless Research Center. He from the University of Belgrade, Serbia. received a PhD in electrical engineering Contact him at [email protected]. from Stanford University. Contact him at [email protected]. Pi-Feng Chiu is a PhD student in the Department of Electrical Engineering and Borivoje Nikolic´ is the National Semicon- Computer Sciences at the University of ductor Distinguished Professor of Engineer- California, Berkeley. She received an MS in ing at the University of California, Berkeley. electrical engineering from the National He received a PhD in electrical and com- Tsing Hua University. Contact her at puter engineering from the University of [email protected]. California, Davis. Contact him at bora@ eecs.berkeley.edu. Rimas Avizienis is a PhD candidate in the Department of Electrical Engineering and Krste Asanovic´ is a professor in the Depart- Computer Sciences at the University of ment of Electrical Engineering and Com- California, Berkeley. He received an MS in puter Sciences at the University of California, computer science from the University of Berkeley. He received a PhD in computer California, Berkeley. Contact him at science from the University of California, [email protected]. Berkeley. Contact him at [email protected].

Brian Richards is a research staff engineer at the University of California, Berkeley, and a founding member of the Berkeley Wireless Research Center. He received an MS in electrical engineering and computer

...... 20 IEEE MICRO