
IEEE Micro
May/June 2017, Volume 37, Number 3
The magazine for chip and silicon systems designers
Top Picks from the 2016 Computer Architecture Conferences
Reflections from Uri Weiser, p. 126
www.computer.org/micro

Features

6 Guest Editors' Introduction: Top Picks from the 2016 Computer Architecture Conferences, by Aamer Jaleel and Moinuddin Qureshi
12 Using Dataflow to Optimize Energy Efficiency of Deep Neural Network Accelerators, by Yu-Hsin Chen, Joel Emer, and Vivienne Sze
22 The Memristive Boltzmann Machines, by Mahdi Nazm Bojnordi and Engin Ipek
30 Analog Computing in a Modern Context: A Linear Algebra Accelerator Case Study, by Yipeng Huang, Ning Guo, Mingoo Seok, Yannis Tsividis, and Simha Sethumadhavan
40 Domain Specialization Is Generally Unnecessary for Accelerators, by Tony Nowatzki, Vinay Gangadhar, Karthikeyan Sankaralingam, and Greg Wright
52 Configurable Clouds, by Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Daniel Firestone, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger
62 Specializing a Planet's Computation: ASIC Clouds, by Moein Khazraee, Luis Vega Gutierrez, Ikuo Magaki, and Michael Bedford Taylor
70 DRAF: A Low-Power DRAM-Based Reconfigurable Acceleration Fabric, by Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna T. Malladi, Hongzhong Zheng, Bob Brennan, and Christos Kozyrakis
80 Agile Paging for Efficient Memory Virtualization, by Jayneel Gandhi, Mark D. Hill, and Michael M. Swift
88 Transistency Models: Memory Ordering at the Hardware–OS Interface, by Daniel Lustig, Geet Sethi, Abhishek Bhattacharjee, and Margaret Martonosi
98 Toward a DNA-Based Archival Storage System, by James Bornholt, Randolph Lopez, Douglas M. Carmean, Luis Ceze, Georg Seelig, and Karin Strauss
106 Ti-states: Power Management in Active Timing Margin Processors, by Yazhou Zu, Wei Huang, Indrani Paul, and Vijay Janapa Reddi
116 An Energy-Aware Debugger for Intermittently Powered Systems, by Alexei Colin, Graham Harvey, Alanson P. Sample, and Brandon Lucia
Departments

4 From the Editor in Chief: Thoughts on the Top Picks Selections, by Lieven Eeckhout
126 Awards: Insights from the 2016 Eckert-Mauchly Award Recipient, by Uri Weiser
130 Micro Economics: Two Sides to Scale, by Shane Greenstein

Computer Society Information, p. 3
Advertising/Product Index, p. 61

From the Editor in Chief

Thoughts on the Top Picks Selections

Lieven Eeckhout, Ghent University

The May/June issue of IEEE Micro traditionally features a selection of articles called Top Picks that have the potential to influence the work of computer architects for the near future. A selection committee of experts selects these articles from the previous year's computer architecture conferences; the selection criteria are novelty and potential for long-term impact. Any paper published in the top computer architecture conferences of 2016 was eligible, which makes the job of the selection committee both a challenge and a pleasure. Selections are based on the original conference paper and a three-page write-up that summarizes the paper's key contributions and potential impact. We received a record number of 113 submissions this year. Aamer Jaleel and Moinuddin Qureshi chaired the selection committee, which comprised 33 experts. I wholeheartedly thank them and their committee for having done such a great job. As they note in the Guest Editors' Introduction, Aamer and Moin introduced a novel two-phase review procedure. Four committee members reviewed each paper during the first round. A subset of the papers was selected to move to the second round based on the reviewers' scores and online discussion of the first round. Six more committee members reviewed each paper during the second round; second-round papers thus received a total of 10 reviews! This formed the basic input for the in-person selection committee meeting.

The selection committee reached a consensus on 12 Top Picks and 12 Honorable Mentions. The Top Pick selections were invited to prepare an article to be included in this special issue. Because these magazine articles are much shorter than the original conference papers, they tend to be more high-level and more qualitative than the original conference publications, providing an excellent introduction to these highly innovative contributions. The Honorable Mentions are top papers that the selection committee unfortunately could not recognize as Top Picks because of magazine space constraints; these are acknowledged in the Guest Editors' Introduction. I encourage you to read these important contributions to our field and share your thoughts with students and colleagues.

Having participated in the selection committee myself, I was deeply impressed by the effectiveness of the new review process. In particular, I found it interesting to observe that the committee reached a consensus that very closely aligned with the ranking obtained by the 10 reviews for each of the second-round papers. This makes me wonder whether we still need an in-person selection committee meeting. Of course, the meeting itself has great value in terms of generating interesting discussions and providing the opportunity to meet colleagues from our community, but it undeniably also imposes a big cost in terms of time, effort, money, and carbon footprint (with many committee members flying in and out from all over the world).

Glancing over the set of papers selected for Top Picks and Honorable Mentions, one important trend has emerged just recently, namely the focus on accelerators and hardware specialization. A good number of papers are related to acceleration in the broad sense. This does not come as a surprise given current application trends, along with the end of Dennard scaling, which pushes architects to improve system performance within stringent power and cost envelopes through hardware acceleration. We observe this trend throughout the entire computing landscape, from mobile devices to large-scale datacenters. There is a lot of exciting research and advanced development going on in this area by many research groups in industry and academia, and I expect many more important advances in the near future. Next to this emerging trend, there is (still) a good fraction of outstanding papers in more traditional areas, including microarchitecture, memory hierarchy, memory consistency, multicore, power management, security, and simulation methodology.

I want to share a couple more thoughts with you regarding the Top Picks procedure that arose from conversations I've had with various people in our community. I'd love to get the broader community's feedback on this, so please don't hesitate to contact me and share your thoughts.

One thought relates to the number of selected Top Picks being too restrictive. There is a hard cap of only 12 Top Picks. On one hand, we want the process to be selective and Top Picks recognition to be prestigious. On the other hand, our community is growing. Our top-tier conferences, such as ISCA, MICRO, HPCA, and ASPLOS, receive an ever-increasing number of papers to review, and the number of accepted papers is increasing as well. One could argue that in response we need to recognize more papers as Top Picks. The hard constraint that we are hitting here is the page limit we have for the magazine, because the number of pages is related to the production cost. One solution may be to have more Top Picks selections but fewer pages allocated per selected article, but this may compromise the comprehensiveness of the articles. Another solution may be to recognize more Honorable Mentions, because they don't affect the page count. Or, we may want to electronically publish the three-page Top Picks submissions (paper summary and potential impact, as mentioned earlier) as they are, if the authors agree. This would not incur any production cost at all, yet the community would benefit from reading them. Yet another solution may be to select more than 12 Top Picks and publish them in different issues of the magazine. The counterargument here is that we have only six issues per year, which makes it difficult to argue for more than one issue devoted to Top Picks.

Another issue relates to the timing of the Top Picks selection. Our community has relatively few awards, and Top Picks is an important vehicle in our community to recognize top-quality research. However, one may argue whether selecting Top Picks one year after publication is too soon; it might make sense to wait a couple more years before recognizing the best research contributions of the year. We may not want to wait as long as ISCA's Influential Paper Award (15 years after publication) or MICRO's Test of Time Award (18 to 22 years after publication), but still, one could argue for waiting a few more years before understanding the true value of a novel research contribution and how it impacts our field. An important argument in this discussion is that awards are generally more important to young researchers than they are for senior researchers. Young researchers looking for a faculty or research position in a leading academic institute or industry lab need recognition fairly soon in their careers as they compete with researchers from other fields that have more awards. Senior researchers, on the other hand, do not need the recognition as much, or at least their time scale is (much) longer.

Please let me know your thoughts on these ideas or any other concerns you may have. I'm open to any suggestions. My only concern is to make sure Top Picks continues to recognize the best research in our field while serving the best interests of both the community and IEEE Micro.

Before wrapping up, I want to highlight that this issue also includes an award testimonial. Uri Weiser received the 2016 Eckert-Mauchly Award for his seminal contributions to the field of computer architecture over the course of his 40-year career in industry and academia. Uri Weiser single-handedly convinced Intel executives to continue designing CISC-based processors by showing that, through adding new features such as superscalar execution, branch prediction, and split instruction and data caches, the x86 processors could be made competitive against the RISC family of processors initiated by IBM and Berkeley. This laid the foundation for the Intel Pentium processor. Uri Weiser made several other seminal contributions, including the design of instruction-set extensions (that is, Intel's MMX) for supporting multimedia applications. The Eckert-Mauchly Award is considered the computer architecture community's most prestigious award. I wholeheartedly congratulate Uri Weiser on the award and thank him for his insightful testimonial.

With that, I wish you happy reading, as always!

Lieven Eeckhout
Editor in Chief
IEEE Micro

Lieven Eeckhout is a professor in the Department of Electronics and Information Systems at Ghent University. Contact him at [email protected].

Guest Editors' Introduction

Top Picks from the 2016 Computer Architecture Conferences

Aamer Jaleel
Moinuddin Qureshi, Georgia Tech

It is our pleasure to introduce this year's Top Picks in Computer Architecture. This issue is the culmination of the hard work of the selection committee, which chose from 113 submissions that were published in computer architecture conferences in 2016. We followed the precedent set by last year's co-chairs and encouraged the selection committee members to consider characteristics that make a paper worthy of being a "top pick." Specifically, we asked them to consider whether a paper challenges conventional wisdom, establishes a new area of research, is the definitive "last word" in an established research area, has a high potential for industry impact, and/or is one they would recommend to others to read.

Since the number of papers that could be selected for this Top Picks special issue was limited to 12, we continued the precedent set over the past two years of having the selection committee recognize 12 additional high-quality papers for Honorable Mention. We strongly encourage you to read these papers (see the "Honorable Mentions" sidebar). Before we present the list of articles appearing in this special issue, we will first describe the new review process that we implemented to improve the paper selection process.

Review Process

A selection committee comprising 31 members reviewed all 113 papers (see the "Selection Committee" sidebar). This year, we tried a different selection process compared to previous years' Top Picks, keeping in mind the constraints and objectives that are unique to Top Picks. The conventional approach to Top Picks selection has largely remained similar to that used in our conferences (for example, four to five reviews per paper and a four-to-six-point grading scale). For Top Picks, the number of papers that can be accepted is fixed (11 to 12), and the selection committee's primary job is to identify the top 12 papers out of all the submitted papers, instead of providing a detailed critique of the technical work and how the paper can be improved. The papers submitted to Top Picks tend to be of much higher (average) quality than the typical paper submitted at our conferences, and in many cases the reviewers are already aware of the work (through prior reviewing, reading the papers, or attending the presentations). Therefore, the time and effort spent reviewing Top Picks papers tends to be less than that spent reviewing typical conference submissions.

We identified two key areas in which the Top Picks selection process could be improved. First, a small number of reviewers (approximately five) made the decisions for Top Picks. The confidence in selection could be improved significantly by having a larger number of reviews (approximately 10) per paper, especially for the papers that are likely to be discussed at the selection committee meeting. This also ensures that reviewers are more engaged at the meeting and make informed decisions. Second, the selection of Top Picks gets overly influenced by excessively harsh or generous reviewers, who either give scores at extreme ends or advocate for too few or too many papers from their stack.

Honorable Mentions

- "Exploiting Semantic Commutativity in Hardware Speculation" by Guowei Zhang, Virginia Chiu, and Daniel Sanchez (MICRO 2016). This paper introduces architectural support to exploit a broad class of commutative updates, enabling update-heavy applications to scale to thousands of cores.
- "The Computational Sprinting Game" by Songchun Fan, Seyed Majid Zahedi, and Benjamin C. Lee (ASPLOS 2016). Computational sprinting is a mechanism that supplies extra power for short durations to enhance performance. This paper introduces game theory for allocating shared power between multiple cores.
- "PoisonIvy: Safe Speculation for Secure Memory" by Tamara Silbergleit Lehman, Andrew D. Hilton, and Benjamin C. Lee (MICRO 2016). Integrity verification is a main cause of slowdown in secure memories. PoisonIvy provides a way to enable safe speculation on unverified data by tracking the instructions that consume the unverified data using poisoned bits.
- "Data-Centric Execution of Speculative Parallel Programs" by Mark C. Jeffrey, Suvinay Subramanian, Maleen Abeydeera, Joel Emer, and Daniel Sanchez (MICRO 2016). The authors' technique enables speculative parallelization (such as thread-level speculation and transactional memory) to scale to thousands of cores. It also makes speculative parallelization as easy to program as sequential programming.
- "Efficiently Scaling Out-of-Order Cores for Simultaneous Multithreading" by Faissal M. Sleiman and Thomas F. Wenisch (ISCA 2016). This paper demonstrates that it is possible to unify in-order and out-of-order issue into a single, integrated, energy-efficient SMT microarchitecture.
- "Racer: TSO Consistency via Race Detection" by Alberto Ros and Stefanos Kaxiras (MICRO 2016). The authors propose a scalable approach to enforce coherence and TSO consistency without directories, timestamps, or intervention.
- "The Anytime Automaton" by Joshua San Miguel and Natalie Enright Jerger (ISCA 2016). This paper provides a general, safe, and robust approximate computing paradigm that abstracts away the challenge of guaranteeing user acceptability from the system architect.
- "Accelerating Markov Random Field Inference Using Molecular Optical Gibbs Sampling Units" by Siyang Wang, Xiangyu Zhang, Yuxuan Li, Ramin Bashizade, Song Yang, Chris Dwyer, and Alvin R. Lebeck (ISCA 2016). This paper proposes cross-layer support for probabilistic computing using novel technologies and specialized architectures.
- "Stripes: Bit-Serial Deep Neural Network Computing" by Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor M. Aamodt, and Andreas Moshovos (MICRO 2016). The authors demonstrate that bit-serial computation can lead to high-performance and energy-efficient designs whose performance and accuracy adapt to precision at a fine granularity.
- "Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL" by Donggyu Kim, Adam Izraelevitz, Christopher Celio, Hokeun Kim, Brian Zimmer, Yunsup Lee, Jonathan Bachrach, and Krste Asanović (ISCA 2016). This paper proposes a sample-based RTL energy modeling methodology for fast and accurate energy evaluation.
- "Back to the Future: Leveraging Belady's Algorithm for Improved Cache Replacement" by Akanksha Jain and Calvin Lin (ISCA 2016). The authors' algorithm enhances cache replacement by learning the replacement decisions made by Belady's algorithm. The paper also presents a novel mechanism to efficiently simulate Belady behavior.
- "ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars" by Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R. Stanley Williams, and Vivek Srikumar (ISCA 2016). The authors advance the state of the art in deep network accelerators by an order of magnitude and overcome the challenges of analog-digital conversion with innovative encodings and pipelines suitable for precise and energy-efficient analog acceleration.


Selection Committee

- Tor Aamodt, University of British Columbia
- Alaa Alameldeen, Intel
- Murali Annavaram, University of Southern California
- Todd Austin, University of Michigan
- Chris Batten, Cornell University
- Luis Ceze, University of Washington
- Sandhya Dwarkadas, University of Rochester
- Lieven Eeckhout, Ghent University
- Joel Emer, Nvidia and MIT
- Babak Falsafi, EPFL
- Hyesoon Kim, Georgia Tech
- Nam Sung Kim, University of Illinois at Urbana–Champaign
- Benjamin Lee, Duke University
- Hsien-Hsin Lee, Taiwan Semiconductor Manufacturing Company
- Gabriel Loh, AMD
- Debbie Marr, Intel
- Andreas Moshovos, University of Toronto
- Onur Mutlu, ETH Zurich
- Ravi Nair, IBM
- Milos Prvulovic, Georgia Tech
- Scott Rixner, Rice University
- Eric Rotenberg, North Carolina State University
- Karu Sankaralingam, University of Wisconsin
- Yanos Sazeides, University of Cyprus
- Simha Sethumadhavan, Columbia University
- Andre Seznec, INRIA
- Dan Sorin, Duke University
- Viji Srinivasan, IBM
- Karin Strauss, Microsoft
- Tom Wenisch, University of Michigan
- Antonia Zhai, University of Minnesota

We wanted to ensure that all reviewers play an equal role in the selection, regardless of their harshness or generosity. For example, we could give all reviewers an equal voice by requiring them to advocate for a fixed number of papers from their stack. We used the data from the past three years' Top Picks meetings to analyze the process and used this data to drive the design of our process. For example, the typical acceptance rate of Top Picks is approximately 10 percent; therefore, if we assign 15 papers to each reviewer, then each reviewer can be expected to have only 1.5 Top Picks papers on average in their stack, and the likelihood of having 5 or more Top Picks papers in the stack would be extremely small.

Based on the data and constraints of Top Picks, we formulated a ranking-based two-phase process. The objective of the first phase was to filter about 35 to 40 papers that would be discussed at the selection committee meeting. The objective of the second phase was to increase the number of reviews per paper to about 10 and ask each reviewer to provide a concrete decision for each assigned paper: whether it should be selected as a Top Pick or an Honorable Mention, or neither. In the first phase, each reviewer was assigned exactly 14 papers and was asked to recommend exactly five papers (Top 5) to the second phase. Each paper received four ratings in this phase. If a paper got three or more ratings of Top 5, it automatically advanced to the second phase. If the paper had two ratings of Top 5, then both positive reviewers had to champion the paper for it to advance to the second phase. Papers with fewer than two ratings of Top 5 did not advance to the second phase. A total of 38 papers advanced to the second phase, and each such paper got a total of 9 to 10 reviews. In the second phase, each reviewer was assigned seven to eight papers in addition to the four to five papers that survived the first phase. Each reviewer thus had 12 papers and was asked to place exactly 4 of them into each category: Top Pick, Honorable Mention, and neither.

The selection committee meeting was held in person in Atlanta, Georgia, on 17 December 2016. At the meeting, the 38 papers were rank-ordered on the basis of the number of Top Picks votes and the average rating each paper received in the second phase. If, after the in-person discussion, 60 percent or more reviewers rated a paper as a Top Pick, the paper was selected as a Top Pick. Otherwise, the decision to select the paper as a Top Pick (or Honorable Mention or neither) was made by a committee-wide vote using a simple majority. We observed that the top eight ranked papers all got accepted as Top Picks, and four more papers were selected as Top Picks from the next nine papers. Overall, of the top 25 papers, all but one was selected as either a Top Pick or an Honorable Mention. Thus, having a large number of reviews per paper reduced the dependency on the in-person discussion. Coincidentally, the day before the selection committee meeting there was a hurricane, which caused many flights to be canceled, and 4 of the 31 selection committee members were unable to attend the meeting. However, having 9 to 10 reviewers per paper still ensured that there were at least eight reviewers present for each paper discussed at the meeting, resulting in a robust and high-confidence process, even with a relatively high rate of absentees. Given the unique constraints and objectives of Top Picks, we hope that such a process, with a larger number of reviews per paper and robustness to variation in the generosity levels of reviewers (for example, ranking papers into fixed-size bins), will be useful for future Top Picks selection committees as well.
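For readers who want the phase-one mechanics spelled out, the sketch below restates the advancement rule and the reviewer-load arithmetic in code. It is our own illustration, not the committee's tooling; the function names and the independence assumption behind the probability estimate are ours.

```python
from math import comb

# Sketch of the phase-one advancement rule described above (illustrative only).
def advances_to_phase2(top5_ratings, champions):
    """top5_ratings: 'Top 5' ratings a paper received out of its 4 phase-one reviews.
    champions: positive reviewers willing to champion the paper."""
    if top5_ratings >= 3:
        return True            # three or more Top 5 ratings: advances automatically
    if top5_ratings == 2:
        return champions == 2  # exactly two: both positive reviewers must champion it
    return False               # fewer than two: does not advance

# Reviewer-load arithmetic: with a ~10 percent acceptance rate and 15 papers per
# stack, a reviewer expects about 1.5 eventual Top Picks; assuming independence
# (a simplification), the chance of seeing 5 or more in one stack is tiny.
expected = 0.10 * 15
p_five_or_more = sum(comb(15, k) * 0.10**k * 0.90**(15 - k) for k in range(5, 16))
print(expected, round(p_five_or_more, 4))   # prints 1.5 and approximately 0.0127
```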

Selected Papers

With the slowing down of conventional means for improving performance, the architecture community has been investigating accelerators to improve performance and energy efficiency. This was evident in the emergence of a large number of papers on accelerators appearing throughout the architecture conferences in 2016. Given the emphasis on accelerators, it is no surprise that more than half of the articles in this issue focus on architecting accelerators. Memory systems and energy considerations are the two other areas from which the Top Picks papers were selected.

Accelerators

Data movement is a primary factor that determines the energy efficiency and effectiveness of accelerators. "Using Dataflow to Optimize Energy Efficiency of Deep Neural Network Accelerators" by Yu-Hsin Chen and his colleagues describes a spatial architecture that optimizes the dataflow for energy efficiency. This article also has an insightful framework for classifying different accelerators based on access patterns.

"The Memristive Boltzmann Machines" by Mahdi Nazm Bojnordi and Engin Ipek proposes a memory-centric hardware accelerator for combinatorial optimization and deep learning that leverages in-situ bit-line computation in memristive arrays to eliminate the need for exchanging data between the memory arrays and the computational units.

The concept of using analog computing for efficient computation is also explored by Yipeng Huang and colleagues in "Analog Computing in a Modern Context: A Linear Algebra Accelerator Case Study." The authors address the typical challenges faced by analog computing, such as limited problem size, limited dynamic range, and limited precision.

In contrast to the first three articles, which use domain-specific acceleration, "Domain Specialization Is Generally Unnecessary for Accelerators" by Tony Nowatzki and his colleagues focuses on retaining the programmability of accelerators while maintaining their energy efficiency. The authors use an architecture that has a large number of tiny cores with the key building blocks typically required for accelerators and configure these cores intelligently based on the domain requirement.

Large-Scale Accelerators

The next three articles look at enhancing the scalability of accelerators so that they can handle larger problem sizes and cater to varying problem domains. "Configurable Clouds" by Adrian Caulfield and his colleagues describes a cloud-scale acceleration architecture that connects accelerator nodes within a datacenter using a high-speed FPGA fabric; it lets the system accelerate a wide variety of applications and has been deployed in Microsoft datacenters.

In "Specializing a Planet's Computation: ASIC Clouds," Moein Khazraee and his colleagues target scale-out workloads comprising many independent but similar jobs, often on behalf of many users. Their architecture shows a way to make ASIC usage more economical, because different users can potentially share the cost of fabricating a given ASIC, rather than each design team incurring the cost of fabricating the ASIC.

"DRAF: A Low-Power DRAM-Based Reconfigurable Acceleration Fabric" by Mingyu Gao and his colleagues describes a way to increase the size of FPGA fabrics at low cost by using DRAM instead of SRAM for the storage inside the FPGA, thereby enabling a high-density and low-power reconfigurable fabric.

Memory and Storage Systems

Memory systems continue to be important in determining the performance and efficiency of computer systems. This issue features three articles that focus on improving memory and storage systems. "Agile Paging for Efficient Memory Virtualization" by Jayneel Gandhi and his colleagues addresses the performance overhead of virtual memory in virtualized environments by getting the best of both worlds: nested paging and shadow paging.

Virtual address translation can sometimes affect the correctness of memory consistency models. Daniel Lustig and his colleagues address this problem in their article, "Transistency Models: Memory Ordering at the Hardware–OS Interface." The authors propose to rigorously integrate memory consistency models and address translation at both the microarchitecture and architecture levels.

Moving on to the storage domain, in "Toward a DNA-Based Archival Storage System," James Bornholt and his colleagues demonstrate DNA-based storage architected as a key-value store. Their design enables random access and is equipped with error-correction capability to handle the imperfections of the read and write process. As the demand for cheap storage continues to increase, such alternative technologies have the potential to provide a major breakthrough in storage capability.

Energy Considerations

The final two articles are related to optimizing energy or operating under low energy budgets. Modern processors are provisioned with a timing margin to protect against temperature inversion. In the article "Ti-states: Power Management in Active Timing Margin Processors," Yazhou Zu and his colleagues show how actively monitoring the temperature on the chip and dynamically reducing this timing margin can result in significant power savings.

Energy-harvesting systems represent an extreme end of energy-constrained computing in which the system performs computation only when harvested energy is present. One challenge in such systems is to provide debugging functionality for software, because a system failure could happen due to either lack of energy or incorrect code. "An Energy-Aware Debugger for Intermittently Powered Systems" by Alexei Colin and his colleagues describes a hardware–software debugger for an intermittent energy-harvesting system that allows software verification to proceed without interference from the energy-harvesting circuit.

We hope you enjoy reading these articles and that you will explore both the original conference versions and the Honorable Mention papers. We welcome your feedback on this special issue and any suggestions for next year's Top Picks issue.

Acknowledgments

We thank Lieven Eeckhout for providing support and direction as we tried out the new paper selection process. Lieven also handled the papers that were conflicted with both co-chairs. We also thank the selection committee co-chairs of the past three Top Picks issues (Gabe Loh, Babak Falsafi, Luis Ceze, Karin Strauss, Milo Martin, and Dan Sorin) for providing the review statistics from their editions of Top Picks and for answering our questions. We thank Vinson Young for handling the submission website and Prashant Nair and Jian Huang for facilitating the process at the selection committee meeting. We owe a huge thanks to our fantastic selection committee, whose members not only diligently reviewed all the papers but also were supportive of the new review process. Furthermore, the selection committee members spent a day attending the in-person meeting in Atlanta, fairly close to the holiday season. Finally, we thank all the authors who submitted their work for consideration to this Top Picks issue and the authors of the selected papers for producing the final versions of their papers for this issue.

Aamer Jaleel is a principal research scientist at Nvidia. Contact him at [email protected].

Moinuddin Qureshi is an associate professor in the School of Electrical and Computer Engineering at Georgia Tech. Contact him at [email protected].

Using Dataflow to Optimize Energy Efficiency of Deep Neural Network Accelerators

Yu-Hsin Chen, Massachusetts Institute of Technology
Joel Emer, Nvidia and Massachusetts Institute of Technology
Vivienne Sze, Massachusetts Institute of Technology

The authors demonstrate the key role dataflows play in optimizing energy efficiency for deep neural network (DNN) accelerators. They introduce both a systematic approach to analyze the problem and a new dataflow, called Row-Stationary, that is up to 2.5 times more energy efficient than existing dataflows in processing a state-of-the-art DNN. This article provides guidelines for future DNN accelerator designs.

Recent breakthroughs in deep neural networks (DNNs) are leading to an industrial revolution based on AI. The superior accuracy of DNNs, however, comes at the cost of high computational complexity. General-purpose processors no longer deliver sufficient processing throughput and energy efficiency for DNNs. As a result, demands for dedicated DNN accelerators are increasing in order to support the rapidly growing use of AI.

The processing of a DNN mainly comprises multiply-and-accumulate (MAC) operations (see Figure 1). Most of these MACs are performed in the DNN's convolutional layers, in which multichannel filters are convolved with multichannel input feature maps (ifmaps, such as images). This generates partial sums (psums) that are further accumulated into multichannel output feature maps (ofmaps). Because the MAC operations have few data dependencies, DNN accelerators can use high parallelism to achieve high processing throughput. However, this processing also requires a significant amount of data movement: each MAC performs three reads and one write of data access. Because moving data can consume more energy than the computation itself,1 optimizing data movement becomes key to achieving high energy efficiency.

Figure 1. In the processing of a deep neural network (DNN), multichannel filters are convolved with the multichannel input feature maps, which then generate the output feature maps. The processing of a DNN comprises many multiply-and-accumulate (MAC) operations.

Data movement can be optimized by exploiting data reuse in a multilevel storage hierarchy. By maximizing the reuse of data in the lower-energy-cost storage levels (such as local scratchpads), thus reducing data accesses to the higher-energy-cost levels (such as DRAM), the overall data movement energy consumption is minimized.

In fact, DNNs present many data reuse opportunities. First, there are three types of input data reuse: filter reuse, wherein each filter weight is reused across multiple ifmaps; ifmap reuse, wherein each ifmap pixel is reused across multiple filters; and convolutional reuse, wherein both ifmap pixels and filter weights are reused due to the sliding-window processing in convolutions. Second, the intermediate psums are reused through the accumulation of ofmaps. If not accumulated and reduced as soon as possible, the psums can pose additional storage pressure.

A design can exploit these data reuse opportunities by finding the optimal MAC operation mapping, which determines both the temporal and spatial scheduling of the MACs on a highly parallel architecture. Ideally, data in the lower-cost storage levels is reused by as many MACs as possible before replacement. However, due to the limited amount of local storage, input data reuse (ifmaps and filters) and psum reuse cannot be fully exploited simultaneously. For example, reusing the same input data for multiple MACs generates psums that cannot be accumulated together and, as a result, consume extra storage space. Therefore, the system energy efficiency is maximized only when the mapping balances all types of data reuse in a multilevel storage hierarchy.

The search for the mapping that maximizes system energy efficiency thus becomes an optimization process. This optimization must consider the following factors: the data reuse opportunities available for a given DNN shape and size (for example, the number of filters, number of channels, size of filters, and feature map size), the energy cost of data access at each level of the storage hierarchy, and the available processing parallelism and storage capacity. The first factor is a function of the workload, whereas the second and third factors are a function of the specific accelerator implementation.

Because of implementation tradeoffs, previous proposals for DNN accelerators have made choices on the subset of mappings that can be supported. Therefore, for a specific DNN accelerator design, the optimal mapping can be selected only from the subset of supported mappings instead of the entire mapping space. The subset of supported mappings is usually determined by a set of mapping rules, which also characterizes the hardware implementation. Such a set of mapping rules defines a dataflow.

Because state-of-the-art DNNs come in a wide range of shapes and sizes, the corresponding optimal mappings also vary. The question is, can we find a dataflow that accommodates the mappings that optimize data movement for various DNN shapes and sizes? In this article, we explore different DNN dataflows to answer this question in the context of a spatial architecture.2 In particular, we present the following key contributions:3

- An analogy between DNN accelerators and general-purpose processors that clearly identifies the distinct aspects of operation of a DNN accelerator, which provides insights into opportunities for innovation.
- A framework that quantitatively evaluates the energy consumption of different mappings for different DNN shapes and sizes, which is an essential tool for finding the optimal mapping.
- A taxonomy that classifies existing dataflows from previous DNN accelerator projects, which helps to understand a large body of work despite differences in the lower-level details.
- A new dataflow, called Row-Stationary (RS), which is the first dataflow to optimize data movement for superior system energy efficiency. It has also been verified in a fabricated DNN accelerator chip, Eyeriss.4
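As a concrete reference for the terminology used throughout the article, the sketch below spells out the loop nest that a convolutional layer reduces to. It is an illustrative example only, not code from the Eyeriss project; dimension names follow Figure 1 (N ifmaps, M filters, C channels, H-by-H ifmaps, R-by-R filter planes, E-by-E ofmaps), and it assumes stride 1 with no padding.

```python
import numpy as np

def conv_layer(ifmaps, filters):
    """Hypothetical sketch: ifmaps [N, C, H, H], filters [M, C, R, R] -> ofmaps [N, M, E, E]."""
    N, C, H, _ = ifmaps.shape
    M, _, R, _ = filters.shape
    E = H - R + 1                          # output size for stride 1, no padding (assumed)
    ofmaps = np.zeros((N, M, E, E))
    for n in range(N):                     # ifmap (batch) index
        for m in range(M):                 # filter / output channel
            for c in range(C):             # input channel
                for y in range(E):         # output row
                    for x in range(E):     # output column
                        for i in range(R):
                            for j in range(R):
                                # One MAC. The same filter weight is reused across all
                                # (n, y, x) iterations (filter and convolutional reuse);
                                # the same ifmap pixel is reused across all m (ifmap
                                # reuse); the psum for (n, m, y, x) is reused across
                                # the c, i, j accumulation (psum reuse).
                                ofmaps[n, m, y, x] += (
                                    ifmaps[n, c, y + i, x + j] * filters[m, c, i, j]
                                )
    return ofmaps
```

A dataflow, in the sense used in this article, amounts to a rule for how these loop iterations are ordered and assigned to processing elements.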

We evaluate the energy efficiency of the RS dataflow and compare it to other dataflows from the taxonomy. The comparison uses a popular state-of-the-art DNN model, AlexNet,5 with a fixed amount of hardware resources. Simulation results show that the RS dataflow is 1.4 to 2.5 times more energy efficient than other dataflows in the convolutional layers. It is also at least 1.3 times more energy efficient in the fully connected layers for batch sizes of at least 16. These results will provide guidance for future DNN accelerator designs.

An Analogy to General-Purpose Processors

Figure 2 shows an analogy between the operation of DNN accelerators and general-purpose processors. In conventional computer systems, the compiler translates a program into machine-readable binary codes for execution; in the processing of DNNs, the mapper translates the DNN shape and size into a hardware-compatible mapping for execution. While the compiler usually optimizes for performance, the mapper especially optimizes for energy efficiency.

Figure 2. An analogy between the operation of DNN accelerators (roman text) and that of general-purpose processors (italicized text).

The dataflow is a key attribute of a DNN accelerator and is analogous to one of the parts of a general-purpose processor's architecture. Similar to the role of an ISA or memory consistency model, the dataflow defines the mapping rules that the mapper must follow in order to generate hardware-compatible mappings. Later in this article, we will introduce several previously proposed dataflows.

Other attributes of a DNN accelerator, such as the storage organization, are also analogous to parts of the general-purpose processor architecture, such as scratchpads or virtual memory. We consider these attributes part of the architecture, instead of the microarchitecture, because they may largely remain invariant across implementations, although, similar to GPUs, the distinction between architecture and microarchitecture is likely to blur for DNN accelerators. Implementation details, such as those that determine the access energy cost at each level of the storage hierarchy and the latency between processing elements (PEs), are analogous to the microarchitecture of processors, because a mapping will be valid despite changes in these characteristics. However, they play a vital part in determining a mapping's energy efficiency.

The mapper's goal is to search the mapping space for the mapping that best optimizes data movement. The size of the entire mapping space is determined by the total number of MACs, which can be calculated from the DNN shape and size. However, only a subset of the space is valid given the mapping rules defined by a dataflow. For example, the dataflow can enforce the following mapping rule: all MACs that use the same filter weight must be mapped on the same PE in the accelerator. Then, it is the mapper's job to find the exact ordering of these MACs on each PE by evaluating and comparing the energy efficiency of the different valid ordering options.

As in conventional compilers, performing evaluation is an integral part of the mapper. The evaluation process takes a certain mapping as input and gives an energy consumption estimate based on the available hardware resources (microarchitecture) and the data reuse opportunities extracted from the DNN shape and size (program). In the next section, we introduce a framework that can perform this evaluation.

Evaluating Energy Consumption

Finding the optimal mapping requires evaluation of the energy consumption for various mapping options. In this article, we evaluate energy consumption based on a spatial architecture,2 because many of the previous designs can be thought of as instances of such an architecture. The spatial architecture (see Figure 3) consists of an array of PEs and a multilevel storage hierarchy. The PE array provides high parallelism for high throughput, whereas the storage hierarchy can be used to exploit data reuse in a four-level setup (in decreasing energy-cost order): DRAM, global buffer, network-on-chip (NoC, for inter-PE communication), and the register file (RF) in each PE as a local scratchpad.

Figure 3. Spatial array architecture comprises an array of processing elements (PEs) and a multilevel storage hierarchy, including the off-chip DRAM, global buffer, network-on-chip (NoC), and register file (RF) in the PE. The off-chip DRAM, global buffer, and PEs in the array can communicate with each other directly through the input and output FIFOs (the iFIFO and oFIFO). Within each PE, the PE FIFO (pFIFO) controls the traffic going in and out of the arithmetic logic unit (ALU), including from the RF or other storage levels.

In this architecture, we assume all data types can be stored and accessed at any level of the storage hierarchy. Input data for the MAC operations, that is, filter weights and ifmap pixels, are moved from the most expensive level (DRAM) to the lower-cost levels. Ultimately, they are usually delivered from the least expensive level (RF) to the arithmetic logic unit (ALU) for computation. The results from the ALU, that is, psums, generally move in the opposite direction. The orchestration of this movement is determined by the mappings for a specific DNN shape and size under the mapping rule constraints of a specific dataflow architecture.

Given a specific mapping, the system energy consumption is estimated by counting the number of times each data value from all data types (ifmaps, filters, psums) is reused at each level of the four-level memory hierarchy, and weighing it with the energy cost of accessing that specific storage level. Figure 4 shows the normalized energy cost of accessing data from each storage level relative to the computation of a MAC at the ALU. We extracted these numbers from a commercial 65-nm process and used them in our final experiments.

Figure 4. Normalized energy cost relative to the computation of one MAC operation at the ALU: 1x for a MAC at the ALU (the reference), 1x for the RF (0.5 to 1.0 Kbytes), 2x for the NoC (1 to 2 mm), 6x for the global buffer (more than 100 Kbytes), and 200x for DRAM. Numbers are extracted from a commercial 65-nm process.

Figure 5 uses a toy example to show how a mapping determines the data reuse at each storage level, and thus the energy consumption, in a three-PE setup. In this example, we make the following assumptions: each ifmap pixel is used by 24 MACs, all ifmap pixels can fit into the global buffer, and the RF of each PE can hold only one ifmap pixel at a time. The mapping first reads an ifmap pixel from DRAM to the global buffer, then from the global buffer to the RF of each PE through the NoC, and reuses it from the RF for four consecutive MACs in each PE. The mapping then switches to MACs that use other ifmap pixels, so the original one in the RF is replaced by new ones, due to limited capacity. Therefore, the original ifmap pixel must be fetched from the global buffer again when the mapping switches back to the MACs that use it. In this case, the same ifmap pixel is reused at the DRAM, global buffer, NoC, and RF 1, 2, 6, and 24 times, respectively. The corresponding normalized energy consumption of moving this ifmap pixel is obtained by weighing these counts with the normalized energy numbers in Figure 4 and then adding them together (that is, 1 × 200 + 2 × 6 + 6 × 2 + 24 × 1 = 248). The same approach can be applied to the other data types.

Figure 5. Example of how a mapping determines data reuse at each storage level. This example shows the data movement of one ifmap pixel going through the storage hierarchy. Each arrow means moving data between specific levels (or to an ALU for computation).

This analysis framework can be used not only to find the optimal mapping for a specific dataflow, but also to evaluate and compare the energy consumption of different dataflows. In the next section, we describe various existing dataflows.

A Taxonomy of Existing DNN Dataflows

Numerous previous efforts have proposed solutions for DNN acceleration. These designs reflect a variety of tradeoffs between performance and implementation complexity. Despite their differences in low-level implementation details, we find that many of them can be described as embodying a set of rules, that is, a dataflow, that defines the valid mapping space based on how they handle data. As a result, we can classify them into a taxonomy.

- The Weight-Stationary (WS) dataflow keeps filter weights stationary in each PE's RF by enforcing the following mapping rule: all MACs that use the same filter weight must be mapped on the same PE for processing serially. This maximizes the convolutional and filter reuse of weights in the RF, thus minimizing the energy consumption of accessing weights (for example, work by Srimat Chakradhar and colleagues6 and Vinayak Gokhale and colleagues7). Figure 6a shows the data movement of a common WS dataflow implementation. While each weight stays in the RF of each PE, the ifmap pixels are broadcast to all PEs, and the generated psums are then accumulated spatially across PEs.
- The Output-Stationary (OS) dataflow keeps psums stationary by accumulating them locally in the RF. The mapping rule is that all MACs that generate psums for the same ofmap pixel must be mapped on the same PE serially. This maximizes psum reuse in the RF, thus minimizing the energy consumption of psum movement (for example, work by Zidong Du and colleagues,8 Suyog Gupta and colleagues,9 and Maurice Peemen and colleagues10). The data movement of a common OS dataflow implementation is to broadcast filter weights while passing ifmap pixels spatially across the PE array (see Figure 6b).
- Unlike the previous two dataflows, which keep a certain data type stationary, the No-Local-Reuse (NLR) dataflow keeps no data stationary locally so that it can trade the RF off for a larger global buffer. This is to minimize DRAM access energy consumption by storing more data on chip (for example, work by Tianshi Chen and colleagues11 and Chen Zhang and colleagues12). The corresponding mapping rule is that, at each processing cycle, all parallel MACs must come from a unique pair of filter and channel. The data movement of the NLR dataflow is to single-cast weights, multicast ifmap pixels, and spatially accumulate psums across the PE array (see Figure 6c).
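One way to read this taxonomy is to ask which operand a single PE pins in its register file while the remaining loops advance. The sketch below illustrates the contrast on a 1D convolution; it is our own simplified illustration, not code from the cited projects.

```python
# Our illustration (not from the cited papers): the same 1D convolution written
# three ways, to highlight which operand a single PE keeps pinned in its RF.

def weight_stationary(W, I):
    O = [0] * (len(I) - len(W) + 1)
    for k, w in enumerate(W):          # a weight is held in the PE's RF...
        for x in range(len(O)):        # ...and reused across every output position;
            O[x] += w * I[x + k]       # the psums it produces are accumulated elsewhere.
    return O

def output_stationary(W, I):
    O = [0] * (len(I) - len(W) + 1)
    for x in range(len(O)):            # a psum is held in the PE's RF until complete...
        for k, w in enumerate(W):      # ...while weights and ifmap pixels stream past,
            O[x] += w * I[x + k]       # so psum updates never leave the PE.
    return O

def no_local_reuse(W, I):
    O = [0] * (len(I) - len(W) + 1)
    for x in range(len(O)):            # nothing is pinned in an RF; in hardware, every
        for k in range(len(W)):        # weight, ifmap pixel, and psum access goes to a
            O[x] += W[k] * I[x + k]    # larger shared global buffer instead.
    return O

assert weight_stationary([1, 2], [1, 1, 1]) == output_stationary([1, 2], [1, 1, 1])
```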
Figure 6. Dataflow taxonomy: (a) Weight Stationary, (b) Output Stationary, and (c) No Local Reuse.

The three dataflows show distinct data movement patterns, which imply different tradeoffs. First, as Figures 6a and 6b show, the cost for keeping a specific data type stationary is to move the other types of data more. Second, the timing of data accesses also matters. For example, in the WS dataflow, each ifmap pixel read from the global buffer is broadcast to all PEs with properly mapped MACs on the PE array. This is more efficient than reading the same value multiple times from the global buffer and single-casting it to the PEs, which is the case for filter weights in the NLR dataflow (see Figure 6c). Other dataflows can make other tradeoffs. In the next section, we present a new dataflow that takes these factors into account to optimize energy efficiency.

An Energy-Efficient Dataflow

Although the dataflows in the taxonomy describe the design of many DNN accelerators, they optimize data movement only for a specific data type (for example, WS for weights) or storage level (NLR for DRAM). In this section, we introduce a new dataflow, called Row-Stationary (RS), which aims to optimize data movement for all data types in all levels of the storage hierarchy of a spatial architecture.

The RS dataflow divides the MACs into mapping primitives, each of which comprises a subset of MACs that run on the same PE in a fixed order. Specifically, each mapping primitive performs a 1D row convolution, so we call it a row primitive, and it intrinsically optimizes data reuse per MAC for all data types combined. Each row primitive is formed with the following rules:

- The MACs for applying a row of filter weights on a row of ifmap pixels, which generate a row of psums, must be mapped on the same PE.
- The ordering of these MACs enables the use of a sliding window for ifmaps, as shown in Figure 7.

Convolutional and psum reuse opportunities within a row primitive are fully exploited in the RF, given sufficient RF storage capacity.
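Concretely, a row primitive is the 1D convolution of one filter row with one ifmap row, executed on a single PE in sliding-window order (Figure 7). The sketch below is our own illustration of that rule, not code from the Eyeriss project.

```python
# Our illustration of one RS row primitive (see Figure 7): a filter row is held in
# a PE's RF and slid across an ifmap row, producing one psum row.
def row_primitive(filter_row, ifmap_row):
    R = len(filter_row)                              # e.g., (A, B, C)
    psum_row = []
    for out in range(len(ifmap_row) - R + 1):        # sliding-window order: x, then y, then z
        psum_row.append(sum(filter_row[i] * ifmap_row[out + i] for i in range(R)))
    return psum_row

# (A, B, C) over (a, b, c, d, e) yields (x, y, z), where x = A*a + B*b + C*c, and so on.
```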

Even with the RS dataflow, as defined by the row primitives, there are still a large number of valid mapping choices. These mapping choices arise both in the spatial and temporal assignment of primitives to PEs:

1. One spatial mapping option is to assign primitives with data rows from the same 2D plane on the PE array, to lay out a 2D convolution (see Figure 8). This mapping fully exploits convolutional and psum reuse opportunities across primitives in the NoC: the same rows of filter weights and ifmap pixels are reused across PEs horizontally and diagonally, respectively; psum rows are further accumulated across PEs vertically.
2. Another spatial mapping option arises when the size of the PE array is large, and the pattern shown in Figure 8 can be spatially duplicated across the PE array for various 2D convolutions. This not only increases utilization of PEs, but also further exploits filter, ifmap, and psum reuse opportunities in the NoC.
3. One temporal mapping option arises when row primitives from different 2D planes can be concatenated or interleaved on the same PE. As Figure 9 shows, primitives with different ifmaps, filters, and channels have filter reuse, ifmap reuse, and psum reuse opportunities, respectively. By concatenating or interleaving their computation together in a PE, it virtually becomes a larger 1D row convolution, which exploits these cross-primitive data reuse opportunities in the RF.
4. Another temporal mapping choice arises when the PE array size is too small, and the originally spatially mapped row primitives must be temporally folded into multiple processing passes (that is, the computation is serialized). In this case, the data reuse opportunities that are originally spatially exploited in the NoC can be temporally exploited by the global buffer to avoid DRAM accesses, given sufficient storage capacity.

Figure 8. Patterns of how row primitives from the same 2D plane are mapped onto the PE array in the RS dataflow.

As evident from the preceding list, the RS dataflow provides a high degree of mapping flexibility, such as using concatenation, interleaving, duplicating, and folding of the row primitives. The mapper searches for the exact amount to apply each technique in the optimal mapping—for example, how many filters are interleaved on the same PE to exploit ifmap reuse—to minimize overall system energy consumption.

Dataflow Comparison
In this section, we quantitatively compare the energy efficiency of different DNN dataflows in a spatial architecture, including those from the taxonomy and the proposed RS dataflow. We use AlexNet5 as the benchmarking DNN because it is one of the most popular DNNs available, and it comprises five convolutional (CONV) layers and three fully connected (FC) layers with a wide variety of shapes and sizes, which can more thoroughly evaluate the optimal mappings from each dataflow.

In order to have a fair comparison, we apply the following two constraints for all dataflows. First, the size of the PE array is fixed at 256 for constant processing throughput across dataflows. Second, the total hardware area is also fixed. For example, because the NLR dataflow does not use an RF, it can allocate more area for the global buffer. The corresponding hardware resource parameters are based on the RS dataflow implementation in Eyeriss, a DNN accelerator chip fabricated in 65-nm CMOS.4 By applying these constraints, we fix the total cost to implement the microarchitecture of each dataflow.
18 IEEE MICRO Filter 1 Ifmap 1 Psum 1 Channel 1Row 1 ∗ Row 1 = Row 1 Filter 1 Ifmap 1 and 2 Psum 1 and 2 ∗ Filter 1 Ifmap 2 Psum 2 Row 1 Row 1 Row 1= Row 1 Row 1 Channel 1 Row 1 ∗ Row 1 = Row 1 filter reuse (a)

Filter 1 Ifmap 1 Psum 1 Channel 1 Row 1 ∗ Row 1= Row 1 Filter 1 and 2 Ifmap 1 Psum 1 and 2 ∗ = Filter 2 Ifmap 1 Psum 2 Channel 1 Row 1 ∗ Row 1 = Row 1 Ifmap reuse (b)

Filter 1 Ifmap 1 Psum 1 Channel 1 Row 1∗ Row 1 = Row 1 Filter 1 Ifmap 1 Psum ∗ Filter 1 Ifmap 1 Psum 1 = Row 1 Channel 2 Row 1 ∗ Row 1 = Row 1 psum reuse (can be further accumulated) (c)

Figure 9. Row primitives from different 2D planes can be combined by concatenating or interleaving their computation on the same PE to further exploit data reuse at the RF level. (a) Two row primitives reuse the same filter row for different ifmap rows. (b) Two row primitives reuse the same ifmap row for different filter rows. (c) Two row primitives from different channels further accumulate psum rows.
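The sketch below (ours, not the article's code) illustrates the case of Figure 9c: two row primitives from different channels are interleaved on one PE so that their psum rows accumulate locally in the RF instead of being written out per channel. The sizes are arbitrary and purely illustrative.

```python
def interleave_channels(filter_rows, ifmap_rows):
    """Accumulate psum rows from several channels on one PE (as in Figure 9c).

    Each (filter_row, ifmap_row) pair is one row primitive; interleaving
    their computation lets partial sums accumulate locally rather than
    being stored and re-fetched per channel. Illustrative sketch only.
    """
    width = len(ifmap_rows[0]) - len(filter_rows[0]) + 1
    psum_row = [0] * width
    for f_row, i_row in zip(filter_rows, ifmap_rows):  # one primitive per channel
        for out in range(width):
            for k, weight in enumerate(f_row):
                psum_row[out] += weight * i_row[out + k]
    return psum_row

# Two channels, each with a 3-tap filter row and a 5-pixel ifmap row.
print(interleave_channels([[1, 0, 1], [2, 1, 0]],
                          [[1, 2, 3, 4, 5], [5, 4, 3, 2, 1]]))
```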

Therefore, the differences in energy efficiency are solely from the dataflows.

Figures 10a and 10b show the comparison of energy efficiency between dataflows in the CONV layers of AlexNet with an ifmap batch size of 16. Figure 10a gives the breakdown in terms of storage levels and ALU, and Figure 10b gives the breakdown in terms of data types. First, the ALU energy consumption is only a small fraction of the total energy consumption, which proves the importance of data movement optimization. Second, even though NLR has the lowest energy consumption in DRAM, its total energy consumption is still high, because most of the data accesses come from the global buffer, which are more expensive than those from the NoC or RF. Third, although WS and OS dataflows clearly optimize the energy consumption of accessing weights and psums, respectively, they sacrifice the energy consumption of moving other data types, and therefore do not achieve the lowest overall energy consumption. This shows that DRAM alone does not dictate energy efficiency, and optimizing the energy consumption for only a certain data type does not lead to the best system energy efficiency. Overall, the RS dataflow is 1.4 to 2.5 times more energy efficient than other dataflows in the CONV layers of AlexNet.

Figure 11 shows the same experiment results as in Figure 10b, except that it is for the FC layers of AlexNet. Compared to the CONV layers, the FC layers have no convolutional reuse and use many more filter weights. Still, the RS dataflow is at least 1.3 times more energy efficient than the other dataflows, which proves that the capability to optimize data movement for all data types is the key to achieving the highest overall energy efficiency. Note that the FC layers account for less than 20 percent of the total energy consumption in AlexNet. In recent DNNs, the number of FC layers has also been greatly reduced, making their energy consumption even less significant.


Figure 10. Comparison of energy efficiency between different dataflows in the convolutional (CONV) layers of AlexNet.5 (a) Breakdown in terms of storage levels and ALU versus (b) data types. OSA, OSB, and OSC are three variants of the OS dataflow that are commonly seen in different implementations.3

Figure 11. Comparison of energy efficiency between different dataflows in the fully connected (FC) layers of AlexNet.

Research on architectures for DNN accelerators has become very popular for its promising performance and wide applicability. This article has demonstrated the key role of dataflows in DNN accelerator design, and it shows how to systematically exploit all types of data reuse in a multilevel storage hierarchy for optimizing energy efficiency with a new dataflow. It challenges conventional design approaches, which focus more on optimizing parts of the problem, and shifts it toward a global optimization that considers all relevant metrics.

The taxonomy of dataflows lets us compare high-level design choices irrespective of low-level implementation details, and thus can be used to guide future designs. Although these dataflows are currently implemented on distinct architectures, it is also possible to come up with a union architecture that can support multiple dataflows simultaneously. The questions are how to choose a combination of dataflows that maximally benefits the search for optimal mappings, and how to support these dataflows with the minimum amount of hardware implementation overhead.

This article has also pointed out how the concept of DNN dataflows and the mapping of a DNN computation onto a dataflow can be viewed as analogous to a general-purpose processor's architecture and compiling onto that architecture. We hope this will open up space for computer architects to approach the design of DNN accelerators by applying the knowledge and techniques from a well-established research field in a more systematic manner, such as methodologies for design abstraction, modularization, and performance evaluation. For instance, a recent research trend for DNNs is to exploit data statistics. Specifically, different proposals on quantization, pruning, and data representation have all shown promising results on improving the performance of DNNs. Therefore, it is important that new architectures also take advantage of these findings. As compilers for general-purpose processors can take the profile of targeted workloads to further improve the performance of the generated binary, the analogy between general-purpose processors and DNN accelerators suggests that the mapper for DNN accelerators might also take the profile of targeted DNN statistics to further optimize the generated mappings. This is an endeavor we will leave for future work.

References
1. M. Horowitz, "Computing's Energy Problem (And What We Can Do About It)," Proc. IEEE Int'l Solid-State Circuits Conf. (ISSCC 14), 2014, pp. 10–14.
2. A. Parashar et al., "Triggered Instructions: A Control Paradigm for Spatially-Programmed Architectures," Proc. 40th Ann. Int'l Symp. Computer Architecture (ISCA 13), 2013, pp. 142–153.
3. Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," Proc. ACM/IEEE 43rd Ann. Int'l Symp. Computer Architecture (ISCA 16), 2016, pp. 367–379.
4. Y.-H. Chen et al., "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," Proc. IEEE Int'l Solid-State Circuits Conf. (ISSCC 16), 2016, pp. 262–263.
5. A. Krizhevsky, I. Sutskever, and G.E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Proc. 25th Int'l Conf. Neural Information Processing Systems (NIPS 12), 2012, pp. 1097–1105.
6. S. Chakradhar et al., "A Dynamically Configurable Coprocessor for Convolutional Neural Networks," Proc. 37th Ann. Int'l Symp. Computer Architecture (ISCA 10), 2010, pp. 247–257.
7. V. Gokhale et al., "A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks," Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshops (CVPRW 14), 2014, pp. 696–701.
8. Z. Du et al., "ShiDianNao: Shifting Vision Processing Closer to the Sensor," Proc. ACM/IEEE 42nd Ann. Int'l Symp. Computer Architecture (ISCA 15), 2015, pp. 92–104.
9. S. Gupta et al., "Deep Learning with Limited Numerical Precision," Proc. 32nd Int'l Conf. Machine Learning, vol. 37, 2015, pp. 1737–1746.
10. M. Peemen et al., "Memory-Centric Accelerator Design for Convolutional Neural Networks," Proc. IEEE 31st Int'l Conf. Computer Design (ICCD 13), 2013, pp. 13–19.
11. T. Chen et al., "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning," Proc. 19th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 14), 2014, pp. 269–284.
12. C. Zhang et al., "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," Proc. ACM/SIGDA Int'l Symp. Field-Programmable Gate Arrays (FPGA 15), 2015, pp. 161–170.

Yu-Hsin Chen is a PhD student in the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology. His research interests include energy-efficient multimedia systems, deep learning architectures, and computer vision. Chen received an MS in electrical engineering and computer science from the Massachusetts Institute of Technology. He is a student member of IEEE. Contact him at [email protected].

Joel Emer is a senior distinguished research scientist at Nvidia and a professor of electrical engineering and computer science at the Massachusetts Institute of Technology. His research interests include spatial and parallel architectures, performance modeling, reliability analysis, and memory hierarchies. Emer received a PhD in electrical engineering from the University of Illinois. He is a Fellow of IEEE. Contact him at [email protected].

Vivienne Sze is an assistant professor in the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology. Her research interests include energy-aware signal processing algorithms and low-power architecture and system design for multimedia applications, such as machine learning, computer vision, and video coding. Sze received a PhD in electrical engineering from the Massachusetts Institute of Technology. She is a senior member of IEEE. Contact her at [email protected].

THE MEMRISTIVE BOLTZMANN MACHINES

THE PROPOSED MEMRISTIVE BOLTZMANN MACHINE IS A MASSIVELY PARALLEL, MEMORY-CENTRIC HARDWARE ACCELERATOR BASED ON RECENTLY DEVELOPED RESISTIVE RAM (RRAM) TECHNOLOGY. THE PROPOSED ACCELERATOR EXPLOITS THE ELECTRICAL PROPERTIES OF RRAM TO REALIZE IN SITU, FINE-GRAINED PARALLEL COMPUTATION WITHIN MEMORY ARRAYS, THEREBY ELIMINATING THE NEED FOR EXCHANGING DATA BETWEEN THE MEMORY CELLS AND COMPUTATIONAL UNITS.

Mahdi Nazm Bojnordi, University of Utah
Engin Ipek, University of Rochester

Combinatorial optimization is a branch of discrete mathematics that is concerned with finding the optimum element of a finite or countably infinite set. An enormous number of critical problems in science and engineering can be cast within the combinatorial optimization framework, including classical problems such as traveling salesman, integer linear programming, knapsack, bin packing, and scheduling problems, as well as numerous optimization problems in machine learning and data mining. Because many of these problems are NP-hard, heuristic algorithms are commonly used to find approximate solutions for even moderately sized problem instances.

Simulated annealing is one of the most commonly used optimization algorithms. On many types of NP-hard problems, simulated annealing achieves better results than other heuristics; however, its convergence may be slow. This problem was first addressed by reformulating simulated annealing within the context of a massively parallel computational model called the Boltzmann machine.1 The Boltzmann machine is amenable to a massively parallel implementation in either software or hardware.2 With the growing interest in deep learning models that rely on Boltzmann machines for training (such as deep belief networks), the importance of high-performance Boltzmann machine implementations is increasing. Regrettably, the required all-to-all communication among the processing units limits these recent efforts' performance.

The memristive Boltzmann machine is a massively parallel, memory-centric hardware accelerator for the Boltzmann machine based on recently developed resistive RAM (RRAM) technology. RRAM is a memristive, nonvolatile memory technology that provides Flash-like density and DRAM-like read speed. The accelerator exploits the electrical properties of the bitlines and wordlines in a conventional single-level cell (SLC) RRAM array to realize in situ, fine-grained parallel computation, which eliminates the need for exchanging data among the memory arrays and computational units. The proposed hardware platform connects to a general-purpose system via the DDRx interface and can be selectively integrated with systems that run optimization workloads.

Computation within Memristive Arrays
The key idea behind the proposed memory-centric accelerator is to exploit the electrical properties of the storage cells and the interconnections among those cells to compute the dot product—the fundamental building block of the Boltzmann machine—in situ within the memory arrays. This novel capability of the proposed memristive arrays eliminates unnecessary latency, bandwidth, and energy overheads associated with streaming the data out of the memory arrays during computation.

Figure 1. Mapping a Max-Cut problem to the Boltzmann machine model. An example five-vertex undirected graph is mapped and partitioned using a five-node Boltzmann machine.

The Boltzmann Machine
The Boltzmann machine, proposed by Geoffrey Hinton and colleagues in 1983,2 is a well-known example of a stochastic neural network that can learn internal representations and solve combinatorial optimization problems. The Boltzmann machine is a fully connected network comprising two-state units. It employs simulated annealing for transitioning between the possible network states. The units flip their states on the basis of the current state of their neighbors and the corresponding edge weights to maximize a global consensus function, which is equivalent to minimizing the network energy.

Many combinatorial optimization problems, as well as machine learning tasks, can be mapped directly onto a Boltzmann machine by choosing the appropriate edge weights and the initial state of the units within the network. As a result of this mapping, each possible state of the network represents a candidate solution to the optimization problem, and minimizing the network energy becomes equivalent to solving the optimization problem. The energy minimization process is typically performed either by adjusting the edge weights (learning) or recomputing the unit states (searching and classifying). This process is repeated until convergence is reached.

The solution to an optimization problem can be found by reading—and appropriately interpreting—the network's final state. For example, Figure 1 depicts the mapping from an example graph with five vertices to a Boltzmann machine with five nodes. The Boltzmann machine is used to solve a Max-Cut problem. Given an undirected graph G with N nodes whose connection weights (dij) are represented by a symmetric weight matrix, the maximum cut problem is to find a subset S ⊆ {1, …, N} of the nodes that maximizes Σi,j dij, in which i ∈ S and j ∉ S. To solve the problem on a Boltzmann machine, a one-to-one mapping is established between the graph G and a Boltzmann machine with N processing units. The Boltzmann machine is configured as wjj = Σi dji and wji = –2dji. When the machine reaches its lowest energy (E(x) = –19), the state variables represent the optimum solution, in which a value of 1 at unit i indicates that the corresponding graphical node belongs to S.
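The short sketch below (ours, not the authors' code) makes the mapping concrete: a Max-Cut instance is converted to Boltzmann machine weights using the configuration above (wjj = Σi dji, wji = –2dji), and the resulting network energy equals minus the cut value of any 0/1 state assignment. The four-node graph is hypothetical; the article's five-vertex example reaches E(x) = –19 at its optimum.

```python
# Illustrative mapping of a Max-Cut instance onto Boltzmann machine weights.
# With w_jj = sum_i d_ji and w_ji = -2*d_ji (j != i), the network energy
# E(x) = -(sum_j w_jj*x_j + sum_{j<i} w_ji*x_j*x_i) equals minus the cut value.
# The graph below is a made-up example, not the one in Figure 1.

edges = {(0, 1): 3, (0, 2): 1, (1, 2): 4, (2, 3): 2}   # d_ij for an example graph
n = 4

w = [[0.0] * n for _ in range(n)]          # configure the Boltzmann machine
for (i, j), d in edges.items():
    w[i][j] = w[j][i] = -2.0 * d
    w[i][i] += d
    w[j][j] += d

def energy(x):
    """Network energy; minimizing it maximizes the cut."""
    e = -sum(w[j][j] * x[j] for j in range(n))
    e -= sum(w[j][i] * x[j] * x[i] for j in range(n) for i in range(j + 1, n))
    return e

def cut_value(x):
    return sum(d for (i, j), d in edges.items() if x[i] != x[j])

x = [1, 0, 1, 0]                      # candidate partition: x_i = 1 means node i is in S
print(energy(x), -cut_value(x))       # both evaluate to -9 for this assignment
```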

Figure 2. The key concept of in situ computation within memristive arrays. Current summation within every bitline is used to compute the result of a dot product.

In Situ Computation
The critical computation that the Boltzmann machine performs consists of multiplying a weight matrix W by a state vector x. Every entry of the symmetric matrix W (wji) records the weight between two units (j and i); every entry of the vector x (xi) stores the state of a single unit (i). Figure 2 depicts the fundamental concept behind the design of the memristive Boltzmann machine. The weights and the state variables are represented using memristors and transistors, respectively. A constant voltage supply (Vsupply) is connected to parallel memristors through a shared vertical bitline. The total current pulled from the voltage source represents the result of the computation. This current (Ij) is set to zero when xj is OFF; otherwise, the current is equal to the sum of the currents pulled by the individual cells connected to the bitline.

System Overview
Figure 3 shows an example of the proposed accelerator that resides on the memory bus and interfaces to a general-purpose computer system via the DDRx interface. This modular organization permits the system designers to selectively integrate the accelerator in systems that execute combinatorial optimization and machine learning workloads. The memristive Boltzmann machine comprises a hierarchy of data arrays connected via a configurable interconnection network. A controller implements the interface between the accelerator and the processor. The data arrays can store the weights (wji) and the state variables (xi); it is possible to compute the product of weights and state variables in situ within the data arrays. The interconnection network permits the accelerator to retrieve and sum these partial products to compute the final result.

Figure 3. System overview. The proposed accelerator can be selectively integrated in general-purpose computer systems.

Fundamental Building Blocks
The fundamental building blocks of the proposed memristive Boltzmann machine are storage elements, a current summation circuit, a reduction unit, and a consensus unit. The design of these hardware primitives must strike a careful balance among multiple goals: high memory density, low energy consumption, and in situ, fine-grained parallel computation.

Storage Elements
As Figure 4 shows, the proposed accelerator employs the conventional one-transistor, one-memristor (1T-1R) array to store the connection weights (the matrix W). The relevant state variables (the vector x) are kept close to the data arrays holding the weights. The memristive 1T-1R array is used for both storing the weights and computing the dot product between these weights and the state variables.

Figure 4. The proposed array structure. The conventional one-transistor, one-memristor (1T-1R) array structure is employed to build the proposed accelerator.

Current Summation Circuit
The result of a dot product computation is obtained by measuring the aggregate current pulled by the memory cells connected to a common bitline. Computing the sum of the bit products requires measuring the total amount of current per column and merging the partial results into a single sum of products. This is accomplished by local column sense amplifiers and a bit summation tree at the periphery of the data arrays.

Reduction Unit
To enable the processing of large matrices using multiple data arrays, an efficient data reduction unit is employed. The reduction units are used to build a reduction network, which sums the partial results as they are transferred from the data arrays to the controller. Large matrix columns are partitioned and stored in multiple data arrays, in which the partial sums are individually computed. The reduction network merges the partial results into a single sum. Multiple such networks are used to process the weight columns in parallel. The reduction tree comprises a hierarchy of bit-serial adders to strike a balance between throughput and area efficiency.

Figure 5 shows the proposed reduction mechanism. The column is partitioned into four segments, each of which is processed separately to produce a total of four partial results. The partial results are collected by a reduction network comprising three bimodal reduction elements. Each element is configured using a local latch that operates in one of two modes: forwarding or reduction.

Figure 5. The proposed reduction element. The reduction element can operate in forwarding or reduction mode.
Each reduction unit employs a full adder to compute the sum of the two inputs when operating in the reduction mode. In the forwarding mode, the unit is used for transferring the content of one input upstream to the root.
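The behavioral sketch below (ours, not the article's code) ties the two mechanisms together: each bitline sums the contributions of cells whose unit state is ON, producing one partial dot product in situ, and a tree of bimodal reduction elements then merges the partial results from several arrays. Conductance values and the four-segment split are arbitrary and only mirror the example of Figure 5.

```python
def bitline_partial_product(weights, states):
    """Current summation on one bitline: sum of w_ji for the active x_i."""
    return sum(w * x for w, x in zip(weights, states))

def reduce_tree(partials, modes):
    """Merge partial sums pairwise; each element forwards or adds (Figure 5)."""
    level = list(partials)
    for mode_row in modes:                        # one list of modes per tree level
        nxt = []
        for k in range(0, len(level), 2):
            a, b = level[k], level[k + 1]
            nxt.append(a + b if mode_row[k // 2] == "reduction" else a)
        level = nxt
    return level[0]

# A large matrix column split across four data arrays (four segments),
# each holding part of one weight column; x holds the unit states.
segments = [[2, 0, 5], [1, 3, 0], [4, 4, 1], [0, 2, 2]]
x = [1, 0, 1]
partials = [bitline_partial_product(seg, x) for seg in segments]
print(reduce_tree(partials, [["reduction", "reduction"], ["reduction"]]))  # 15
```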

Consensus Unit
The Boltzmann machine relies on a sigmoidal activation function, which plays a key role in both the model's optimization and machine learning applications. A precise implementation of the sigmoid function, however, would introduce unnecessary energy and performance overheads. The proposed memristive accelerator employs an approximation unit using logic gates and lookup tables to implement the consensus function. As Figure 6 shows, the table contains 64 precomputed sample points of the sigmoid function f(x) = 1/(1 + e–x), in which x varies between –4 and 4. The samples are evenly distributed on the x-axis. Six bits of a given fixed-point value are used to index the lookup table and retrieve a sample value. The most significant bits of the input data are ANDed and NORed to decide whether the input value is outside the domain [–4, 4]; if so, the sign bit is extended to implement f(x) = 0 or f(x) = 1; otherwise, the retrieved sample is chosen as the outcome.

Figure 6. The proposed unit for the activation function. A 64-entry lookup table is used for approximating the sigmoid function.

Figure 7. Hierarchical organization of a chip. A chip controller is employed to manage the multiple independent banks.

System Architecture
The proposed architecture for the memristive Boltzmann machine comprises multiple banks and a controller (see Figure 7). The banks operate independently and serve memory and computation requests in parallel. For example, column 0 can be multiplied by the vector x at bank 0 while any location of bank 1 is being read. Within each bank, a set of sub-banks is connected to a shared interconnection tree. The bank interconnect is equipped with reduction units to contribute to the dot product computation. In the reduction mode, all sub-banks actively produce the partial results, while the reduction tree selectively merges the results from a subset of the sub-banks. This capability is useful for computing the large matrix columns partitioned across multiple sub-banks. Each sub-bank comprises multiple mats, each of which is composed of a controller and multiple data arrays. The sub-bank tree transfers the data bits between the mats

and the bank tree in a bit-parallel fashion, The memristive Boltzmann machine is inter- thereby increasing the parallelism. faced to a single-core system via a single DDR3-1600 channel. We develop an Data Organization RRAM-based processing-in-memory (PIM) To amortize the peripheral circuitry’s cost, the baseline. The weights are stored within data data array’s columns and rows are time shared. arrays that are equipped with integer and Each sense amplifier is shared by four bitlines. binary multipliers to perform the dot prod- The array is vertically partitioned along the ucts. The proposed consensus units, optimi- bitlines into 16 stripes, multiples of which can zation and training controllers, and mapping be enabled per array computation. This allows algorithms are employed to accelerate the the software to keep a balance between the annealing and training processes. When com- accuracy of the computation and the perform- pared to existing computer systems and ance for a given application by quantizing GPU-based accelerators, the PIM baseline more bit products into a fixed number of bits. can achieve significantly higher performance and energy efficiency because it eliminates On-Chip Control the unnecessary data movement on the mem- The proposed hardware can accelerate opti- ory bus, exploits data parallelism throughout mization and deep learning tasks by appro- the chip, and transfers the data across the priately configuring the on-chip controller. chip using energy-efficient reduction trees. The controller configures the reduction trees, The PIM baseline is optimized so that it maps the data to the internal resources, occupies the same area as that of the memris- orchestrates the data movement among the tive accelerator. banks, performs annealing or training tasks, and interfaces to the external bus. Area, Delay, and Power Breakdown We model the data array, sensing circuits, DIMM Organization drivers, local array controller, and interconnect elements using Spice predictive technology To solve large-scale optimization and models4 of n-channel and p-channel metal- machine learning problems whose state oxide semiconductor transistors at 22 nm. spaces do not fit within a single chip, we can Thefulladders,latches,andcontrollogicare interconnect multiple accelerators on a synthesized using FreePDK5 at 45 nm. We DIMM. Each DIMM is equipped with con- first scale the results to 22 nm using scaling trol registers, data buffers, and a controller. parameters reported in prior work,6 and then This controller receives DDRx commands, scale them using the fan-out of 4 (FO4) data, and address bits from the external inter- parameters for International Technology Road- face and orchestrates computation among all map for Semiconductors low-standby-power of the chips on the DIMM. (LSTP) devices to model the impact of using a memory process on peripheral and global Software Support circuitry.7,8 We use McPAT9 to estimate the To make the proposed accelerator visible to processor power. software, we memory map its address range Figure 8 shows a breakdown of the compu- to a portion of the physical address space. A tational energy, leakage power, computational small fraction of the address space within latency, and die area among different hard- every chip is mapped to an internal RAM ware components. The sense amplifiers and array and is used to implement the data buf- interconnects are the major contributors to fers and configuration parameters. 
Software the dynamic energy (41 and 36 percent, configures the on-chip data layout and ini- respectively). The leakage is caused mainly by tiates the optimization by writing to a mem- the current summation circuits (40 percent) ory mapped control register. and other logic (59 percent), which includes thechargepumps,writedrivers,andcontrol- Evaluation Highlights lers. The computation latency, however, is We modify the SESC simulator3 to model a due mainly to the interconnects (49 percent), baseline eight-core out-of-order processor. the wordlines, and the bitlines (32 percent)...... 26 IEEE MICRO Notably, only a fraction of the memory arrays must be active during a computational opera- Others Interconnects Sense amplifiers Data arrays Peak energy (8.6 nJ) tion. A subset of the mats within each bank Leakage power (405 mW) performs current sensing of the bitlines; the Computational latency (6.59 ns) Die area (25.67 mm2) partial results are then serially streamed to the 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% controller on the interconnect wires. The experiments indicate that a fully utilized accel- Figure 8. Area, delay, and power breakdown. Peak energy, leakage erator (IC) consumes 1.3 W, power, computational latency, and die area are estimated at the 22-nm which is below the peak power rating of a technology node. standard DDR3 chip (1.4 W). Performance Figure 9 shows the performance on the Baseline Multithreaded kernel PIM Memristive accelerator 100 proposed accelerator, the PIM architecture, the multicore system running the multi- 10 threaded kernel, and the single-core system running the semidefinite programing (SDP) 1 and MaxWalkSAT kernels. The results are nor- Speedup over the single-threaded kernel single-threaded 0.1 malized to the single-threaded kernel running MS-1 MS-2 MS-3 MS-4 MS-5 MS-6 MS-7 MS-8 MS-9 MC-1 MC-2 MC-3 MC-4 MC-5 MC-6 MC-7 MC-8 MC-9 MS-10 on a single core. The results indicate that the MC-10 single-threaded kernel (Boltzmann machine) is Geomean faster than the baselines (SDP and MaxWalk- SAT heuristics) by an average of 38 percent. Figure 9. Performance on optimization. Speedup of various system The average performance gain for the mul- configurations over the single-threaded kernel. tithreaded kernel is limited to 6 percent, owing to significant state update overheads. PIM outperforms the single-threaded ker- Baseline Multithreaded kernel PIM Memristive accelerator nel by 9.31 times. The memristive accelera- 100 tor outperforms all of the baselines (57.75 10 times speedup over the single-threaded ker- nel and 6.19 times over PIM). Moreover, 1 the proposed accelerator performs the deep 0.1 single-threaded kernel single-threaded learning tasks 68.79 times faster than the Energy savings over the MS-1 MS-2 MS-3 MS-4 MS-5 MS-6 MS-7 MS-8 MS-9 MC-1 MC-2 MC-3 MC-4 MC-5 MC-6 MC-7 MC-8 MC-9 MS-10 single-threaded kernel and 6.89 times faster MC-10 Geomean than PIM. Energy Figure 10. Energy savings on optimization. Energy savings of various system configurations over the single-threaded kernel. Figure 10 shows the energy savings as com- pared to PIM, the multithreaded kernel, SDP, and MaxWalkSAT. On average, energy cycle-to-cycle and device-to-device variabil- is reduced by 25 times as compared to the ities. We evaluate the impact of cycle-to-cycle single-threaded kernel implementation, which variation on the computation’s outcome by is 5.2 times better than PIM. 
For the deep considering a bit error rate of 105 in all of learning tasks, the system energy is improved the simulations, along the lines of the analy- by 63 times, which is 5.3 times better than the sis provided in prior work.10 The proposed energy consumption of PIM. accelerator successfully tolerates such errors, with less than a 1-percent change in the out- Sensitivity to Process Variations come as compared to a perfect software Memristor parameters can deviate from their implementation. nominal values, owing to process variations The resistance of RRAM cells can fluctu- caused by line edge roughness, oxide thick- ate because of the device-to-device variation, ness fluctuation, and random discrete dop- which can impact the outcome of a column ing. These parameter deviations result in summation—that is, a partial dot product...... MAY/JUNE 2017 27 ...... TOP PICKS

We use the geometric model of memri- Emerging large-scale applications such as stance variation proposed by Miao Hu and combinatorial optimization and deep learn- colleagues11 to conduct Monte Carlo simu- ing tasks are even more influenced by mem- lations for 1 million columns, each com- ory bandwidth and power problems. In these prising 32 cells. The experiment yields two applications, massive datasets have to be iter- distributions for low resistance (RLO)and atively accessed by the processor cores to high resistance (RHI) samples that are then achieve a desirable output quality, which approximated by normal distributions with consumes excessive memory bandwidth and respective standard deviations of 2.16 and system energy. To address this problem, 2.94 percent (similar to the prior work by numerous software and hardware optimiza- Hu and colleagues). We then find a bit pat- tions using GPUs, clusters based on message tern that results in the largest summation passing interface (MPI), field-programmable error for each column. We observe up to gate arrays, and application-specific inte- 2.6 106 deviation in the column con- grated circuits have been proposed in the ductance, which can result in up to 1 bit literature. These proposals focus on energy- error per summation. Subsequent simula- efficient computing with reduced data move- tion results indicate that the accelerator can ment among the processor cores and memory tolerate this error, with less than a 2 percent arrays. These proposals’ performance and change in the outcome quality. energy efficiency are limited by read accesses that are necessary to move the operands from Finite Switching Endurance the memory arrays to the processing units. A RRAM cells exhibit finite switching endur- memory subsystem that allows for in situ ance ranging from 1e6 to 1e12 writes. We computation within its data arrays could evaluate the impact of finite endurance on an address these limitations by eliminating the accelerator module’s lifetime. Because wear is need to move raw data between the memory induced only by the updating of the weights arrays and the processor cores. stored in memristors, we track the number of Designing a platform capable of perform- times that each weight is written. The edge ing in situ computation is a significant chal- weights are written once in optimization lenge. In addition to storage cells, extra problems and multiple times in deep learning circuits are required to perform analog com- workloads. (Updating the state variables, putation within the memory cells, which stored in static CMOS latches, does not decreases memory density and area efficiency. induce wear on RRAM.) We track the total Moreover, power dissipation and area con- number of updates per second to estimate sumption of the required components for the lifetime of an eight-chip DIMM. Assum- signal conversion between analog and digital ing endurance parameters of 1e6 and 1e8 domains could become serious limiting fac- writes, the respective module lifetimes are 3.7 tors. Hence, it is critical to strike a careful bal- and 376 years for optimization and 1.5 and ance between the accelerator’s performance 151 years for deep learning. and complexity. The memristive Boltzmann machine is ata movement between memory cells the first memory-centric accelerator that D and processor cores is the primary con- addresses these challenges. It provides a new tributor to power dissipation in computer framework for designing memory-centric systems. 
A recent report by the US Depart- accelerators. Large scale combinatorial opti- ment of Energy identifies the power con- mization problems and deep learning tasks sumed in moving data between the memory are mapped onto a memory-centric, non- and the processor as one of the 10 most sig- Von Neumann computing substrate and nificant challenges in the exascale computing solved in situ within the memory cells, with era.12 The same report indicates that by orders of magnitude greater performance and 2020, the energy cost of moving data across energy efficiency than contemporary super- the memory hierarchy will be orders of mag- computers. Unlike PIM-based accelerators, nitude higher than the cost of performing a the proposed accelerator enables computation double-precision floating-point operation. within conventional data arrays to achieve the ...... 28 IEEE MICRO energy-efficient and massively parallel proc- 9. S. Li et al., “McPAT: An Integrated Power, essing required for the Boltzmann machine Area, and Timing Modeling Framework for model. Multicore and Manycore Architectures,” We expect the proposed memory-centric Proc. 36th Int’l Symp. Computer Architec- accelerator to set off a new line of research on ture (ISCA), 2009, pp. 468–480. in situ approaches to accelerate large-scale 10. D. Niu et al., “Impact of Process Variations problems such as combinatorial optimization on Emerging Memristor,” Proc. 47th ACM/ and deep learning tasks and to significantly IEEE Design Automation Conf. (DAC), 2010, increase the performance and energy effi- pp. 877–882. MICRO ciency of future computer systems. 11. M. Hu et al., “Geometry Variations Analysis of Tio 2 Thin-Film and Spintronic Mem- Acknowledgments ristors,” Proc. 16th Asia and South Pacific This work was supported in part by NSF Design Automation Conf., 2011, pp. 25–30. grant CCF-1533762. 12. The Top Ten Exascale Research Challenges, tech. report, Advanced Scientific Comput- ...... ing Advisory Committee Subcommittee, References Dept. of Energy, 2014. 1. E. Aarts and J. Korst, Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Mahdi Nazm Bojnordi is an assistant pro- Neural Computing, John Wiley & Sons, fessor in the School of Computing at the 1989. University of Utah. His research focuses on computer architecture, with an emphasis on 2. S.E. Fahlman, G.E. Hinton, and T.J. Sejnow- energy-efficient computing. Nazm Bojnordi ski, “Massively Parallel Architectures for AI: received a PhD in electrical engineering NETL, Thistle, and Boltzmann Machines,” from the University of Rochester. Contact Proc. Assoc. Advancement of AI (AAAI), him at [email protected]. 1983, pp. 109–113. 3. J. Renau et al., “SESC Simulator,” Jan. Engin Ipek is an associate professor in the 2005; http://sesc.sourceforge.net. Department of Electrical and Computer 4. W. Zhao and Y. Cao, “New Generation of Engineering and the Department of Com- Predictive Technology Model for Sub-45nm puter Science at the University of Rochester. Design Exploration,” Proc. Int’l Symp. Qual- His research interests include energy-efficient ity Electronic Design, 2006, pp. 585–590. architectures, high-performance memory sys- 5. “FreePDK 45nm Open-Access Based PDK tems, and the application of emerging tech- for the 45nm Technology Node,” 29 May nologies to computer systems. Ipek received 2014; www.eda.ncsu.edu/wiki/FreePDK. a PhD in electrical and computer engineer- ing from Cornell University. He has received 6. M.N. Bojnordi and E. 
Ipek, “Pardis: A Pro- the 2014 IEEE Computer Society TCCA grammable Memory Controller for the Young Computer Architect Award, two DDRX Interfacing Standards,” Proc. 39th IEEE Micro Top Picks awards, and an NSF Ann. Int’l Symp. Computer Architecture CAREER award. Contact him at ipek@ece (ISCA), 2012 pp. 13–24. .rochester.edu. 7. N.K. Choudhary et al., “Fabscalar: Compos- ing Synthesizable RTL Designs of Arbitrary Cores Within a Canonical Superscalar Template,” Proc. 38th Ann. Int’l Symp. Com- puter Architecture, 2011, pp. 11–22. 8. S. Thoziyoor et al., “A Comprehensive Mem- ory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hier- Read your subscriptions through archies,” Proc. 35th Int’l Symp. Computer the myCS publications portal at http://mycs.computer.org. Architecture (ISCA), 2008, pp. 51–62...... MAY/JUNE 2017 29 ...... ANALOG COMPUTING IN A MODERN CONTEXT:ALINEAR ALGEBRA ACCELERATOR CASE STUDY


THIS ARTICLE PRESENTS A PROGRAMMABLE ANALOG ACCELERATOR FOR SOLVING SYSTEMS OF LINEAR EQUATIONS. THE AUTHORS COMPENSATE FOR COMMONLY PERCEIVED DOWNSIDES OF ANALOG COMPUTING. THEY COMPARE THE ANALOG SOLVER'S PERFORMANCE AND ENERGY CONSUMPTION AGAINST AN EFFICIENT DIGITAL ALGORITHM RUNNING ON A GENERAL-PURPOSE PROCESSOR. FINALLY, THEY CONCLUDE THAT PROBLEM CLASSES OUTSIDE OF SYSTEMS OF LINEAR EQUATIONS COULD HOLD MORE PROMISE FOR ANALOG ACCELERATION.

Yipeng Huang, Ning Guo, Mingoo Seok, Yannis Tsividis, and Simha Sethumadhavan, Columbia University

As we approach the limits of silicon scaling, it behooves us to reexamine fundamental assumptions of modern computing, even well-served ones, to see if they are hindering performance and efficiency. An analog accelerator discussed in this article breaks two fundamental assumptions in modern computing: in contrast to using digital binary numbers, an analog accelerator encodes numbers using the full range of circuit voltage and current. Furthermore, in contrast to operating step by step on clocked hardware, an analog accelerator updates its values continuously. These different hardware assumptions can provide substantial gains but would need different abstractions and cross-layer optimizations to support various modern workloads. We draw inspiration from an immense amount of prior work in analog electronic computing (see the sidebar, "Related Work in Analog Computing").

To support modern workloads in the digital era, we observed that modern scientific computing and big data problems are converted to linear algebra problems. To maximize analog acceleration's usefulness, we explored whether analog accelerators are effective at solving systems of linear equations, the single most important numerical primitive in continuous mathematics.

For readers not familiar with linear algebra, systems of linear equations are often solved using iterative numerical linear algebra methods, which start with an initial guess for the entire solution vector and update the solution vector over iterations of the algorithm, each step further minimizing the difference between the guess and the correct solution.1
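As a point of reference for the discussion that follows, here is a minimal sketch (ours, not code from the article) of the kind of discrete-time iterative solver described above: start from a guess, repeatedly step along the residual of Ax = b, and pay for an explicit step-size computation each iteration. The small matrix and vector are arbitrary example values.

```python
# Minimal steepest-descent solver for Ax = b with an explicit step size,
# illustrating the discrete-time iterative methods described above.
# A is assumed symmetric positive definite; the numbers are arbitrary.

def solve_gradient_descent(A, b, x0, iters=200):
    n = len(b)
    x = list(x0)
    for _ in range(iters):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]  # residual
        Ar = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
        denom = sum(r[i] * Ar[i] for i in range(n))
        if denom == 0:
            break
        alpha = sum(r[i] * r[i] for i in range(n)) / denom   # per-step step size
        x = [x[i] + alpha * r[i] for i in range(n)]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
print(solve_gradient_descent(A, b, [0.0, 0.0]))   # approaches [1/11, 7/11]
```

Computing the step size is itself a matrix-vector operation, which is why, as discussed later in the article, avoiding it is one of the attractions of the continuous-time analog approach.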

30 Published by the IEEE Computer Society 0272-1732/17/$33.00 c 2017 IEEE Related Work in Analog Computing Analog computers of the mid-20th century were widely used to required. The analog acceleration technique presented in this solve scientific computing problems, described as ordinary differ- article is a procedural approach to solving problems: there is a ential equations (ODEs). Analog computers would solve those predefined way to convert a system of linear equations under ODEs by setting up analog electronic circuits, whose time-depend- study into an analog accelerator configuration. ent voltage and current were described by corresponding ODEs. The analog computers therefore were computational analogies of physical models. References Our group revisited this model of analog computing for solving 1. G. Cowan, R. Melville, and Y. Tsividis, “A VLSI Analog nonlinear ODEs, which frequently appear in cyber-physical sys- Computer/Digital Computer Accelerator,” IEEE J. Solid- tems workloads, with higher performance and efficiency com- State Circuits, vol. 41, no. 1, 2006, pp. 42–53. pared to digital systems.1,2 The analog, continuous-time output of analog computing is especially suited for embedded systems 2. N. Guo et al., “Energy-Efficient Hybrid Analog/Digital applications in which sensor inputs are analog and actuators can Approximate Computation in Continuous Time,” IEEE J. use such results directly. The question for this article is whether Solid-State Circuits, vol. 51, no. 7, 2016, pp. 1514–1524. analog acceleration can help conventional workloads in which 3. M.T. Chu, “On the Continuous Realization of Iterative inputs and outputs are digital. Processes,” SIAM Rev., vol. 30, no. 3, 1988, pp. Modern scientific computation and big data workloads are 375–387. phrased as linear algebra problems. In this article, our analog 4. O. Bournez and M.L. Campagnolo, A Survey on Continu- accelerator solves an ODE that does steepest descent, in turn ous Time Computations, Springer, 2008, pp. 383–423. solving a linear algebra problem. Such a solving method belongs 5. R. LiKamWa et al., “RedEye: Analog ConvNet Image Sen- to a broad class of ODEs that can solve other numerical problems, sor Architecture for Continuous Mobile Vision,” including nonlinear systems of equations.3,4 These ODEs point to SIGARCH Computer Architecture News, vol. 44, no. 3, other ways analog accelerators can support modern workloads. 2016, pp. 255–266. We draw a distinction between our approach to analog accel- eration and that of using analog circuits to build neural net- 6. A. Shafiee et al., “ISAAC: A Convolutional Neural Net- works.5,6 Most importantly, we do not use training to get a work Accelerator with In-Situ Analog Arithmetic in Cross- network topology and weights that solve a given problem. No bars,” SIGARCH Computer Architecture News, vol. 44, prior knowledge of the solution or training set of solutions is no. 3, 2016, pp. 14–26.

Efficient iterative methods such as the conju- itely many iterations. This continuous trajec- gate gradient method are increasingly impor- tory from the original guess vector to the tant because intermediate guess vectors are a correct solution is an ordinary differential good approximation of the correct solution. equation (ODE), which states that the change In discrete-time-iterative linear algebra in a set of variables is a function of the varia- algorithms, the solution vector changes in bles’ present value. We can naturally solve steps, and each step is characterized by a step ODEs using an analog accelerator. size. The step size affects the algorithm’s effi- We give an example of an analog accelera- ciency and requires many processor cycles to tor solving an ODE that in turn solves a calculate. In the conjugate gradient method, system of linear equations. At the analog for example, the step size is calculated from accelerator’s heart are integrators, which con- previous step sizes and the gradient magni- tain the present guess of the solution vector tude, and this calculation takes up half of the represented as an analog signal evolving as a multiplication operations in each conjugate function of time (see Figure 1). We perform gradient step. operations on this solution vector by feeding In an analog accelerator, systems of linear the vector through multiplier and summation equations can also be thought of as solved units. Digital-to-analog converters (DACs) via an iterative algorithm, with an important provide constant coefficients and biases. distinction that the guess vector is updated Using these function units, we create a lin- using infinitesimally small steps, over infin- ear function of the solution vector, which is ...... MAY/JUNE 2017 31 ...... TOP PICKS

Explicit Data-Graph Execution Architecture The analog accelerator uses an explicit data- –a00 dx 0 x (t) flow graph in which the sequence of opera- b0 dt 0 DAC ADC tions on data is realized by connecting functional units end to end. During compu- –a10 tation, analog signals representing intermedi- ate results flow from one unit to the next, so –a 01 there are no overheads in fetching and decod- dx1 b1 dt x1(t) ing instructions, and there are no accesses to DAC ADC digital memory. The former is a benefit of

–a11 digital accelerators, too, but the latter is a unique benefit of the analog computational model. Figure 1. Schematic of an analog accelerator for solving Ax 5 b, a linear Continuous Time Speed system of two equations with two The analog accelerator hardware and algo- unknown variables. Matrix A is a known rithm both operate in continuous time. The matrix of coefficients realized using values contained in the integrators are contin- multipliers; x is an unknown vector uously being updated, and the update rate is contained in integrators; b is a known vector not limited by a finite clock frequency, which of biases generated by digital-to-analog is the limiting factor in discrete-time hard- converters (DACs). Signals are encoded as ware. Furthermore, a continuous-time ODE analog current and are copied using current solution has no concern about the correct mirror fan-out blocks. The solver converges step size to take to update the solution vec- if matrix A is positive definite, which is tor, in contrast to discrete-time iterative algo- usually true for the problems we discuss. rithms, in which computing the correct step size represents most operations needed per algorithm iteration. In these regards, the fed back to the inputs of the integrators. In analog accelerator is potentially faster than this fully formed circuit, the solution vector’s discrete-time architectures. Finally, no power- time derivative is a linear function of the sol- hungry clock signal is needed to synchronize ution vector itself. operations. The integrators are charged to an initial condition representing the iterative method’s Continuous Value Efficiency initial guess. The accelerator starts computa- The analog accelerator solves the system of tion by releasing the integrator, allowing its linear equations using real numbers encoded output to deviate from its initial value. The in voltage and current, so each wire can rep- variables contained in the integrators con- resent the full range of values in the analog verge on the correct solution vector that satis- accelerator. In contrast, changing the value of fies the system of linear equations. When the a digital binary number affects many bits: analog variables are steady, we sample them sweeping an 8-bit unsigned integer from 0 to using analog-to-digital converters (ADCs). 255 needs 502 binary inversions, whereas a These techniques were used in early ana- more economical Gray encoding still needs 2–4 log computers and have recently been 255 inversions. Furthermore, multiplication, explored in small-scale experiments with ana- addition, and integration are all comparatively 5,6 log computation. straightforward on analog variables compared to digital ones. This contrasts with floating- Analog Linear Algebra Advantages point arithmetic, in which the logarithmically Solving linear algebra problems using ODEs encoded exponent portion of digital floating- on an analog accelerator has several potential point variables makes it complicated to add advantages compared to using a discrete- and subtract variables. In these regards, analog time algorithm on a digital general-purpose encoding is potentially more efficient than or special-purpose system. digital, binary encodings...... 32 IEEE MICRO Table 1. Analog accelerator instruction set architecture.

Instruction type | Instruction | Parameters | Instruction effect
Control | Initialize | — | Find input and output offset and gain calibration settings for all function units.
Configuration | Set connection | Source, destination | Set a crossbar switch to create an analog current connection between two analog interfaces.
Configuration | Set initial condition | Pointer to an integrator, initial condition value | Charge integrator capacitors to have ODE initial condition value.
Configuration | Set multiplier gain | Pointer to a multiplier, gain value | Amplify values by constant coefficient gain.
Configuration | Set constant offset | Pointer to a DAC, offset value | Add a constant bias to values.
Configuration | Set timeout time | Number of digital controller clock cycles | Stop analog computation after specified time once started.
Configuration | Configuration commit | — | Finish configuration and write any new changes to chip registers.
Control | Execution start | — | Start analog computation by letting integrators deviate from initial conditions.
Control | Execution stop | — | Stop analog computation by holding integrators at their present value.
Data input | Enable analog input | Pointer to chip analog input channel | Open chip analog input channel, allowing multiple chips to participate in computation.
Data output | Read analog value | Pointer to an ADC, memory location to store result | Read analog computation results from ADCs and store values.
Exception | Read exceptions | Memory location to store result | Read the exception vector indicating whether analog units exceeded their valid range.

*ADC: analog-to-digital converter; DAC: digital-to-analog converter; ODE: ordinary differential equation.
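The configuration flow in Table 1 (set gains, offsets, and initial conditions; commit; start execution; read ADCs) sets up the circuit of Figure 1 so that the integrators evolve dx/dt = b − Ax. The sketch below (ours) only simulates that continuous behavior numerically to show what the configured hardware settles to; the explicit time step exists solely for the digital simulation and is not a property of the accelerator.

```python
# Numerical sketch of the continuous-time dynamics the configured analog
# accelerator realizes for Ax = b: the integrators evolve dx/dt = b - A x
# and settle at the solution when A is positive definite (as noted for
# Figure 1). A, b, dt, and the step count are arbitrary example values.

A = [[4.0, 1.0], [1.0, 3.0]]      # known, positive-definite coefficients
b = [1.0, 2.0]                    # known biases

x = [0.0, 0.0]                    # integrator initial conditions
dt = 1e-3                         # simulation step only; hardware is continuous
for _ in range(20_000):
    dx = [b[i] - sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
    x = [x[i] + dt * dx[i] for i in range(2)]

print(x)   # close to the exact solution [1/11, 7/11]
```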

Analog Accelerator Architecture
The analog accelerator acts as a peripheral to a digital host processor. The analog accelerator interface accepts an accelerator configuration, which entails the connectivity between function units, multiplier gains, DAC constants, and integrator initial conditions. Additionally, the interface allows calibration, computation control, reading of output values, and reporting of exceptions. Table 1 summarizes the analog accelerator's essential system calls and corresponding instructions.

Analog Accelerator Physical Prototype
We tested analog acceleration for linear algebra using a prototype reconfigurable analog accelerator silicon chip implemented in 65-nm CMOS technology (see Figures 2 and 3). The accelerator comprises four integrators, plus accompanying DACs, multipliers, and ADCs connected using crossbar switches. In our analog accelerator, electrical currents represent variables. Fan-out current mirrors allow the analog circuit to copy variables by replicating values onto different branches. To sum variables, currents are added together by joining branches. Eight multipliers allow variable-variable and constant-variable multiplication.

The physical prototype validates the analog circuits' functionality and allows physical measurement of component area and energy. Additionally, the chip allows rapid prototyping of accelerator algorithms.

Using physical timing, power, and area measurements recorded by Ning Guo and colleagues7 and summarized in Table 2, we built a model that predicts the properties of larger-scale analog accelerators. In Table 2, "analog core power" and "analog core area" show the power and area of each block that forms the analog signal path. The noncore transistors and nets not involved in analog computation include calibration and testing circuits and registers. The core area and power scale up and down for different analog bandwidth designs. We explore how different bandwidth choices influence analog accelerator performance and efficiency.

Mitigation of Analog Linear Algebra Disadvantages
We encountered several drawbacks of analog computing, including limited accuracy, precision, and scalability. We tackled each of these problems in the context of solving linear algebra, although the techniques we discuss apply to other styles of analog computer architecture.

Improve Accuracy Using Calibration and Analog Exceptions
Analog circuits provide limited accuracy compared to binary ones, in which values are unambiguously interpreted as 0 or 1. Analog hardware uses the full range of values. Subtle variations in analog hardware due to process and temperature variation lead to undesirable variations in the computation result.

We identify three main sources of inaccuracy in analog hardware: gain error, offset error, and nonlinearity. Gain and offset errors refer to inaccurate results in multiplication and summation, which can be calibrated away using additional DACs that adjust circuit parameters to shift signals and adjust gains. These DACs are controlled by registers, whose contents are set using binary search during calibration by the digital host. The settings vary across different copies of functional units and accelerator chips, but remain constant during accelerator operation.

Figure 2. Analog accelerator architecture diagram, showing rows of analog, mixed-signal, and digital components, along with crossbar interconnect.7 "CT" refers to continuous time. Static RAMs (SRAMs) are used as lookup tables for nonlinear functions (not used for the purposes of this article).

Nonlinearity errors occur when changes in inputs result in disproportionate changes in outputs, and when analog values exceed the range in which the circuit's behavior is mostly linear, resulting in clipping of the output, akin to overflow of digital number representations. At the same time, the host observes if the dynamic range is not fully used, which could result in low precision. When either exception type occurs, the original problem is rescaled to fit in the dynamic range of the analog accelerator, and computation is reattempted.

Figure 3. Die photo of an analog accelerator chip fabricated in 65-nm CMOS technology, showing major components.7 "VGAs" are variable-gain amplifiers. The die area is 3.8 mm2.

The combination of widespread calibration and exception checking ensures that the analog solution's accuracy is within the sampling resolution of ADCs.
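As a concrete illustration of the calibration and exception handling just described, the sketch below shows a host-side binary search that trims one offset-correction DAC setting, and an outer loop that rescales the problem when a range exception is raised. The routines `write_dac_register`, `measure_offset_error`, and `run_problem` are hypothetical stand-ins for the measurement and execution paths, not the authors' code.

```python
def calibrate_offset(write_dac_register, measure_offset_error, bits=8):
    """Binary-search the offset-correction DAC code, as described in the text:
    settings are found once by the digital host and then held constant."""
    lo, hi = 0, (1 << bits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        write_dac_register(mid)
        if measure_offset_error() > 0:   # output still biased high: lower the code
            hi = mid
        else:
            lo = mid + 1
    return lo

def solve_with_rescaling(run_problem, scale=1.0, max_attempts=8):
    """Rescale the problem when the accelerator reports clipping (overflow-like)
    or an underused dynamic range (low precision), then retry."""
    for _ in range(max_attempts):
        result, exceptions = run_problem(scale)
        if "clipped" in exceptions:
            scale *= 0.5          # shrink variables to fit the linear range
        elif "underrange" in exceptions:
            scale *= 2.0          # stretch variables to use the full range
        else:
            return result, scale
    raise RuntimeError("could not fit problem into the analog dynamic range")
```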

Table 2. Summary of analog accelerator components.

Unit type | Analog core power | Total unit power | Analog core area | Total unit area
Integrator | 22 µW | 28 µW | 0.016 mm2 | 0.040 mm2
Fan-out | 30 µW | 37 µW | 0.005 mm2 | 0.015 mm2
Multiplier | 39 µW | 49 µW | 0.024 mm2 | 0.050 mm2
ADC | 27 µW | 54 µW | 0.049 mm2 | 0.054 mm2
DAC | 4.6 µW | 4.6 µW | 0.013 mm2 | 0.022 mm2

Improve Sampling Precision by Focusing on Analog Steady State
High-frequency and high-precision analog-to-digital conversion is costly. So, instead of trying to capture the time-dependent analog waveform, we use the analog accelerator as a linear algebra solver by solving a convergent ODE. When the analog accelerator outputs are steady, we can sample the solutions once with higher-precision ADCs.

Even then, high-precision ADCs still fall short of the precision in floating-point numbers. Even though the analog variables are themselves highly precise, sampling the variables using ADCs can result in only 8 to 12 bits of precision. We get higher-precision results by running the analog accelerator multiple times. We use the digital host computer to find the residual error in the solution, and we set up the analog accelerator to solve a new problem, focusing on the residual. Each problem has smaller-magnitude variables than previous runs, which lets us scale up the variables to fit the dynamic range of the analog hardware. We can iterate between analog and digital hardware a few times to get a more precise result than using the analog hardware alone.
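The precision-refinement procedure described above (solve, measure the residual digitally, and re-solve a rescaled residual problem) is essentially mixed-precision iterative refinement. A minimal NumPy sketch follows; `analog_solve` stands in for one low-precision accelerator run and simply quantizes an exact solve to the stated 8 to 12 bits, which is an assumption for illustration rather than a model of the real hardware.

```python
import numpy as np

def analog_solve(A, b, bits=10):
    """Stand-in for one accelerator run: exact solve quantized to ~'bits' of precision."""
    x = np.linalg.solve(A, b)
    scale = float(np.max(np.abs(x))) or 1.0
    step = scale / (2 ** (bits - 1))
    return np.round(x / step) * step

def refine_with_host(A, b, iterations=4, bits=10):
    """Alternate analog solves with digital residual computation to gain precision."""
    x = np.zeros_like(b, dtype=float)
    for _ in range(iterations):
        r = b - A @ x                 # residual computed on the digital host
        if np.linalg.norm(r) == 0.0:
            break
        d = analog_solve(A, r, bits)  # correction solved at low precision
        x = x + d                     # each correction has smaller magnitude
    return x
```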

Tackle Larger Problems by Accelerating Sparse Linear Algebra Subproblems
Modern workloads routinely need thousands of variables, corresponding to as many analog integrators in the accelerator, exceeding the area constraints of realistic analog accelerators. Furthermore, the analog datapath is fixed during continuous time operation, so there is no way to dynamically load variables from and store variables to main memory.

Analog accelerators can solve large-scale sparse linear algebra problems by accelerating the solving of smaller subproblems. This lets analog accelerators solve problems containing more variables than the number of integrators in the analog accelerator.

In such a scheme, the analog accelerator finds the correct solution for a subproblem. To get overall convergence across the entire problem, the set of subproblems would be solved several times, using an outer loop iterating across the subproblems. Typically, the larger iteration is an iterative method operating on vectors, which do not have as strong convergence properties as iterative methods do on individual numbers. Therefore, it is still desirable to ensure the block matrices captured in the analog accelerator are large, so that more of the problem is solved using the efficient lower-level solver.

Evaluation
We compare the analog accelerator and digital approaches in terms of performance, hardware area, and energy consumption, while varying the number of problem variables and the choice of analog accelerator component bandwidth, a measure of how quickly the analog circuit responds to changes.

Analog Bandwidth Model
The prototype chip has a relatively low analog bandwidth of 20 KHz, a design that ensures that the prototype chip accurately solves for time-dependent solutions in ODEs. However, the prototype's small bandwidth makes it unrepresentative of an analog accelerator designed to solve time-independent algebraic equations, in which accuracy degradation in time-dependent behavior has no impact on the final steady state output. We scale up the model's bandwidth, within reason, up to 1.3 MHz.

Increasing the bandwidth of the analog circuit design proportionally decreases the solution time, but also increases area and energy consumption. As Figures 4 and 5 show, we assume an analog accelerator with bandwidth multiplied by a factor of a has higher power and area consumption in the core analog circuits, by a factor of a.

The projected analog power figures are significantly below the thermal design power of clocked digital designs of equal area. Even in the designs that fill a 600 mm2 die size, the analog accelerator uses about 0.7 W in the base prototype design and about 1.0 W in the design with 320 KHz of bandwidth.

Figure 4. Power versus analog accelerator size for various bandwidth choices. We observe that analog circuits operate faster when the internal node voltages representing variables change more quickly. We hold the capacitance fixed to the capacitance of the prototype's design, and use larger currents that draw more power to charge and discharge the node capacitances in the signal paths carrying variables.

Figure 5. Area versus analog accelerator size for various bandwidth choices. We observe that the transistor aspect ratio W/L must increase to increase the current, and therefore bandwidth, of the design. L is kept at a minimum dictated by the technology node, leaving bandwidth to be linearly dependent on W. Thus, we estimate area increasing linearly with bandwidth.
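A small sketch of the scaling model described above: core analog power and area are taken to grow linearly with the bandwidth multiplier relative to the 20-KHz prototype, using the per-unit figures from Table 2. The component mix per problem variable used here is an illustrative assumption, not the paper's exact provisioning.

```python
# Per-unit figures from Table 2 (analog core power in microwatts, area in mm^2).
CORE_POWER_UW = {"integrator": 22, "fanout": 30, "multiplier": 39, "adc": 27, "dac": 4.6}
CORE_AREA_MM2 = {"integrator": 0.016, "fanout": 0.005, "multiplier": 0.024, "adc": 0.049, "dac": 0.013}

def projected_core_power_and_area(n_variables, bandwidth_hz, proto_bandwidth_hz=20e3,
                                  units_per_variable=None):
    """Scale core power and area linearly with the bandwidth multiplier."""
    if units_per_variable is None:
        # Assumed mix: one of each component per problem variable (illustrative only).
        units_per_variable = {name: 1 for name in CORE_POWER_UW}
    a = bandwidth_hz / proto_bandwidth_hz
    power_w = sum(CORE_POWER_UW[u] * n for u, n in units_per_variable.items()) * n_variables * 1e-6 * a
    area_mm2 = sum(CORE_AREA_MM2[u] * n for u, n in units_per_variable.items()) * n_variables * a
    return power_w, area_mm2

# Example: a 400-variable design with 320 KHz of component bandwidth.
print(projected_core_power_and_area(400, 320e3))
```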

Sparse Linear Algebra Case Study
We use as our test case a sparse system of linear equations derived from a multigrid elliptic partial differential equation (PDE) solver. In multigrid PDE solvers, the overall PDE is converted to several linear algebra problems with varying spatial resolution. Lower-resolution subproblems are quickly solved and fed to high-resolution subproblems, aiding the high-resolution problem to converge faster. The linear algebra subproblems can be solved approximately. Overall accuracy of the solution is guaranteed by iterating the multigrid algorithm. Because perfect convergence is not required, less stable, inaccurate, and low-precision techniques, such as analog acceleration, can support multigrid.

In our case, we compare the analog accelerator designs to a conjugate gradient algorithm running on a CPU, solving to equal (relatively low) solution precision, equivalent to the precision obtained from one run of the analog accelerator equipped with high-resolution ADCs. On the digital side, the numerical iteration stops short of the machine precision provided by high-precision digital floating-point numbers.

The conjugate gradient algorithm uses a sustained 20 clock cycles per numerical iteration per row element. The comparison assumes identical transfer cost of data from main memory to the accelerator versus the CPU: the energy needed to transfer data to and from memory is not modeled, due to the relatively small problem sizes, allowing the program data to be entirely cache resident.

As Figure 6 shows, we found that an optimal analog accelerator design that balances performance and the number of integrators should have components with an analog bandwidth of approximately 320 KHz. With our bandwidth model, high-bandwidth analog computers come with high area cost, quickly reaching the area of the largest CPU or GPU dies. On performance and energy metrics, we find that, with 400 integrators operating at 320 KHz of analog bandwidth, analog acceleration can potentially have a 10-times faster solution time; using our analog bandwidth model for power, this design corresponds to 33 percent lower energy consumption compared to a digital general-purpose processor.

Figure 6. Comparison of time taken to converge to equivalent precision, for high-bandwidth analog accelerators and a digital CPU. The time needed to converge is plotted against the linear algebra problem vector size. We give the projected solution time for 80-KHz, 320-KHz, and 1.3-MHz analog accelerator designs. The high-bandwidth designs have increasing area cost. In this plot, the 320-KHz and 1.3-MHz designs hit the size of 600 mm2, the size of the largest GPUs, so the projections are cut short. The convergence time for digital is the software runtime on a single CPU core.

We recognize that the performance increases and energy savings are not as drastic as one expects when using a domain-specific accelerator built on a fundamentally different computing model than digital, synchronous computing. The reason for this shortfall is twofold. First, the high area cost of high-bandwidth analog components limits the problem sizes that can fit in the accelerator, and therefore limits the analog performance advantage. Second, the extreme importance of linear algebra problems has also led to intense research in optimal algorithms and hardware support. Although discrete-time operation has drawbacks, it permits algorithms to intelligently select a step size, which has advantages in solving systems of linear equations. Both the analog and digital solvers perform iterative numerical algorithms, but the digital program runs the conjugate gradient method, the most efficient and sophisticated of the classical iterative algorithms. In the conjugate gradient method, each step size is chosen considering the gradient magnitude of the present point, along with the history of step sizes. With these additional calculations, the conjugate gradient method avoids taking redundant steps, accelerating toward the answer when the error is large and slowing when close to convergence.

In contrast, the analog accelerator has fewer iterative algorithms it can carry out. In using the analog accelerator for linear algebra, the design's bandwidth limits the convergence rate, so the convergence rate within a time interval cannot be arbitrarily large. Therefore, the numerical iteration in the analog accelerator is akin to fixed-step-size relaxation or steepest descent. Although we can consider the analog accelerator as doing continuous-time steepest descent, taking many infinitesimal steps in continuous time, doing many iterations of a poor algorithm is in this case no match for a better algorithm. Efficient discrete-time algorithms such as conjugate gradient and multigrid have been known to researchers since the 1950s. Analog computers remained in use in the 1960s to solve steepest descent due to their better immediate performance relative to early digital computers.
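The algorithmic gap described above can be reproduced in a few lines: on the same symmetric positive-definite system, conjugate gradients (adaptive step sizes using gradient history) reaches a given tolerance in far fewer iterations than a fixed-step relaxation of the kind the continuous-time hardware effectively performs. This is a self-contained illustration of the argument, not the article's experiment.

```python
import numpy as np

def fixed_step_descent(A, b, step, tol=1e-6, max_iters=100000):
    """Richardson/steepest-descent-style iteration with a fixed step size."""
    x = np.zeros_like(b)
    for k in range(max_iters):
        r = b - A @ x
        if np.linalg.norm(r) < tol:
            return x, k
        x = x + step * r
    return x, max_iters

def conjugate_gradient(A, b, tol=1e-6, max_iters=10000):
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for k in range(max_iters):
        if np.sqrt(rs) < tol:
            return x, k
        Ap = A @ p
        alpha = rs / (p @ Ap)          # step size from current gradient and history
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iters

# 1D Poisson-like SPD system, as in a simple elliptic PDE discretization.
n = 100
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
_, iters_fixed = fixed_step_descent(A, b, step=0.45)  # step must stay below 2/lambda_max (~0.5 here)
_, iters_cg = conjugate_gradient(A, b)
print(iters_fixed, iters_cg)  # fixed-step needs orders of magnitude more iterations
```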
Changing the basic abstractions in computer architecture could change what types of problems are solvable. Interesting physical phenomena are usually continuous-time, analog, nonlinear, and often stochastic, so the computer architectures and mathematical abstractions for simulating these processes should also be continuous-time and analog. Although analog acceleration has limited benefits for solving linear algebra, analog acceleration holds promise in problem classes such as nonlinear systems, in which digital algorithms and hardware architectures have been less successful. In this regard, this article could be the first in a line of work redefining what problems are tractable and should be pursued for analog computing. MICRO

Acknowledgments
This work is supported by NSF award CNS-1239134 and a fellowship from the Alfred P. Sloan Foundation. This article is based on our ISCA 2016 paper.8

References
1. W.H. Press et al., Numerical Recipes: The Art of Scientific Computing, 3rd ed., Cambridge Univ. Press, 2007.
2. W. Chen and L.P. McNamee, "Iterative Solution of Large-Scale Systems by Hybrid Techniques," IEEE Trans. Computers, vol. C-19, no. 10, 1970, pp. 879–889.
3. W.J. Karplus and R. Russell, "Increasing Digital Computer Efficiency with the Aid of Error-Correcting Analog Subroutines," IEEE Trans. Computers, vol. C-20, no. 8, 1971, pp. 831–837.
4. G. Korn and T. Korn, Electronic Analog and Hybrid Computers, McGraw-Hill, 1972.
5. C.C. Douglas, J. Mandel, and W.L. Miranker, "Fast Hybrid Solution of Algebraic Systems," SIAM J. Scientific and Statistical Computing, vol. 11, no. 6, 1990, pp. 1073–1086.
6. Y. Zhang and S.S. Ge, "Design and Analysis of a General Recurrent Neural Network Model for Time-Varying Matrix Inversion," IEEE Trans. Neural Networks, vol. 16, no. 6, 2005, pp. 1477–1490.
7. N. Guo et al., "Energy-Efficient Hybrid Analog/Digital Approximate Computation in Continuous Time," IEEE J. Solid-State Circuits, vol. 51, no. 7, 2016, pp. 1514–1524.
8. Y. Huang et al., "Evaluation of an Analog Accelerator for Linear Algebra," Proc. ACM/IEEE 43rd Ann. Int'l Symp. Computer Architecture (ISCA), 2016, pp. 570–582.

Yipeng Huang is a PhD candidate in the Computer Architecture and Security Technologies Lab at Columbia University. His research interests include applications of analog computing and benchmarking of robotic systems. Huang has an MPhil in computer science from Columbia University. He is a member of the IEEE Computer Society and ACM SIGARCH. Contact him at [email protected].

Ning Guo is a hardware engineer at Cognescent. His research interests include continuous-time analog/hybrid computing and energy-efficient approximate computing. Guo received a PhD in electrical engineering from Columbia University, where he performed the work for this article. Contact him at [email protected].

Mingoo Seok is an assistant professor in the Department of Electrical Engineering at Columbia University. His research interests include low-power, adaptive, and cognitive VLSI systems design. Seok received a PhD in electrical engineering from the University of Michigan, Ann Arbor. He has received an NSF CAREER award and is a member of IEEE. Contact him at [email protected].

Yannis Tsividis is the Edwin Howard Armstrong Professor of Electrical Engineering at Columbia University. His research interests include analog and hybrid analog/digital integrated circuit design for computation and signal processing. Tsividis received a PhD in electrical engineering from the University of California, Berkeley. He is a Life Fellow of IEEE. Contact him at [email protected].

Simha Sethumadhavan is an associate professor in the Department of Computer Science at Columbia University. His research interests include computer architecture and computer security. Sethumadhavan received a PhD in computer science from the University of Texas at Austin. He has received an Alfred P. Sloan fellowship and an NSF CAREER award. Contact him at [email protected].

...... DOMAIN SPECIALIZATION IS GENERALLY UNNECESSARY FOR ACCELERATORS

......

DOMAIN-SPECIFIC ACCELERATORS (DSAS), WHICH SACRIFICE PROGRAMMABILITY FOR EFFICIENCY, ARE A REACTION TO THE WANING BENEFITS OF DEVICE SCALING. THIS ARTICLE DEMONSTRATES THAT THERE ARE COMMONALITIES BETWEEN DSAS THAT CAN BE EXPLOITED WITH PROGRAMMABLE MECHANISMS. THE GOALS ARE TO CREATE A PROGRAMMABLE ARCHITECTURE THAT CAN MATCH THE BENEFITS OF A DSA AND TO CREATE A PLATFORM FOR FUTURE ACCELERATOR INVESTIGATIONS.

Tony Nowatzki
University of California, Los Angeles

Vinay Gangadhar
Karthikeyan Sankaralingam
University of Wisconsin–Madison

Greg Wright
Qualcomm Research

Performance improvements from general-purpose processors have proved elusive in recent years, leading to a surge of interest in more narrowly applicable architectures in the hope of continuing system improvements in at least some significant application domains. A popular approach so far has been building domain-specific accelerators (DSAs): hardware engines capable of performing computations in a particular domain with high performance and energy efficiency. DSAs have been developed for many domains, including machine learning, cryptography, regular expression matching and parsing, video decoding, and databases. DSAs have been shown to achieve 10 to 1,000 times performance and energy benefits over high-performance, power-hungry general-purpose processors.

For all of their efficiency benefits, DSAs sacrifice programmability, which makes them prone to obsoletion: the domains we need to specialize, as well as the best algorithms to use, are constantly evolving with scientific progress and changing user needs. Moreover, the relevant domains change between device types (server, mobile, wearable), and creating fundamentally new designs for each costs both design and validation time. More subtly, most devices run several different important workloads (such as mobile systems on chip), and therefore multiple DSAs will be required; this could mean that although each DSA is area-efficient, a combination of DSAs might not be.

Critically, the alternative to domain specialization is not necessarily standard general-purpose processors, but rather programmable and configurable architectures that employ similar microarchitectural mechanisms for specialization. The promises of such an architecture are high efficiency and the ability to be flexible across workloads. Figure 1a depicts the two specialization paradigms at a high level, leading to the central question of this article: How far can the efficiency of programmable architectures be pushed, and can they be competitive with domain-specific designs?

To this end, this article first observes that although DSAs differ greatly in their design choices, they all employ a similar set of specialization principles:

- Matching of the hardware concurrency to the enormous parallelism typically present in accelerated algorithms.
- Problem-specific functional units (FUs) for computation.
- Explicit communication of data as opposed to implicit transfer through shared (register and memory) address spaces in a general-purpose instruction set architecture (ISA).
- Customized structures for caching.
- Coordination of the control of the other hardware units using simple, low-power hardware.

Our primary insight is that these shared principles can be exploited by composing known programmable and configurable microarchitectural mechanisms. In this article, we describe one such architecture, our proposed design, LSSD (see Figure 1b). (LSSD stands for low-power core, spatial architecture, scratchpad, and DMA.)

To exploit the concurrency present in specializable algorithms while retaining programmability, we employ many tiny, low-power cores. To improve the cores for handling the commonly high degree of computation, we add a spatial fabric to each core; the spatial fabric's network specializes the operand communication, and its FUs can be specialized for algorithm-specific computation. Adding scratchpad memories enables the specialization of caching, and a DMA engine specializes the memory communication. The low-power core makes it possible to specialize the coordination.

Figure 1. Specialization paradigms and tradeoffs. (a) Alternate specialization paradigms. (b) Our LSSD architecture for programmable specialization.

This article has two primary goals. First, we aim to show that by generalizing common specialization principles, we can create a programmable architecture that is competitive with DSAs. Our evaluation of LSSD matches DSA performance with two to four times the power and area overhead. Second, our broader goal is to inspire the use of programmable fabrics like LSSD as vehicles for future accelerator research. These types of architectures are far better baselines than out-of-order (OoO) cores for distilling the benefits of specialization, and they can serve as a platform for generalizing domain-specific research, broadening and strengthening their impact.

The Five C's of Specialization
DSAs achieve efficiency through the employment of five common specialization principles, which we describe here in detail. We also discuss how four recent accelerator designs apply these principles.

Defining the Specialization Principles
Before we define the specialization principles, let's clarify that we are discussing specialization principles only for workloads that are most commonly targeted with accelerators.

In particular, these workloads have significant parallelism, either at the data or thread level; perform some computational work; have coarse-grained units of work; and have mainly regular memory access.

Here, we define the five principles of architectural specialization and discuss the potential area, power, and performance tradeoffs of targeting each.

Concurrency specialization. A workload's concurrency is the degree to which its operations can be performed simultaneously. Specializing for a high degree of concurrency means organizing the hardware to perform work in parallel by favoring lower overhead structures. Examples of specialization strategies include employing many independent processing elements with their own controllers or using a wide vector model with a single controller. Applying hardware concurrency increases the performance and efficiency of parallel workloads while increasing area and power.

Computation specialization. Computations are individual units of work in an algorithm performed by FUs. Specializing computation means creating problem-specific FUs (for instance, a FU that computes sine). Specializing computation improves performance and power by reducing the total work. Although computation specialization can be problem-specific, some commonality between domains at this level is expected.

Communication specialization. Communication is the means of transmission of values between and among storage and FUs. Specialized communication is simply the instantiation of communication channels and buffers between hardware units to facilitate faster operand throughput to the FUs. This reduces power by lessening access to intermediate storage, and potentially to area if the alternative is a general communication network. One example is a broadcast network for efficiently sending immediately consumable data to many computational units.

Caching specialization. Specialization for caching exploits the inherent data reuse, which is an algorithmic property wherein intermediate values are consumed multiple times. The specialization of caching means using custom storage structures for these temporaries. In the context of accelerators, access patterns are often known a priori, often meaning that low-ported, wide scratchpads (or small registers) are more effective than classic caches.

Coordination specialization. Hardware coordination is the management of hardware units and their timing to perform work. Instruction sequencing, control flow, signal decoding, and address generation are all examples of coordination tasks. Specializing it usually involves the creation of small state machines to perform each task, rather than reliance on a general-purpose processor and (for example) OoO instruction scheduling. Performing more coordination specialization typically means less area and power compared to something more programmable, at the price of generality.

Relating Specialization Principles to Accelerator Mechanisms
Figure 2 depicts the block diagrams of the four DSAs that we study; shading indicates the types of specialization of each component. We relate the specialization mechanisms to algorithmic properties below.

Neural Processing Unit (NPU) is a DSA for approximate computing using neural networks.1 It exploits the concurrency of each network level, using parallel processing entities (PEs) to pipeline the computations of eight neurons simultaneously. NPU specializes reuse with accumulation registers and per-PE weight buffers. For communication, NPU employs a broadcast network specializing the large network fan-outs and specializes computation with sigmoid FUs. A bus scheduler and PE controller specialize the hardware coordination.

Convolution Engine accelerates stencil-like computations.2 The host core coordinates control through custom instructions. It exploits concurrency through both vector and pipeline parallelism and uses custom scratchpads for caching pixels and coefficients. In addition, column and row interfaces provide shifted versions of intermediate values. These, along with other wide buses, provide communication specialization. It also uses a specialized graph-fusion computation unit.


Figure 2. Application of specialization principles in four domain-specific accelerators (DSAs). The elements of each DSA's high-level organization and low-level processing unit structure are labeled according to their primary role in performing specialization.

Q100 is a DSA for streaming database queries, which exploits the pipeline concurrency of database operators and intermediate outputs.3 To support a streaming model, it uses stream buffers to prefetch database columns. Q100 specializes the communication by providing dynamically routed channels between FUs to prevent memory spills. It uses custom database FUs, such as Join, Sort, and Partition. It specializes data caching by storing constants and reused intermediates within these operators' implementations. The communication network configuration and stream buffers are coordinated using an instruction sequencer.

DianNao is a DSA for deep neural networks.4 It achieves concurrency by applying a very wide vector computation model and uses wide memory structures (4,096-bit wide static RAMs) for reuse specialization of neurons, accumulated values, and synapses. DianNao also relies on specialized sigmoid FUs. Point-to-point links between FUs, with little bypassing, specialize the communication. A specialized control processor is used for coordination.

An Architecture for Programmable Specialization
Our primary insight is that well-understood mechanisms can be composed to target the same specialization principles that DSAs use, but in a programmable fashion. In this section, we explain the architecture of LSSD, highlighting how it performs specialization using the principles while remaining programmable and parameterizable for different domains. This is not the only possible set of mechanisms, but it is a simple and effective set. The sidebar, "Related Programmable Specialization Architectures," discusses alternative designs.

LSSD Design
The most critical principle is exploiting concurrency, of which there is typically an abundant amount when considering specializable workloads. Requiring high concurrency pushes the design toward simplicity, and requiring programmability implies the use of some sort of programmable core. The natural way to satisfy these is to use an array of tiny low-power cores that communicate through memory. This is a sensible tradeoff because commonly specialized workloads exhibit little communication between the coarse-grained parallel units. The remainder of the design is a straightforward application of the remaining specialization principles.

Related Programmable Specialization Architectures

One related architecture is Smart Memories,1 which when configured acts like either a streaming engine or a speculative multiprocessor. Its primary innovations include mechanisms that let static RAMs (SRAMs) act as either scratchpads or caches for reuse. Smart Memories is both more complex and more general than LSSD, although it's likely less efficient on the regular workloads we target.

Another example is Charm,2 the composable heterogeneous accelerator-rich microprocessor, which integrates coarse-grained configurable functional unit blocks and scratchpads for reuse specialization. A fundamental difference is in the decoupling of the computation units, reuse structures, and host cores, which allows concurrent programs to share blocks in complex ways.

The Vector-Thread architecture supports unified vector-and-multithreading execution, providing flexibility across data-parallel and irregularly parallel workloads.3 The most similar design in terms of microarchitecture is MorphoSys.4 It also embeds a low-power TinyRisc core, integrated with a coarse-grained reconfigurable architecture (CGRA), direct memory access engine, and frame buffer. Here, the frame buffer is not used for data reuse, and the CGRA is more loosely coupled with the host core.

References
1. K. Mai et al., "Smart Memories: A Modular Reconfigurable Architecture," Proc. 27th Int'l Symp. Computer Architecture, 2000, pp. 161–171.
2. J. Cong et al., "Charm: A Composable Heterogeneous Accelerator-Rich Microprocessor," Proc. ACM/IEEE Int'l Symp. Low Power Electronics and Design, 2012, pp. 379–384.
3. R. Krashinsky et al., "The Vector-Thread Architecture," Proc. 31st Ann. Int'l Symp. Computer Architecture, 2004, pp. 52–63.
4. H. Singh et al., "MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications," IEEE Trans. Computers, vol. 49, no. 5, 2000, pp. 465–481.

Achieving communication specialization of intermediate values requires an efficient distribution mechanism for operands that avoids expensive intermediate storage such as multiported register files. Arguably, the best-known approach is an explicit routing network that is exposed to the ISA to eliminate the hardware burden of dynamic routing. This property is what defines spatial architectures, and we add a spatial architecture as our first mechanism. This serves as an appropriate place to instantiate custom FUs, that is, computation specialization. It also enables specialization of caching constant values associated with specific computations.

To achieve communication specialization with the global memory, a natural solution is to add a DMA engine and configurable scratchpad, with a vector interface to the spatial architecture. The scratchpad, configured as a DMA buffer, enables the efficient streaming of memory by decoupling memory access from the spatial architecture. When configured differently, the scratchpad can act as a programmable cache. A single-ported scratchpad is enough, because access patterns are usually simple and known ahead of time.

Finally, to coordinate the hardware units (for example, synchronizing DMA with the computation), we use the simple core, which is programmable and general. The overhead is low, provided the core is low-power enough, and the spatial architecture is large enough.

Thus, each unit of our proposed fabric contains a low-power core, a spatial architecture, scratchpad, and DMA (LSSD), as shown in Figure 1b. It is programmable, has high efficiency through the application of specialization principles, and has simple parameterizability.

Use of LSSD in Practice
Preparing the LSSD fabric for use occurs in two phases: design synthesis and programming.

For specialized architectures, design synthesis is the process of provisioning for given performance, area, and power goals. It involves examining one or more workload domains and choosing the appropriate FUs, the datapath size, the scratchpad sizes and widths, and the degree of concurrency exploited through multiple core units. Although many optimization strategies are possible, in this work, we consider the primary constraint to be performance; that is, there exists some throughput target that must be met, and power and area should be minimized, while still retaining some degree of generality and programmability.

Programming an LSSD has two major components: creation of the coordination code for the low-power core and generation of the configuration data for the spatial datapath to match available resources.
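To illustrate the two programming components named above, here is a hedged sketch of what the spatial-datapath configuration and the core's coordination code might look like for a simple dot-product-style kernel. The tiny `FabricConfig`/`run_fabric` API and the software emulation of the fabric are invented for illustration; the actual toolchain is described in the authors' original HPCA 2016 paper.

```python
# Illustrative-only model of the two programming components: (1) a spatial-fabric
# configuration (a tiny dataflow description) and (2) coordination code the
# low-power core would run (stream tiles in, fire the fabric, accumulate).
from dataclasses import dataclass, field

@dataclass
class FabricConfig:
    """Configuration data for the spatial datapath: FU operations and vector width."""
    ops: list = field(default_factory=lambda: ["mul", "add_reduce"])
    vector_width: int = 8          # matches the scratchpad/memory interface width

def run_fabric(config, a_tile, b_tile):
    """Software stand-in for one spatial-fabric invocation (multiply-accumulate)."""
    assert len(a_tile) == len(b_tile) <= config.vector_width
    return sum(x * y for x, y in zip(a_tile, b_tile))

def coordination_code(a, b, config):
    """What the low-power core does: stream tiles via 'DMA' and accumulate."""
    acc = 0.0
    w = config.vector_width
    for i in range(0, len(a), w):                      # DMA engine would prefetch each tile
        scratch_a, scratch_b = a[i:i + w], b[i:i + w]  # tiles land in the scratchpad
        acc += run_fabric(config, scratch_a, scratch_b)
    return acc

print(coordination_code(list(range(20)), list(range(20)), FabricConfig()))
```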

Table 1. Methodology for obtaining DSA baseline characteristics.*

DSA | Execution time | Power/Area
Neural Processing Unit (NPU) | Authors provided | MCPAT-based estimation
Convolution Engine | Authors provided | MCPAT-based estimation
Q100 | Optimistic model | In original paper3
DianNao | Optimistic model | In original paper4

*All area and power estimates are scaled to 28 nm.

Programming for LSSD in assembly might be reasonable because of the simplicity of both the control and data portions. In practice, using either standard languages with #pragma annotations or languages like OpenCL would likely be effective.

Design Provisioning and Methodology
In this section, we describe the design points that we study in this work, along with the provisioning and evaluation methodology. More details are in our original paper for the 2016 Symposium on High Performance Computer Architecture.5

Implementation Building Blocks
We use several existing components, both from the literature and from available designs, as building blocks for the LSSD architecture. The spatial architecture we leverage is the DySER coarse-grained reconfigurable architecture (CGRA),6 which is a lightweight, statically routed mesh of FUs. Note that we will use CGRA and "spatial architecture" interchangeably henceforth. The processor we leverage is a Tensilica LX3, which is a simple, very long instruction word design featuring a low-frequency (1 GHz) seven-stage pipeline. We chose this because of its low area and power footprint and because it can run irregular code.

LSSD Provisioning
To instantiate LSSD, we provision its parameters to meet each domain's performance requirements by modifying FU composition, scratchpad size, and the number of cores (for details, see our original paper). By provisioning for each domain separately, we create LSSDN, LSSDC, LSSDQ, and LSSDD, for neural network approximation, convolution, databases, and deep neural networks workloads, respectively. We also consider a balanced design (LSSDB), which contains a superset of the capabilities of each of the above and can execute all workloads with required performance.

Evaluation Methodology
At a high level, our methodology attempts to fairly assess LSSD tradeoffs across workloads from four accelerators through pulling data from past works and the original authors, applying performance modeling techniques, using sanity checking against real systems, and using standard area/power models. Where assumptions were necessary, we made those that favored the benefits of the DSA.

LSSD performance estimation. Our strategy uses a combined trace-simulator and application-specific modeling framework to capture the role of the compiler and the LSSD program. This framework is parameterizable for different FU types, concurrency parameters (single-instruction, multiple-data [SIMD] width and LSSD unit counts), and reuse and communication structures.

LSSD power and area estimation. Integer FU estimates come from DianNao4 and floating-point FUs are from DySER.6 CGRA-network estimates come from synthesis, and static RAMs use CACTI estimates.

DSA and baseline characteristics. We obtain each DSA's performance, area, and power as shown in Table 1.

Table 2. Breakdown and comparison of LSSD (a) area and (b) power (normalized to 28 nm).

(a) Area (mm2) | LSSDN | LSSDC | LSSDQ | LSSDD
Core and cache | N/A | N/A | 0.09 | 0.09
Static RAM (SRAM) | 0.04 | 0.02 | 0.04 | 0.04
Functional unit (FU) | 0.24 | 0.02 | 0.09 | 0.02
CGRA* network | 0.09 | 0.11 | 0.22 | 0.11
Unit total | 0.37 | 0.15 | 0.44 | 0.26
LSSD total area | 0.37 | 0.15 | 1.78 | 2.11
DSA total area | 0.30 | 0.08 | 3.69 | 0.56
LSSD/DSA overhead | 1.23 | 1.74 | 0.48 | 3.76

(b) Power (mW) | LSSDN | LSSDC | LSSDQ | LSSDD
Core and cache | 41 | 41 | 41 | 41
SRAM | 9 | 5 | 9 | 5
FU | 65 | 7 | 33 | 7
CGRA network | 34 | 56 | 46 | 56
Unit total | 149 | 108 | 130 | 108
LSSD total power | 149 | 108 | 519 | 867
DSA total power | 74 | 30 | 870 | 213
LSSD/DSA overhead | 2.02 | 3.57 | 0.60 | 4.06

*CGRA: Coarse-grained reconfigurable architecture.
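The per-design totals in Table 2 follow a simple composition: a unit total is the sum of its component estimates, and the LSSD total is the unit total multiplied by the number of provisioned units (roughly four units for LSSDQ and eight for LSSDD, judging from the ratios in the table; those unit counts are inferred here rather than stated in this excerpt). A minimal sketch:

```python
def unit_total(components):
    """Sum per-unit component estimates (same structure for area in mm^2 or power in mW)."""
    return sum(components.values())

def lssd_total(components, n_units):
    return unit_total(components) * n_units

# Power (mW) for the Q100-provisioned design, from Table 2(b); unit count inferred.
lssd_q_power = {"core_and_cache": 41, "sram": 9, "fu": 33, "cgra_network": 46}
print(unit_total(lssd_q_power))      # ~130 mW per unit, matching the table
print(lssd_total(lssd_q_power, 4))   # ~520 mW, close to the 519 mW LSSD total
```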

Comparison to OoO baseline. We estimate the properties of the OoO baseline (Intel's 3770K) from datasheets and die photos, and frequency is scaled to 2 GHz.

Evaluation
We organize our evaluation around four main questions:

- Q1. What is the cost of general programmability in terms of area and power?
- Q2. If multiple workloads are required on-chip, can LSSD ever surpass the area or power efficiency?
- Q3. What are the sources of LSSD's performance?
- Q4. How do LSSD's power overheads affect the overall energy efficiency?

We answer these questions through detailed analysis as follows.

LSSD Area/Power Overheads (Q1)
To elucidate the costs of more general programmability, Table 2 shows the power and area breakdowns for the LSSD designs. LSSDD has the worst-case area and power overheads of 3.76 and 4.06 times, respectively, compared to DianNao. The CGRA network dominates area and power because it supports relatively tiny 16-bit FUs. The best case is LSSDQ, which has 0.48 times the area and 0.6 times the power of Q100. The primary reason is that LSSD does not embed the expensive Sort and Partition units. Even though not including these units leads to performance loss on several queries, we believe this to be a reasonable tradeoff overall.

The takeaway: with suitable engineering, we can reduce programmability overheads to small factors of 2 to 4 times, as opposed to the 100- to 1,000-times inefficiency of large OoO cores.

Supporting Multiple Domains (Q2)
If multiple workload domains require specialization on the same chip, but do not need to be run simultaneously, it is possible that LSSD can be more area-efficient than a MultiDSA design. Figure 3a shows the area and geomean power tradeoffs for two workload domain sets, comparing the MultiDSA chip to the balanced LSSDB design.

The domain set (NPU, Convolution Engine, and DianNao) excludes our best LSSD result (Q100 workloads). In this case, LSSDB still has 2.7 times the area and 2.4 times the power overhead. However, with Q100 added, LSSDB is only 0.6 times the area.

The takeaway: if only one domain needs to be supported at a time, LSSD can become more area-efficient than using multiple DSAs.

Figure 3. LSSD's power, area, and performance tradeoffs. (a) Area and power of Multi-DSA versus LSSDB. (Baseline: core plus L1 cache plus L2 cache from I3770K processor.) (b) LSSD versus DSA performance across four domains.

Performance Analysis (Q3)
Figure 3b shows the performance of the DSAs and domain-provisioned LSSD designs, normalized to the OoO core. Across workload domains, LSSD matches the performance of the DSAs, with speedups over a modern OoO core of between 10 and 150 times.

To elucidate the sources of benefits of each specialization principle in LSSD, we define five design points (which are not power or area normalized), wherein each builds on the capabilities of the previous design point:

- Core+SFU, the LX3 core with added problem-specific FUs (computation specialization);
- Multicore, LX3 multicore system (plus concurrency specialization);
- SIMD, an LX3 with SIMD, its width corresponding to LSSD's memory interface (plus concurrency specialization);
- Spatial, an LX3 in which the spatial architecture replaces the SIMD units (plus communication specialization); and
- LSSD, the previous design plus scratchpad (plus caching specialization).

The largest factor is consistently concurrency (4 times for LSSDN, 31 times for LSSDC, 9 times for LSSDQ, and 115 times for LSSDD). This is intuitive, because these workloads have significant exploitable parallelism. The largest secondary factors for LSSDN and LSSDD are from caching neural weights in scratchpad memories, which enables increased data bandwidth to the core. LSSDC and LSSDQ benefit from CGRA-based execution when specializing for communication.

The takeaway is that LSSD designs have competitive performance with DSAs. The performance benefits come mostly from concurrency rather than other specialization techniques.

Energy-Efficiency Tradeoffs (Q4)
It is important to understand how much the power overheads affect the overall system-level energy benefits. Here, we apply simple analytical reasoning to bound the possible energy-efficiency improvement of a general-purpose system accelerated with a DSA versus an LSSD design, by considering a zero-power DSA.

We define the overall relative energy, E, for an accelerated system in terms of S, the accelerator's speedup; U, the accelerator utilization as a fraction of the original execution time; Pcore, the general-purpose core power; Psys, the system power; and Pacc, the accelerator power. The core power includes components that sit idle while the accelerator is invoked, whereas the system power components are active while accelerating (such as higher-level caches and DRAM). The total energy then becomes

E = Pacc(U/S) + Psys(1 - U + U/S) + Pcore(1 - U)

We characterize system-level energy tradeoffs across accelerator utilizations and speedups in Figure 4. Figure 4a shows that the maximum benefits from a DSA are reduced both as the utilization goes down (stressing core power) and when accelerator speedup increases (stressing both core and system power). For a reasonable utilization of U = 0.5 and speedup of S = 10, the maximum energy efficiency gain from a DSA is less than 0.5 percent. Figure 4b shows a similar graph, in which LSSD's power is varied, whereas utilization is fixed at U = 0.5. Even considering an LSSD with power equivalent to the core, when LSSD has a speedup of 10 times, there is only 5 percent potential energy savings remaining for a DSA to optimize. The takeaway is that when an LSSD can match the performance of a DSA, the further potential energy benefits of a DSA are usually small, making LSSD's overheads largely inconsequential.

Figure 4. Energy benefits of zero-power DSA (Pcore = 5 W, Psys = 5 W). (a) Varying utilization, Plssd = 0.5 W. (b) Varying LSSD power, U = 0.5.
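The bound just derived is easy to reproduce numerically. The sketch below evaluates the relative-energy expression for a zero-power DSA and for an LSSD with the stated power, using the parameter values given in the text and Figure 4 (Pcore = Psys = 5 W); it is a direct transcription of the formula, with the specific numbers taken from the article.

```python
def relative_energy(p_acc, speedup, utilization, p_core=5.0, p_sys=5.0):
    """E = Pacc*(U/S) + Psys*(1 - U + U/S) + Pcore*(1 - U), as defined in the text."""
    u, s = utilization, speedup
    return p_acc * (u / s) + p_sys * (1 - u + u / s) + p_core * (1 - u)

# U = 0.5, S = 10: a zero-power DSA versus a 0.5-W LSSD.
e_dsa = relative_energy(p_acc=0.0, speedup=10, utilization=0.5)
e_lssd = relative_energy(p_acc=0.5, speedup=10, utilization=0.5)
print(e_lssd / e_dsa)   # ~1.005: the DSA's maximum energy advantage is under 0.5 percent

# Even with LSSD power equal to the core (5 W), the gap is only about 5 percent.
print(relative_energy(5.0, 10, 0.5) / e_dsa)   # ~1.048
```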

The broad intellectual thrust of this article is to propose an accelerator architecture that could be used to drive future investigations. As an analogy, the canonical five-stage pipelined processor was simple and effective enough to serve as a framework for almost three decades of big ideas, policies, and microarchitecture mechanisms that drove the general-purpose processor era. Up to now, architects have not focused on an equivalent framework for accelerators.

Most accelerator proposals are presented as a novel design with a unique composition of mechanisms for a particular domain, making comparisons meaningless. However, from an intellectual standpoint, our work shows that these accelerators are more similar than dissimilar; they exploit the same essential principles with differences in their implementation. This is why we believe an architecture designed around these principles can serve as a standard framework. Of course, the development of DSAs will continue to be critical for architecture research, both to enable the exploration of the limits of acceleration, and as a means to extract new acceleration principles.

In the literature today, DSAs are proposed and compared to conventional high-performance processors, and typically yield several orders of magnitude better measurements on various metrics of interest. For the four DSAs we looked at, the benefit of specialization is only two to four times for LSSD in area and power when performance-normalized. Therefore, we argue that the overheads of OoO processors and GPUs make them poor targets to distill the true benefit of specialization.
image-processing domains we considered, 7. P. Judd et al., “Proteus: Exploiting Numeri- and evaluating with LSSD enables the study cal Precision Variability in Deep Neural of these mechanisms’s generalizability. Networks,” Proc. Int’l Conf. Supercomput- Our formulation of principles makes clear ing, 2016, article 23. what workload behaviors are currently 8. J. Albericio et al., “Cnvlutin: Ineffectual- uncovered and need discovery of new princi- Neuron-Free Deep Neural Network ples to match existing accelerators. This Computing,” Proc. ACM/IEEE 43rd Ann. direction leads to the more open questions of Int’l Symp. Computer Architecture, 2016, whether the number of principles are eventu- pp. 1–13. ally too numerous to be practical to put in a single substrate, whether efficient mecha- Tony Nowatzki is an assistant professor in nisms can be discovered to target many prin- the Department of Computer Science at the ciples with a single substrate, or whether they University of California, Los Angeles. His are sufficiently few in number such that one research interests include architecture and can build a single universal framework. compiler codesign and mathematical mod- Overall, the specialization principles and eling. Nowatzki received a PhD in computer LSSD-style architectures can be used to science from the University of Wisconsin– decouple accelerator research from workload Madison. He is a member of IEEE. Contact domains, which we believe can help foster him at [email protected]. more shared innovation in this space. MICRO ...... Vinay Gangadhar is a PhD student in the Department of Electrical and Computer References Engineering at the University of Wiscon- 1. H. Esmaeilzadeh et al., “Neural Acceleration sin–Madison. His research interests include for General-Purpose Approximate Programs,” hardware/software codesign of program- Proc. 45th Ann. IEEE/ACM Int’l Symp. Micro- mable accelerators, microarchitecture, and architecture, 2012, pp. 449–460. GPU computing. Gangadhar received an 2. W. Qadeer et al., “Convolution Engine: Bal- MS in electrical and computer engineering ancing Efficiency and Flexibility in Specialized from the University of Wisconsin–Madison. Computing,” Proc. 40th Ann. Int’l Symp. He is a student member of IEEE. Contact Computer Architecture, 2013, pp. 24–35. him at [email protected]. 3. L. Wu et al., “Q100: The Architecture and Design of a Database Processing Unit,” Karthikeyan Sankaralingam is an associate Proc. 19th Int’l Conf. Architectural Support professor in the Department of Computer ...... MAY/JUNE 2017 49 ...... TOP PICKS

Sciences and the Department of Electrical and Computer Engineering at the Univer- sity of Wisconsin–Madison, where he also leads the Vertical Research Group. His research interests include microarchitecture, architecture, and very large-scale integra- tion. Sankaralingam received a PhD in com- puter science from the University of Texas at Austin. He is a senior member of IEEE. Contact him at [email protected].

Greg Wright is the director of engineering at Qualcomm Research. His research interests include processor architecture, virtual machines, compilers, parallel algorithms, and memory models. Wright received a PhD in computer science from the University of Manchester. Contact him at [email protected].


...... CONFIGURABLE CLOUDS

......

THE CONFIGURABLE CLOUD DATACENTER ARCHITECTURE INTRODUCES A LAYER OF RECONFIGURABLE LOGIC BETWEEN THE NETWORK SWITCHES AND SERVERS. THE AUTHORS DEPLOY THE ARCHITECTURE OVER A PRODUCTION SERVER BED AND SHOW HOW IT CAN ACCELERATE APPLICATIONS THAT WERE EXPLICITLY PORTED TO FIELD-PROGRAMMABLE GATE ARRAYS, SUPPORT HARDWARE-FIRST SERVICES, AND ACCELERATE APPLICATIONS WITHOUT ANY APPLICATION-SPECIFIC FPGA CODE BEING WRITTEN.

Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Daniel Firestone, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger
Microsoft

Hyperscale clouds (hundreds of thousands to millions of servers) are an attractive option to run a vast and increasing number of applications and workloads spanning web services, data processing, AI, and the Internet of Things. Modern hyperscale datacenters have made huge strides with improvements in networking, virtualization, energy efficiency, and infrastructure management, but they still have the same basic structure they've had for years: individual servers with multicore CPUs, DRAM, and local storage, connected by the network interface card (NIC) through Ethernet switches to other servers. However, the slowdown in CPU scaling and the end of Moore's law have resulted in a growing need for hardware specialization to increase performance and efficiency.

There are two basic ways to introduce hardware specialization into the datacenter: one is to form centralized pools of specialized machines, which we call "bolt-on" accelerators, and the other is to distribute the specialized hardware to each server. Introducing bolt-on accelerators into a hyperscale infrastructure reduces the highly desirable homogeneity and limits the scalability of the specialized hardware, but minimizes disruption to the core server infrastructure. Distributing the accelerators to each server in the datacenter retains homogeneity, allows more efficient scaling, allows services to run on all the servers, and simplifies management by reducing costs and configuration errors. The question of which method is best is mostly one of economics: is it more cost effective to deploy an accelerator in every new server, to specialize a subset of an infrastructure's new servers and maintain an ever-growing number of configurations, or to do neither?

Any specialized accelerator must be compatible with the target workloads through its deployment lifetime (for example, six years: two years to design and deploy the accelerator and four years of server deployment lifetime). This requirement is a challenge given both the diversity of cloud workloads and the rapid rate at which they change (weekly or monthly). It is thus highly desirable that accelerators incorporated into hyperscale servers be programmable. The two most common examples are field-programmable gate arrays (FPGAs) and GPUs, which (in this regard) are preferable over ASICs.

Projects Related to the Configurable Cloud Architecture

For a complete analysis of related work and a taxonomy of various system design options, see our original paper.1 In this sidebar, we focus on projects with related system architectures.

Hyperscale accelerators commonly comprise three types of accelerators: custom ASICs, application-specific processors and GPUs, and field-programmable gate arrays (FPGAs). Custom ASIC solutions, such as DianNao,2 provide excellent performance and efficiency for their target workload. However, ASICs are inflexible, so they restrict the rapidly evolving applications from evolving while still being able to use the accelerator.

Google's Tensor Processing Unit (TPU)3 is an application-specific processor that has been highly tuned to execute TensorFlow. GPUs have been commonly used in datacenters as bolt-on accelerators, even with small-scale interconnection networks such as NVLink.4 However, the size and power requirements for GPUs are still much higher than for FPGAs.

FPGA deployments, including Amazon's EC2 F1,5 Baidu's SDA,6 IBM's FPGA fabric,7 Novo-G#,8 and our first-generation architecture,1 all cluster FPGAs into a small subset of the architecture—bolt-on accelerators. None of these have the scalability or ability to benefit the baseline datacenter server as the Configurable Cloud architecture does.

References
1. A. Putnam et al., "A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services," Proc. 41st Ann. Int'l Symp. Computer Architecture, 2014, pp. 13–24.
2. T. Chen et al., "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning," ACM SIGPLAN Notices, vol. 49, no. 4, 2014, pp. 269–284.
3. M. Abadi et al., "TensorFlow: A System for Large-Scale Machine Learning," Proc. 12th USENIX Symp. Operating Systems Design and Implementation, 2016, pp. 265–283.
4. Nvidia NVLink High-Speed Interconnect: Application Performance, white paper, Nvidia, Nov. 2014.
5. "Amazon EC2 F1 Instances (Preview)," 2017; http://aws.amazon.com/ec2/instance-types/f1.
6. J. Ouyang et al., "SDA: Software-Defined Accelerator for Large-Scale DNN Systems," Proc. Hot Chips 26 Symp., 2014; doi:10.1109/HOTCHIPS.2014.7478821.
7. J. Weerasinghe et al., "Enabling FPGAs in Hyperscale Data Centers," Proc. IEEE 12th Int'l Conf. Ubiquitous Intelligence and Computing, 12th Int'l Conf. Autonomic and Trusted Computing, and 15th Int'l Conf. Scalable Computing and Communications (UIC-ATC-ScalCom), 2015; doi:10.1109/UIC-ATC-ScalCom-CBDCom-IoP.2015.199.
8. A.G. Lawande, A.D. George, and H. Lam, "Novo-G#: A Multidimensional Torus-Based Reconfigurable Cluster for Molecular Dynamics," Concurrency and Computation: Practice and Experience, vol. 28, no. 8, 2016, pp. 2374–2393.

Our first deployment in a production hyperscale datacenter was 1,632 servers, each with an FPGA, to accelerate Bing web search ranking. The FPGAs were connected to each other in a 6 × 8 torus network in a half rack. Although effective at accelerating search ranking, our first architecture1 and similar small-scale connectivity architectures (see the sidebar, "Projects Related to the Configurable Cloud Architecture") have several significant limitations:

- Programs must be aware of where their applications are running and how many specialized machines are available, not just the best way to accelerate a given application.
- The secondary network requires expensive and complex cabling and requires awareness of the machines' physical location.
- Failure handling requires complex rerouting of traffic to neighboring nodes, causing both performance loss and isolation of nodes under certain failure patterns.
- The number of FPGAs that could communicate directly, without going through software, is limited to a single server or single rack (that is, 48 nodes).
- These fabrics are limited-scale bolt-on accelerators, which can accelerate applications but offer few enhancements for the datacenter infrastructure, such as networking and storage flows.

We propose the Configurable Cloud, a new FPGA-accelerated hyperscale datacenter architecture that addresses these limitations.2


Figure 1. Enhanced datacenter architecture. (a) Decoupled programmable hardware plane. (b) Server and field-programmable gate array (FPGA) schematic. (NIC: network interface card; TOR: top of rack.)

This architecture is sufficiently robust and performant that it has been, and is being, deployed in most new servers in Microsoft's production datacenters across more than 15 countries and 5 continents.

The Configurable Cloud

Our Configurable Cloud architecture is the first to add programmable acceleration to a core at-scale hyperscale cloud. All new Bing and Microsoft Azure cloud servers are now deployed as Configurable Cloud nodes. The key difference with our previous work1 is that this architecture replaces the dedicated FPGA network with a tight coupling between each FPGA and the datacenter network. Each FPGA is a "bump in the wire" between the servers' NICs and the Ethernet network switches (see Figure 1b). All network traffic is routed through the FPGA, which enables significant workload flexibility, while a local PCI Express (PCIe) connection maintains the local computation accelerator use case and provides local management functionality. Although this change to the network design might seem minor, the impact on the types of workloads that can be accelerated and on the scalability of the specialized accelerator fabric is profound.

Integration with the network lets every FPGA in the datacenter reach every other one (at a scale of hundreds of thousands) in under 10 microseconds on average, massively increasing the scale of FPGA resources available to applications. It also allows for acceleration of network processing, a common task for the vast majority of cloud workloads, without the development of any application-specific FPGA code. Hardware services can be shared by multiple hosts, improving the economics of accelerator deployment. Moreover, this design choice essentially turns the distributed FPGA resources into an independent plane of computation in the datacenter, at the same scale as the servers. Figure 1a shows a logical view of this plane of computation. This model offers significant flexibility, freeing services from having a fixed ratio of CPU cores to FPGAs, and instead allowing independent allocation of each type of resource.

The Configurable Cloud architecture's distributed nature lets accelerators be implemented anywhere in the hyperscale cloud, including at the edge. Services can also be easily reached by any other node in the cloud directly through the network, enabling services to be implemented at any location in the worldwide datacenter network.

Microsoft has deployed this new architecture to most of its new datacenter servers. Although the actual production scale of this deployment is orders of magnitude larger, for this article we evaluate the Configurable Cloud architecture using a bed of 5,760 servers deployed in a production datacenter.

Usage Models

The Configurable Cloud is sufficiently flexible to cover three scenarios: local acceleration (through PCIe), network acceleration, and global application acceleration through pools of remotely accessible FPGAs. Local acceleration handles high-value scenarios such as web search ranking acceleration, in which every server can benefit from having its own FPGA. Network acceleration supports services such as software-defined networking, intrusion detection, deep packet inspection, and network encryption, which are critical to infrastructure as a service (for example, "rental" of cloud servers), and which have such a huge diversity of customers that it is difficult to justify local acceleration alone economically. Global acceleration permits acceleration hardware unused by its host servers to be made available for other hardware services—for example, large-scale applications such as machine learning. This decoupling of a 1:1 ratio of servers to FPGAs is essential for breaking the chicken-and-egg problem in which accelerators cannot be added until enough applications need them, but applications will not rely on the accelerators until they are present in the infrastructure. By decoupling the servers and FPGAs, software services that demand more FPGA capacity can harness spare FPGAs from other services that are slower to adopt (or do not require) the accelerator fabric.

We measure the system's performance characteristics using web search to represent local acceleration, network flow encryption and network flow offload to represent network acceleration, and machine learning to represent global acceleration.

Local Acceleration

We brought up a production Bing web search ranking service on the servers, with 3,081 of these machines using the FPGA for local acceleration, and the rest used for other functions associated with web search. Unlike in our previous work, we implemented only the most expensive feature calculations and omitted less-expensive feature calculations, the post-processed synthetic features, and the machine-learning calculations.

Figure 2 shows the performance of search ranking running in production over a five-day period, with and without FPGA acceleration. The top two lines show the normalized tail query latencies at the 99.9th percentile (aggregated across all servers over a rolling time window), and the bottom two lines show the corresponding query loads received at each datacenter.

Figure 2. Five-day query throughput and latency of ranking service queries running in production, with and without FPGAs enabled.

Because load varies throughout the day, the queries executed without FPGAs experienced a higher latency with more frequent latency spikes, whereas the FPGA-accelerated queries had much lower, tighter-bound latencies. This is particularly impressive given the much higher query loads experienced by the FPGA-accelerated machines. The higher query load, which was initially unexpected, was due to the top-level load balancers selecting FPGA-accelerated machines over those without FPGAs due to the lower and less-variable latency.

The FPGA-accelerated machines were better at serving query traffic even at higher load, and hence were assigned additional queries, nearly twice the load that software alone could handle.

Infrastructure Acceleration

Although local acceleration such as Bing search was the primary motivation for deploying FPGAs into Microsoft's datacenters at hyperscale, the level of effort required to port the Bing software stack to FPGAs is not sustainable for the more than 200 first-party workloads currently deployed in Microsoft datacenters, not to mention the thousands of third-party applications running on Microsoft Azure. The architecture needs to provide benefit to workloads that have no specialized code for running on FPGAs by accelerating components common across many different workloads. We call this widely applicable acceleration infrastructure acceleration. Our first example is acceleration of network services.

Nearly all hyperscale workloads rely on a fast, reliable, and secure network between many machines. By accelerating network processing in ways such as protocol offload and host-to-host network crypto, the FPGAs can accelerate workloads that have no specific tuning for FPGA offload. This is increasingly important as growing network speeds put greater pressure on CPU cores trying to keep up with protocol processing and line-rate crypto.

In the case of network crypto, each packet is examined and encrypted or decrypted at line rate as necessary while passing from the NIC through the FPGA to the network switch. Thus, once an encrypted flow is set up, no CPU usage is required to encrypt or decrypt the packets. Encryption occurs transparently from the perspective of the software, which sees unencrypted packets at the endpoints. Network encryption/decryption offload yields significant CPU savings. Our AES-128 implementations support a full 40 Gbits per second (Gbps) of encryption/decryption with no load on the CPU beyond setup and teardown. Achieving the same performance in software requires more than four CPU cores running at full utilization.
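As a rough sanity check on that four-core figure, the arithmetic is simple; the per-core software AES throughput below is an assumed round number chosen only to illustrate the calculation, not a measurement from the article.

    # Back-of-the-envelope version of the claim above. The 10 Gbps-per-core
    # software AES figure is an assumption for illustration only.
    line_rate_gbps = 40.0          # the FPGA handles this at line rate
    per_core_aes_gbps = 10.0       # assumed software AES-128 throughput per core

    cores_needed = line_rate_gbps / per_core_aes_gbps
    print(f"cores needed for 40 Gbps software crypto: about {cores_needed:.0f}")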
The same Configurable Cloud architecture has also been used to accelerate software-defined networking,3 in which bulk packet operations are offloaded to the FPGA under the software's policy control. The initial implementation gave Microsoft Azure the fastest public cloud network, with 25 μs latency and 25 Gbps of bandwidth. This service was offered free to third-party users, who can benefit from FPGA acceleration without writing any code for the FPGA.

Hardware as a Service: Shared, Remote Accelerators

Most workloads can benefit from local acceleration, infrastructure acceleration, or both. There are two key motivations behind enabling remote hardware services: hardware accelerators should be placed flexibly anywhere in the hyperscale datacenter network, and the resource requirements for the hardware fabric (FPGAs) should scale commensurately with software demand (CPU), not just one server to one FPGA.

To address the first requirement, remote accelerators are never more than a few network hops away from any other server across hundreds of thousands to millions of servers. Since each server has acceleration capabilities, any accelerator can be mapped to any location.

Similarly, some software services have underutilized FPGA resources, while others need more than one. This architecture lets all accelerators in the datacenter communicate directly, enabling harvesting of FPGAs from the deployment for services with greater needs. This also allows the allocation of thousands of FPGAs for a single job or service, independent of their CPU hosts and without impacting the CPU's performance, in effect creating a new kind of computer embedded in the datacenter. A demonstration of this pooled FPGA capability at Microsoft's Ignite conference in 2016 showed that four harvested FPGAs translated 5.2 million Wikipedia articles from English to Spanish five orders of magnitude faster than a 24-core server running highly tuned vectorized code.4

We developed a custom FPGA-to-FPGA network protocol called the Lightweight Transport Layer (LTL), which uses the User Datagram Protocol for frame encapsulation and Internet Protocol for routing packets across the datacenter network. Low-latency communication demands infrequent packet drops and infrequent packet reorders. By using lossless traffic classes provided in datacenter switches and provisioned for traffic such as Remote Direct Memory Access and Fibre Channel over Ethernet, we avoid most packet drops and reorders. Separating out such traffic to its own classes also protects the datacenter's baseline TCP traffic. Because the FPGAs are so tightly coupled to the network, they can react quickly and efficiently to congestion notification and back off when needed to reduce packets dropped from in-cast patterns.

At the endpoints, the LTL protocol engine uses an ordered, reliable connection-based interface with statically allocated, persistent connections, realized using send and receive connection tables. The static allocation and persistence (until they are deallocated) reduce latency for inter-FPGA and inter-service messaging, because once established, they can communicate with low latency. Reliable messaging also reduces protocol latency. Although datacenter networks are already fairly reliable, LTL provides a strong reliability guarantee via an ACK/NACK-based retransmission scheme. When packet reordering is detected, NACKs are used to request timely retransmission of particular packets without waiting for a time-out.
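The ACK/NACK mechanism is easier to see in a toy model. The sketch below is an illustrative, in-process Python analogue of the behavior just described, not the LTL implementation itself (which is FPGA logic operating on UDP frames with per-connection send and receive tables); all class and method names are hypothetical.

    # Toy model of ACK/NACK-based reliable, ordered delivery (illustrative only).
    class Sender:
        def __init__(self):
            self.next_seq = 0
            self.unacked = {}                     # seq -> payload, kept until ACKed

        def send(self, payload, link):
            seq = self.next_seq
            self.next_seq += 1
            self.unacked[seq] = payload
            link.deliver(seq, payload)

        def on_ack(self, seq):                    # cumulative ACK
            for s in [s for s in self.unacked if s <= seq]:
                del self.unacked[s]

        def on_nack(self, seq, link):             # retransmit now, no timeout wait
            if seq in self.unacked:
                link.deliver(seq, self.unacked[seq])

    class Receiver:
        def __init__(self, sender, link):
            self.expected, self.buffer = 0, {}
            self.sender, self.link = sender, link

        def on_packet(self, seq, payload):
            if seq < self.expected:
                return                            # duplicate, drop it
            if seq > self.expected:               # gap detected: NACK the holes
                self.buffer[seq] = payload
                for missing in range(self.expected, seq):
                    self.sender.on_nack(missing, self.link)
                return
            self._deliver(seq, payload)           # in order
            while self.expected in self.buffer:   # drain anything buffered
                self._deliver(self.expected, self.buffer.pop(self.expected))

        def _deliver(self, seq, payload):
            print("delivered", seq, payload)
            self.expected = seq + 1
            self.sender.on_ack(seq)

    class LossyLink:
        """Drops sequence number 1 once, to exercise the NACK path."""
        def __init__(self):
            self.receiver, self.dropped = None, False

        def deliver(self, seq, payload):
            if seq == 1 and not self.dropped:
                self.dropped = True
                return
            self.receiver.on_packet(seq, payload)

    link = LossyLink()
    sender = Sender()
    link.receiver = Receiver(sender, link)
    for msg in ["a", "b", "c"]:
        sender.send(msg, link)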
To evaluate LTL and resource sharing, we implemented and deployed a latency-optimized Deep Neural Network (DNN) accelerator developed for natural language processing, and we used a synthetic stress test to simulate DNN request traffic at varying levels of oversubscription. By increasing the ratio of clients to accelerators (by removing FPGAs from the pool), we measure the impact on latency due to oversubscription. The synthetic stress test generated by a single software client is calibrated to generate at least twice the worst-case traffic expected in production (thus, even with 1:1 oversubscription, the offered load and latencies are highly conservative).

Figure 3 shows request latencies as oversubscription increases. In the 1:1 case (no oversubscription), remote access adds less than 4.7 percent additional latency to each request up to the 95th percentile, and 32 percent at the 99th percentile. As expected, contention and queuing delay increase as oversubscription increases. Eventually, the FPGA reaches its peak throughput and saturates. In this case study, each individual FPGA has sufficient throughput to comfortably support roughly two clients even at artificially high traffic levels, demonstrating the feasibility of sharing accelerators and freeing resources for other functions.

Figure 3. Average, 95th, and 99th percentile latencies to a remote Deep Neural Network (DNN) accelerator (normalized to locally attached performance in each latency category).

Managing remote accelerators requires significant hardware and software support. A complete overview of the management of our hardware fabric, called Hardware as a Service (HaaS), is beyond the scope of this article, but we provide a short overview of the platform here. HaaS manages FPGAs in a manner similar to Yarn5 and other job schedulers. A logically centralized resource manager (RM) tracks FPGA resources throughout the datacenter. The RM provides simple APIs for higher-level service managers (SMs) to easily manage FPGA-based hardware components through a lease-based model. Each component is an instance of a hardware service made up of one or more FPGAs and a set of constraints (for example, locality or bandwidth). SMs manage service-level tasks, such as load balancing, intercomponent connectivity, and failure handling, by requesting and releasing component leases through the RM. An SM provides pointers to the hardware service to one or more users to take advantage of the hardware acceleration. An FPGA manager (FM) runs on each node to provide configuration support and status monitoring for the system.
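To make the lease-based model concrete, here is a minimal Python sketch of an RM-style allocator with a locality constraint. It only illustrates the workflow described above; the real HaaS APIs are not specified in this article, and every name below is hypothetical.

    # Minimal sketch of a lease-based FPGA resource manager (illustrative only).
    import itertools

    class ResourceManager:
        def __init__(self):
            self.free = {}                        # fpga_id -> {"rack": ...}
            self.leases = {}                      # lease_id -> {fpga_id: meta}
            self._ids = itertools.count()

        def register(self, fpga_id, rack):
            self.free[fpga_id] = {"rack": rack}

        def acquire(self, count, same_rack=None):
            """Lease `count` FPGAs for one hardware-service component,
            optionally constrained to a single rack (a locality constraint)."""
            candidates = [f for f, m in self.free.items()
                          if same_rack is None or m["rack"] == same_rack]
            if len(candidates) < count:
                return None                       # SM retries or relaxes constraints
            lease_id = next(self._ids)
            self.leases[lease_id] = {f: self.free.pop(f) for f in candidates[:count]}
            return lease_id

        def release(self, lease_id):
            self.free.update(self.leases.pop(lease_id))   # harvestable again

    rm = ResourceManager()
    for i in range(4):
        rm.register(f"fpga{i}", rack="r1" if i < 2 else "r2")

    lease = rm.acquire(count=2, same_rack="r1")   # an SM requests a component
    print("lease", lease, "holds", sorted(rm.leases[lease]))
    rm.release(lease)                             # and returns it to the pool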

Hardware Details

In addition to the architectural requirement to provide sufficient flexibility to justify scale production deployment, there are also physical restrictions in current infrastructures that must be overcome. These restrictions include strict power limits, a small physical space in which to fit, resilience to hardware failures, and tolerance to high temperatures. For example, the accelerator architecture we describe targets the widely used OpenCompute server, which constrained power to 35 W, the physical size to roughly a half-height, half-length PCIe expansion card, and tolerance to an inlet air temperature of 70°C at low airflow.

We designed the accelerator board as a stand-alone FPGA board that is added to the PCIe expansion slot in a production server configuration. Figure 4 shows a photograph of the board with major components labeled. The FPGA is a Stratix V D5, with 172,600 adaptive logic modules of programmable logic, 4 Gbytes of DDR3-1600 DRAM, two independent PCIe Gen 3 x8 connections, and two independent 40 Gigabit Ethernet interfaces. The realistic power draw of the card under worst-case environmental conditions is 35 W.

Figure 4. Photograph of the manufactured board. The DDR channel is implemented using discrete components. PCI Express connectivity goes through a mezzanine connector on the bottom side of the board (not shown).

The dual 40 Gigabit Ethernet interfaces on the board could allow for a private FPGA network, as was done in previous work, but this configuration also lets us wire the FPGA as a "bump in the wire," sitting between the NIC and the top-of-rack (ToR) switch. Rather than cabling the standard NIC directly to the ToR, the NIC is cabled to one port of the FPGA, and the other FPGA port is cabled to the ToR (see Figure 1b).

Maintaining the discrete NIC in the system lets us leverage all the existing network offload and packet transport functionality hardened into the NIC. This minimizes the code required to deploy FPGAs to simple bypass logic. In addition, both FPGA resources and PCIe bandwidth are preserved for acceleration functionality, rather than being spent on implementing the NIC in soft logic.

One potential drawback to the bump-in-the-wire architecture is that an FPGA failure, such as loading a buggy application, could cut off network traffic to the server, rendering the server unreachable. However, unlike in a ring or torus network, failures in the bump-in-the-wire architecture do not degrade any neighboring FPGAs, making the overall system more resilient to failures. In addition, most datacenter servers (including ours) have a side-channel management path that exists to power servers on and off. By policy, the known-good golden image that loads on power up is rarely (if ever) overwritten, so the management network can always recover the FPGA with a known-good configuration, making the server reachable via the network once again.

In addition, FPGAs have proven to be reliable and resilient at hyperscale, with only 0.03 percent of boards failing across our deployment after one month of processing full-production traffic, and with all failures happening at the beginning of production. Bit flips in the configuration layer were measured at an average rate of one per 1,025 machine days. We used configuration-layer monitoring and correcting circuitry to minimize the impact of these single-event upsets. Given aggregate datacenter failure rates, we deemed the FPGA-related hardware failures to be acceptably low for production.
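Scaled to the 5,760-server evaluation bed, those reported rates work out to only a handful of events per day; the quick expectation below uses nothing beyond the figures quoted above.

    # Expected events implied by the reliability figures above, for the
    # 5,760-server evaluation bed (simple expectations, nothing more).
    servers = 5760
    upsets_per_machine_day = 1 / 1025      # one config-layer bit flip per 1,025 machine days
    board_failure_rate = 0.0003            # 0.03 percent of boards, first month

    print(f"~{servers * upsets_per_machine_day:.1f} correctable config upsets per day")
    print(f"~{servers * board_failure_rate:.1f} board failures in the first month")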
The Configurable Cloud architecture is a major advance in the way datacenters are being designed and utilized. The impact of this design goes far beyond just an improved network design.

The collective architecture and software protocols described in this article can be seen as a fundamental shift in the role of CPUs in the datacenter. Rather than the CPU controlling every task, the FPGA is now the gatekeeper between the server and the network, determining how incoming and outgoing data will be processed and handling common cases quickly and efficiently. In such a model, the FPGA calls the CPU to handle infrequent and/or complex work that the FPGA cannot handle itself. Such an architecture adds another mode of operation to a traditional machine, potentially removing the CPU as the machine's master. Such an organization can be viewed as relegating the CPU to be a complexity offload engine for the FPGA.

Of course, there will be many applications in which the CPUs handle the bulk of the computation. In that case, the FPGAs attached to those CPUs can be used over the network by other applications running on different servers that need more FPGA resources, in a few tens of microseconds, across the datacenter.

Such a Configurable Cloud provides enormous flexibility in how computation is done and in the placement of computational tasks, enabling the right computational unit to be assigned a particular task at the right time. Thus, additional efficiency can be extracted from the hardware, allowing accelerators to be shared and underutilized resources to be reclaimed and repurposed independently of the other resources. In addition, by distributing heterogeneous accelerators throughout the network, this architecture avoids the network bottlenecks, cost, and complexities of bolt-on clusters of specialized accelerator nodes.

As a result, Configurable Clouds enable the performant implementation of wildly different functions on exactly the same hardware. In addition, specialized hardware is far more efficient than CPUs, making Configurable Clouds better for the environment.

This work has already had significant impact. Microsoft is building the vast majority of its next-generation datacenters across 15 countries and 5 continents using this architecture. Microsoft has publicly described how that architecture is being used to improve Bing search quality and performance and Azure networking capabilities and performance. Those deployments and the results of accelerating applications on them provide further confirmation of programmable accelerators' value for datacenter services.

FPGA architectures are being heavily influenced by this work. Investment into datacenter FPGA technology and ecosystems by Microsoft and other major companies is increasing, not least being Intel's acquisition of Altera for $16.7 billion, as well as the recent introduction of FPGAs at a limited scale by the majority of the other major cloud providers. MICRO

References
1. A. Putnam et al., "A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services," Proc. 41st Ann. Int'l Symp. Computer Architecture, 2014, pp. 13–24.
2. A.M. Caulfield et al., "A Cloud-Scale Acceleration Architecture," Proc. 49th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2016; doi:10.1109/MICRO.2016.7783710.
3. Y. Khalidi et al., "Microsoft Azure Networking: New Network Services, Features and Scenarios," Microsoft Ignite, 2016; http://channel9.msdn.com/Events/Ignite/2016/BRK3237-TS.
4. S. Nadella, "Innovation Keynote," Microsoft Ignite, 2016; http://channel9.msdn.com/Events/Ignite/2016/KEY02.
5. V.K. Vavilapalli et al., "Apache Hadoop Yarn: Yet Another Resource Negotiator," Proc. 4th Ann. Symp. Cloud Computing, 2013, pp. 5:1–5:16.
Adrian M. Caulfield is a principal research hardware development engineer at Microsoft Research. His research interests include computer architecture and reconfigurable computing. Caulfield received a PhD in computer engineering from the University of California, San Diego. Contact him at [email protected].

Eric S. Chung is a researcher at Microsoft Research NExT, where he leads an effort to accelerate large-scale machine learning using FPGAs. Chung received a PhD in electrical and computer engineering from Carnegie Mellon University. Contact him at [email protected].

Andrew Putnam is a principal research hardware development engineer at Microsoft Research NExT. His research interests include future datacenter design and computer architecture. Putnam received a PhD in computer science and engineering from the University of Washington. Contact him at [email protected].

Hari Angepat is a senior software engineer at Microsoft and a PhD candidate at the University of Texas at Austin. His research interests include novel FPGA accelerator architectures, hardware/software codesign techniques, and approaches for flexible hardware design. Angepat received an MS in computer engineering from the University of Texas at Austin. Contact him at [email protected].

Daniel Firestone is the tech lead and manager for the Azure Networking Host SDN team at Microsoft. His team builds the Azure virtual switch and SmartNIC. Contact him at [email protected].

Jeremy Fowers is a senior research hardware design engineer in the Catapult team at Microsoft Research NExT. He specializes in the design and implementation of FPGA accelerators across a variety of application domains, and is currently focused on machine learning. Fowers received a PhD in electrical engineering from the University of Florida. Contact him at [email protected].

Michael Haselman is a senior software engineer at Microsoft ASG (Bing). His research interests include FPGAs, computer architecture, and distributed computing. Haselman received a PhD in electrical engineering from the University of Washington. Contact him at [email protected].

Stephen Heil is a principal program manager at Microsoft Research. His research interests include field-programmable gate arrays, application accelerators, and rack-scale system design. Heil received a BS in electrical engineering technology and computer science from the College of New Jersey (formerly Trenton State College). Contact him at [email protected].

Matt Humphrey is a principal engineer at Microsoft, where he works on Azure. His research interests include analog and digital electronics and the architecture of high-scale distributed software systems. Humphrey received an MS in electrical engineering from the Georgia Institute of Technology. Contact him at [email protected].

Puneet Kaur is a principal software engineer at Microsoft. Her research interests include distributed systems and scalability. Kaur received an MTech in computer science from the Indian Institute of Technology, Kanpur. Contact her at [email protected].

Joo-Young Kim is a senior research hardware development engineer at Microsoft Research. His research interests include high-performance, energy-efficient computer architectures for various datacenter workloads, such as data compression, video transcoding, and machine learning. Kim received a PhD in electrical engineering from Korea Advanced Institute of Science and Technology (KAIST). Contact him at [email protected].

Daniel Lo is a research hardware development engineer at Microsoft Research. His research interests include designing high-performance accelerators on FPGAs. Lo received a PhD in electrical and computer engineering from Cornell University. Contact him at [email protected].

Todd Massengill is a senior hardware design engineer at Microsoft Research. His research interests include hardware acceleration of artificial intelligence applications, biologically inspired computing, and tools to improve hardware engineering design, collaboration, and efficiency. Massengill received an MS in electrical engineering from the University of Washington. Contact him at [email protected].

Kalin Ovtcharov is a research hardware development engineer at Microsoft Research NExT. His research interests include accelerating computationally intensive tasks on FPGAs in areas such as machine learning, image processing, and video compression. Ovtcharov received a BS in computer engineering from McMaster University. Contact him at [email protected].

Michael Papamichael is a researcher at Microsoft Research, where he's working on the Catapult project. His research interests include hardware acceleration, reconfigurable computing, on-chip interconnects, and methodologies to facilitate hardware specialization. Papamichael received a PhD in computer science from Carnegie Mellon University. Contact him at [email protected].

Lisa Woods is a principal program manager for the Catapult project at Microsoft Research, where she focuses on driving strategic alignment between the Catapult team and its many internal and external partners as well as scalability and execution for the project. Woods received an MS in computer science. Contact her at [email protected].

Sitaram Lanka is a partner group engineering manager in Search Platform at Microsoft Bing. His research interests include web and enterprise search, large-scale distributed systems, machine learning, and reconfigurable hardware in datacenters. Lanka received a PhD in computer science from the University of Pennsylvania. Contact him at [email protected].

Derek Chiou is a partner hardware group engineering manager at Microsoft and a research scientist at the University of Texas at Austin. His research interests include accelerating datacenter applications and infrastructure, rapid system design, and fast, accurate simulation. Chiou received a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology. Contact him at [email protected].

Doug Burger is a distinguished engineer at Microsoft, where he leads a Disruptive Systems team at Microsoft Research NExT and cofounded Project Catapult. Burger received a PhD in computer science from the University of Wisconsin. He is an IEEE and ACM Fellow. Contact him at [email protected].

SPECIALIZING A PLANET'S COMPUTATION: ASIC CLOUDS


ASIC CLOUDS, A NATURAL EVOLUTION TO CPU- AND GPU-BASED CLOUDS, ARE PURPOSE-BUILT DATACENTERS FILLED WITH ASIC ACCELERATORS. ASIC CLOUDS MAY SEEM IMPROBABLE DUE TO HIGH NON-RECURRING ENGINEERING (NRE) COSTS AND ASIC INFLEXIBILITY, BUT LARGE-SCALE BITCOIN ASIC CLOUDS ALREADY EXIST. THIS ARTICLE DISTILLS LESSONS FROM THESE PRIMORDIAL ASIC CLOUDS AND PROPOSES NEW PLANET-SCALE YOUTUBE-STYLE VIDEO-TRANSCODING AND DEEP LEARNING ASIC CLOUDS, SHOWING SUPERIOR TOTAL COST OF OWNERSHIP. ASIC CLOUD NRE AND ECONOMICS ARE ALSO EXAMINED.

Moein Khazraee, Luis Vega Gutierrez, Ikuo Magaki, and Michael Bedford Taylor
University of California, San Diego

In the past 10 years, two parallel phase changes in the computational landscape have emerged. The first change is the bifurcation of computation into two sectors—cloud and mobile. The second change is the rise of dark silicon and dark-silicon-aware design techniques, such as specialization and near-threshold computation.1 Recently, researchers and industry have started to examine the conjunction of these two phase changes. Baidu has developed GPU-based clouds for distributed neural network accelerators, and Microsoft has deployed clouds based on field-programmable gate arrays (FPGAs) for Bing.

At a single-node level, we know that application-specific integrated circuits (ASICs) can offer order-of-magnitude improvements in energy efficiency and cost performance over CPU, GPU, and FPGA by specializing silicon for a particular computation. Our research proposes ASIC Clouds,2 which are purpose-built datacenters comprising large arrays of ASIC accelerators. ASIC Clouds are not ASIC supercomputers that scale up problem sizes for a single tightly coupled computation; rather, they target workloads comprising many independent but similar jobs.

As more and more services are built around the Cloud model, we see the emergence of planet-scale workloads in which datacenters are performing the same computation across many users. For example, consider Facebook's face recognition of uploaded pictures, or Apple's Siri voice recognition, or the Internal Revenue Service performing tax audits with neural nets. Such scale-out workloads can easily leverage racks of ASIC servers containing arrays of chips that in turn connect arrays of replicated compute accelerators (RCAs) on an on-chip network. The large scale of these workloads creates the economic justification to pay the non-recurring engineering (NRE) costs of ASIC development and deployment.


As a workload grows, the ASIC Cloud can be scaled in the datacenter by adding more ASIC servers, unlike accelerators in, say, a mobile phone population,3 in which the accelerator/processor mix is fixed at tape out.

Our research examines ASIC Clouds in the context of four key applications that show great potential for ASIC Clouds, including YouTube-style video transcoding, Bitcoin and Litecoin mining, and deep learning. ASICs achieve large reductions in silicon area and energy consumption versus CPUs, GPUs, and FPGAs. We specialize the ASIC server to maximize efficiency, employing optimized ASICs, a customized printed circuit board (PCB), custom-designed cooling systems and specialized power delivery systems, and tailored DRAM and I/O subsystems. ASIC voltages are customized to tweak energy efficiency and minimize total cost of ownership (TCO). The datacenter itself can also be specialized, optimizing rack-level and datacenter-level thermals and power delivery to exploit the knowledge of the computation. We developed tools that consider all aspects of ASIC Cloud design in a bottom-up way, and methodologies that reveal how the designers of these novel systems can optimize TCO in real-world ASIC Clouds. Finally, we propose a new rule that explains when it makes sense to design and deploy an ASIC Cloud, considering NRE.

ASIC Cloud Architecture

At the heart of any ASIC Cloud is an energy-efficient, high-performance, specialized RCA that is multiplied up by having multiple copies per ASIC, multiple ASICs per server, multiple servers per rack, and multiple racks per datacenter (see Figure 1). Work requests from outside the datacenter will be distributed across these RCAs in a scale-out fashion. All system components can be customized for the application to minimize TCO.

Figure 1. High-level abstract architecture of an ASIC Cloud. Specialized replicated compute accelerators (RCAs) are multiplied up by having multiple copies per application-specific integrated circuit (ASIC), multiple ASICs per server, multiple servers per rack, and multiple racks per datacenter. The server controller can be a field-programmable gate array (FPGA), a microcontroller, or a Xeon processor. The power delivery and cooling system are customized based on ASIC needs. If required, there would be DRAMs on the printed circuit board (PCB) as well. (PSU: power supply unit.)
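Because the structure in Figure 1 is purely multiplicative, aggregate throughput composes as a product of the replication factors at each level. The numbers below are placeholders chosen only to show the composition, not values from the article.

    # Aggregate throughput of the Figure 1 hierarchy (placeholder numbers).
    ops_per_rca      = 1.0e9    # ops/s delivered by one RCA
    rcas_per_asic    = 64
    asics_per_server = 32
    servers_per_rack = 40
    racks            = 100

    total = ops_per_rca * rcas_per_asic * asics_per_server * servers_per_rack * racks
    print(f"aggregate throughput: {total:.2e} ops/s")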

Each ASIC interconnects its RCAs using a customized on-chip network. The ASIC's control plane unit also connects to this network and schedules incoming work from the ASIC's off-chip router onto the RCAs. Next, the packaged ASICs are arranged in lanes on a customized PCB and connected to a controller that bridges to the off-PCB interface (1 to 100 Gigabit Ethernet, Remote Direct Memory Access, and PCI Express). In some cases, DRAMs can connect directly to the ASICs. The controller can be implemented by an FPGA, a microcontroller, or a Xeon processor. It schedules remote procedure calls (RPCs) that come from the off-PCB interface on to the ASICs. Depending on the application, it can implement the non-acceleratable part of the workload or perform UDP/TCP-IP offload.

Each lane is enclosed by a duct and has a dedicated fan blowing air through it across the ASIC heatsinks. Our simulations indicate that using ducts results in better cooling performance than a conventional or staggered layout. The PCB, fans, and power supply are enclosed in a 1U server, which is then assembled into racks in a datacenter. Based on ASIC needs, the power supply unit (PSU) and DC/DC converters are customized for each server.

The "Evaluating an ASIC Server Configuration" sidebar shows our automated methodology for designing a complete ASIC Cloud system.

Application Case Study

To explore ASIC Clouds across a range of accelerator properties, we examined four applications that span a diverse range of properties—namely, Bitcoin mining, Litecoin mining, video transcoding, and deep learning (see Figure 2).

Figure 2. Accelerator properties. We explore applications with diverse requirements.

Perhaps the most critical of these applications is Bitcoin mining. Our inspiration for ASIC Clouds came from our intensive study of Bitcoin mining clouds, which are one of the first known instances of a real-life ASIC Cloud. Figure 3 shows the massive scale out of the Bitcoin-mining workload, which is now operating at the performance of 3.2 billion GPUs. Bitcoin clouds have undergone a rapid ramp from CPU to GPU to FPGA to the most advanced ASIC technology available today. Bitcoin is a logic-intensive design that has high power density and no need for static RAM (SRAM) or external DRAM.

Figure 3. Evolution of specialization: Bitcoin cryptocurrency mining clouds. Numbers are ASIC nodes, in nanometers, which annotate the first date of release of a miner on that technology. Difficulty is the ratio of the total Bitcoin hash throughput of the world, relative to the initial mining network throughput, which was 7.15 MH per second. In the six-year period preceding November 2015, the throughput increased by a factor of 50 billion times, corresponding to a world hash rate of approximately 575 million GH per second.

Litecoin is another popular cryptocurrency mining system that has been deployed into clouds. Unlike Bitcoin, it is an SRAM-intensive application with low power density.

Video transcoding, which converts from one video format to another, currently takes almost 30 high-end Xeon servers to do in real time. Because every cell phone and Internet of Things device can easily be a video source, it has the potential to be an unimaginably large planet-scale computation. Video transcoding is an external-memory-intensive application that needs DRAMs next to each ASIC. It also requires high off-PCB bandwidth.

Finally, deep learning is extremely computationally intensive and is likely to be used by every human on the planet. It is often latency sensitive, so our Deep Learning neural net accelerator has a tight low-latency service-level agreement.

For our Bitcoin and Litecoin studies, we developed the RCA and got the required parameters, such as gate count, from placed-and-routed designs in UMC 28 nm using IC Compiler and analysis tools (such as PrimeTime). For deep learning and video transcoding, we extracted properties from accelerators in the research literature.
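For readers unfamiliar with the workload, the kernel each Bitcoin RCA replicates is a brute-force search over nonces of a double-SHA256 hash of the 80-byte block header, compared against a target. The simplified software model below (standard-library Python with a toy target) is only meant to show why the computation is pure logic with essentially no memory footprint; real miner ASICs implement the SHA-256 rounds directly in hardware.

    # Simplified software model of the Bitcoin mining kernel (illustration only).
    import hashlib
    import struct

    def double_sha256(data: bytes) -> bytes:
        return hashlib.sha256(hashlib.sha256(data).digest()).digest()

    def mine(header76: bytes, target: int, max_nonce: int = 1_000_000):
        """Scan nonces until the header's double-SHA256 falls below `target`."""
        for nonce in range(max_nonce):
            digest = double_sha256(header76 + struct.pack("<I", nonce))
            if int.from_bytes(digest, "little") < target:
                return nonce
        return None

    # Toy header and an easy target so the search finishes almost immediately.
    print("found nonce:", mine(b"\x00" * 76, target=1 << 244))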
Evaluating an ASIC Server Configuration

Our ASIC Cloud server configuration evaluator, shown in Figure A1, starts with an implementation of an accelerator, or a detailed evaluation of the accelerator's properties from the research literature. In the design of an ASIC server, we must decide how many chips should be placed on the printed circuit board (PCB) and how large, in mm2 of silicon, each chip should be. The size of each chip determines how many replicated compute accelerators (RCAs) will be on each chip. In each duct-enclosed lane of ASIC chips, each chip receives around the same amount of airflow from the intake fans, but the most downstream chip receives the hottest air, which includes the waste heat from the other chips. Therefore, the thermally bottlenecking ASIC is the one in the back, shown in our detailed computational fluid dynamics (CFD) simulations in Figure A2. Our simulations show that breaking a fixed heat source into smaller ones with the same total heat output improves the mixing of warm and cold areas, resulting in lower temperatures. Using thermal optimization techniques, we established a fundamental connection between an RCA's properties, the number of RCAs placed in an ASIC, and how many ASICs go on a PCB in a server. Given these properties, our heat sink solver determines the optimal heat sink configuration. Results are validated with the CFD simulator. In the "Design Space Exploration" sidebar, we show how we apply this evaluation flow across the design space to determine TCO and Pareto-optimal points that trade off cost per operation per second (ops/s) and watts per ops/s.


Figure A. ASIC server evaluation flow. (1) The server cost, per-server hash rate, and energy efficiency are evaluated using replicated compute accelerator (RCA) properties and a flow that optimizes server heatsinks, die size, voltage, and power density. (2) Thermal verification of an ASIC Cloud server using Computational Fluid Dynamics tools to validate the flow results. The farthest ASIC from the fan has the highest temperature and is the bottleneck for power per ASIC at a fixed voltage and energy efficiency.

Design Space Exploration

After all thermal constraints were in place, we optimized ASIC server design targeting two conventional key metrics—namely, cost per ops/s and power per ops/s—and then applied TCO analysis. TCO analysis incorporates the datacenter-level constraints, including the cost of power delivery inside the datacenter, land, depreciation, interest, and the cost of energy itself. With these tools, we can correctly weight these two metrics and find the overall optimal point (TCO-optimal) for the ASIC Cloud.

Design-space exploration is application dependent, and there are frequently additional constraints. For example, for the video transcoding application, we model the PCB real estate occupied by the DRAMs, which are placed on either side of the ASIC they connect to, perpendicular to airflow. As the number of DRAMs increases, the number of ASICs placed in a lane decreases for space reasons. We model the more expensive PCBs required by DRAM, with more layers and better signal/power integrity. We employ two 10-Gigabit Ethernet ports as the off-PCB interface for network-intensive clouds, and we model the area and power of the memory controllers.

Our ASIC Cloud infrastructure explores a comprehensive design space, including DRAMs per ASIC, logic voltage, area per ASIC, and number of chips. DRAM cost and power overhead are significant, and so the Pareto-optimal video transcoding designs ensure DRAM bandwidth is saturated and link chip performance to DRAM count. As voltage and frequency are lowered, area increases to meet the performance requirement. Figure B shows the video transcoding Pareto curve for five ASICs per lane and different numbers of DRAMs per ASIC. The tool comprises two tiers. The top tier uses brute force to explore all possible configurations to find the energy-optimal, cost-optimal, and TCO-optimal points based on the Pareto results. The leaf tier comprises various expert solvers that compute the optimal properties of the server components—for example, CFD simulations for heat sinks, DC-DC converter allocation, circuit area/delay/voltage/energy estimators, and DRAM property simulation. In many cases, these solvers export their data as large tables of memoized numbers for every component.

Figure B. Pareto curve example for video transcoding. Exploring different numbers of DRAMs per ASIC and logic voltage for the optimal TCO per performance point. Voltage increases from left to right. Diagonal lines show equal TCO per performance values; the closer to the origin, the lower the TCO per performance. This plot is for five ASICs per lane.
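A stripped-down software analogue of that two-tier flow follows: the brute-force top tier sweeps a tiny design space, and a stand-in TCO model weights the two Pareto metrics into one figure of merit. The candidate points are the video-transcoding numbers later reported in Table 1c; the amortization period and energy-cost weight are invented for illustration, so only the shape of the calculation, not the constants, should be read as meaningful.

    # Toy two-tier evaluator: brute-force sweep plus a stand-in TCO model.
    def tco_per_ops(dollars_per_ops, watts_per_ops,
                    amortization_years=4, dollars_per_watt_year=2.0):
        # Weight the two Pareto metrics into one figure of merit: amortized
        # server cost plus the cost of powering and cooling it.
        return dollars_per_ops / amortization_years + watts_per_ops * dollars_per_watt_year

    # (DRAMs per ASIC, logic voltage) -> ($ per Kfps, W per Kfps), from Table 1c.
    design_space = {
        (3, 0.538): (57.68, 9.073),
        (6, 0.754): (33.56, 10.34),
        (9, 1.339): (29.52, 16.37),
    }

    best_config = min(design_space, key=lambda c: tco_per_ops(*design_space[c]))
    print("TCO-optimal configuration:", best_config)   # with these illustrative
                                                       # weights, the middle point
                                                       # wins, as in Table 1c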

Results

Table 1 gives details of optimal server configurations for energy-, TCO-, and cost-optimal designs for each application. The "Design Space Exploration" sidebar explains how these optimal configurations are determined.

For example, for video transcoding, the cost-optimal server packs the maximum number of DRAMs per lane, 36, which maximizes performance. However, increasing the number of DRAMs per ASIC requires higher logic voltage (1.34 V) and corresponding frequencies to attain performance within the maximum die area constraint, resulting in less-energy-efficient designs. Hence, the energy-optimal design has fewer DRAMs per ASIC and per lane (24), but it gains back some performance by increasing ASICs per lane, which is possible due to lower power density at 0.54 V. The TCO-optimal design increases DRAMs per lane to 30 to improve performance, but is still close to the optimal energy efficiency at 0.75 V, resulting in a die size and frequency between the other two optimal points.

Figure 4 compares the performance of CPU Clouds, GPU Clouds, and ASIC Clouds for the four applications that we presented. ASIC Clouds outperform CPU Clouds' TCO per operations per second (ops/s) by 6,270, 704, and 8,695 times for Bitcoin, Litecoin, and video transcoding, respectively. ASIC Clouds outperform GPU Clouds' TCO per ops/s by 1,057, 155, and 199 times for Bitcoin, Litecoin, and deep learning, respectively.

Figure 4. CPU Cloud versus GPU Cloud versus ASIC Cloud death match. ASIC servers greatly outperform the best non-ASIC alternative in terms of TCO per operations per second (ops/s).

Table 1. ASIC Cloud optimization results for four applications: (a) Bitcoin, (b) Litecoin, (c) video transcoding, and (d) deep learning.

(a) Bitcoin
Property | Energy-optimal server | TCO-optimal server | Cost-optimal server
ASICs per server | 120 | 72 | 24
Logic voltage (V) | 0.400 | 0.459 | 0.594
Clock frequency (MHz) | 71 | 149 | 435
Die area (mm2) | 599 | 540 | 240
GH per second (GH/s) per server | 7,292 | 8,223 | 3,451
W per server | 2,645 | 3,736 | 2,513
Cost ($) per server | 12,454 | 8,176 | 2,458
W per GH/s | 0.363 | 0.454 | 0.728
Cost ($) per GH/s | 1.708 | 0.994 | 0.712
Total cost of ownership (TCO) per GH/s | 3.344 | 2.912 | 3.686

(b) Litecoin
Property | Energy-optimal server | TCO-optimal server | Cost-optimal server
ASICs per server | 120 | 120 | 72
Logic voltage (V) | 0.459 | 0.656 | 0.866
Clock frequency (MHz) | 152 | 576 | 823
Die area (mm2) | 600 | 540 | 420
MH/s per server | 405 | 1,384 | 916
W per server | 783 | 3,662 | 3,766
$ per server | 10,971 | 11,156 | 6,050
W per MH/s | 1.934 | 2.645 | 4.113
$ per MH/s | 27.09 | 8.059 | 6.607
TCO per MH/s | 37.87 | 19.49 | 23.70

(c) Video transcoding
Property | Energy-optimal server | TCO-optimal server | Cost-optimal server
DRAMs per ASIC | 3 | 6 | 9
ASICs per server | 64 | 40 | 32
Logic voltage (V) | 0.538 | 0.754 | 1.339
Clock frequency (MHz) | 183 | 429 | 600
Die area (mm2) | 564 | 498 | 543
Kilo frames per second (Kfps) per server | 126 | 158 | 189
W per server | 1,146 | 1,633 | 3,101
$ per server | 7,289 | 5,300 | 5,591
W per Kfps | 9.073 | 10.34 | 16.37
$ per Kfps | 57.68 | 33.56 | 29.52
TCO per Kfps | 100.3 | 78.46 | 97.91

(d) Deep learning
Property | Energy-optimal server | TCO-optimal server | Cost-optimal server
Chip type | 4 × 2 | 2 × 2 | 2 × 1
ASICs per server | 32 | 64 | 96
Logic voltage (V) | 0.900 | 0.900 | 0.900
Clock frequency (MHz) | 606 | 606 | 606
Tera-operations per second (Tops/s) per server | 470 | 470 | 353
W per server | 3,278 | 3,493 | 2,971
$ per server | 7,809 | 6,228 | 4,146
W per Tops/s per server | 6.975 | 7.431 | 8.416
$ per Tops/s per server | 16.62 | 13.25 | 11.74
TCO per Tops/s per server | 46.22 | 44.28 | 46.51

*The energy-optimal server uses lower voltage to increase the energy efficiency. The cost-optimal server uses higher voltage to increase silicon efficiency. The TCO-optimal server has a voltage between these two and balances energy versus silicon cost.


ASIC Cloud Feasibility: The Two-for-Two Rule

When does it make sense to design and deploy an ASIC Cloud? The key barrier is the cost of developing the ASIC server, which includes both the mask costs (about $1.5 million for the 28-nm node we consider here) and the ASIC design costs, which collectively comprise the NRE expense. To understand this tradeoff, we proposed the two-for-two rule. If the cost per year (that is, the TCO) for running the computation on an existing cloud exceeds the NRE by two times, and you can get at least a two-times TCO improvement per ops/s, then building an ASIC Cloud is likely to save money. Figure 5 shows a wider range of break-even points. Essentially, as the TCO exceeds the NRE by more and more, the required speedup to break even declines. As a result, almost any accelerator proposed in the literature, no matter how modest the speedup, is a candidate for an ASIC Cloud, depending on the scale of the computation.

Our research makes the key contribution of noting that, in the deployment of ASIC Clouds, NRE and scale can be more determinative than the absolute speedup of the accelerator. The main barrier for ASIC Clouds is to rein in NRE costs so they are appropriate for the scale of the computation. In many research accelerators, TCO improvements are extreme (such as in Figure 5), but authors often unnecessarily target expensive, latest-generation process nodes because they are more cutting-edge. This tendency raises the NRE exponentially, reducing economic feasibility. Our most recent work suggests that a better strategy is to lower NRE cost by targeting older nodes that still attain sufficient TCO per ops/s benefit.5

Figure 5. The two-for-two rule. Moderate speedup with low non-recurring engineering (NRE) beats high speedup at high NRE. The points are break-even points for ASIC Clouds.
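One way to see why the two thresholds work, under the simplifying assumption of a one-year horizon: let C be the yearly TCO of running the workload on an existing cloud and s the TCO-per-ops/s improvement of the ASIC Cloud. Then

\[
\text{savings} \;=\; C - \frac{C}{s} \;=\; C\!\left(1 - \frac{1}{s}\right) \;\ge\; \frac{C}{2} \;\ge\; \text{NRE}
\qquad \text{whenever } s \ge 2 \text{ and } C \ge 2\,\text{NRE},
\]

so the first year's operating savings alone already cover the one-time NRE; longer deployment lifetimes or larger TCO/NRE ratios only strengthen the case, which is the trend Figure 5 plots.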

Our research generalizes primordial Bitcoin ASIC Clouds into an architectural template that can apply across a range of planet-scale applications. Joint knowledge and control over datacenter and hardware design allows ASIC Cloud designers to select the optimal design that optimizes energy and cost proportionally to optimize TCO. Looking to the future, our work suggests that both Cloud providers and silicon foundries would benefit by investing in technologies that reduce the NRE of ASIC design, including open source IP such as RISC-V, in new labor-saving development methodologies for hardware, and in open source back-end CAD tools. With time, mask costs fall by themselves, but older nodes such as 65 nm and 40 nm may provide suitable TCO per ops/s reduction, with one-third to half the mask cost and only a small difference in performance and energy efficiency from 28 nm. This is a major shift from the conventional wisdom in architecture research, which often chooses the best process even though it exponentially increases NRE. Foundries also should take interest in ASIC Cloud's low-voltage scale-out design patterns because they lead to greater silicon wafer consumption than CPUs within fixed environmental energy limits.

With the coming explosive growth of planet-scale computation, we must work to contain the exponentially growing environmental impact of datacenters across the world. ASIC Clouds promise to help address this problem. By specializing the datacenter, they can do greater amounts of computation under environmentally determined energy limits. The future is planet-scale, and specialized ASICs will be everywhere. MICRO

Acknowledgments
This work was partially supported by NSF awards 1228992, 1563767, and 1565446, and by STARnet's Center for Future Architectures Research, an SRC program sponsored by MARCO and DARPA.

References
1. M.B. Taylor, "A Landscape of the Dark Silicon Design Regime," IEEE Micro, vol. 33, no. 5, 2013, pp. 8–19.
2. I. Magaki et al., "ASIC Clouds: Specializing the Datacenter," Proc. 43rd Int'l Symp. Computer Architecture, 2016, pp. 178–190.
3. N. Goulding-Hotta et al., "The GreenDroid Mobile Application Processor: An Architecture for Silicon's Dark Future," IEEE Micro, vol. 31, no. 2, 2011, pp. 86–95.
4. M.B. Taylor, "Bitcoin and the Age of Bespoke Silicon," Proc. Int'l Conf. Compilers, Architectures and Synthesis for Embedded Systems, 2013, article 16.
5. M. Khazraee et al., "Moonwalk: NRE Optimization in ASIC Clouds," Proc. 22nd Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2017, pp. 511–526.

Moein Khazraee is a PhD candidate in the Department of Computer Science and Engineering at the University of California, San Diego. His research interests include ASIC Clouds, NRE, and specialization. Khazraee received an MS in computer science from the University of California, San Diego. Contact him at [email protected].

Luis Vega Gutierrez is a staff research associate in the Department of Computer Science and Engineering at the University of California, San Diego. His research interests include ASIC Clouds, low-cost ASIC design, and systems. Vega received an MSc in electrical and computer engineering from the University of Kaiserslautern, Germany. Contact him at [email protected].

Ikuo Magaki is an engineer at Apple. He performed the work for this article as a Toshiba visiting scholar in the Department of Computer Science and Engineering at the University of California, San Diego. His research interests include ASIC design and ASIC Clouds. Magaki received an MSc in computer science from Keio University, Japan. Contact him at [email protected].

Michael Bedford Taylor advises his PhD students at various well-known west coast universities. He performed the work for this article while at the University of California, San Diego. His research interests include tiled multicore architecture, dark silicon, HLS accelerators for mobile, Bitcoin mining hardware, and ASIC Clouds. Taylor received a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology. Contact him at [email protected].

...... DRAF: A LOW-POWER DRAM-BASED RECONFIGURABLE ACCELERATION FABRIC

......

THE DRAM-BASED RECONFIGURABLE ACCELERATION FABRIC (DRAF) USES COMMODITY DRAM TECHNOLOGY TO IMPLEMENT A BIT-LEVEL, RECONFIGURABLE FABRIC THAT IMPROVES AREA DENSITY BY 10 TIMES AND POWER CONSUMPTION BY MORE THAN 3 TIMES OVER CONVENTIONAL FIELD-PROGRAMMABLE GATE ARRAYS. LATENCY OVERLAPPING AND MULTICONTEXT SUPPORT ALLOW DRAF TO MEET THE PERFORMANCE AND DENSITY REQUIREMENTS OF DEMANDING APPLICATIONS IN DATACENTER AND MOBILE ENVIRONMENTS.

Mingyu Gao
Stanford University
Christina Delimitrou
Cornell University
Dimin Niu
Krishna T. Malladi
Hongzhong Zheng
Bob Brennan
Samsung Semiconductor
Christos Kozyrakis
Stanford University

...... The end of Dennard scaling has made it imperative to turn toward application- and domain-specific acceleration as an energy-efficient way to improve performance.1 Field-programmable gate arrays (FPGAs) have become a prominent acceleration platform as they achieve a good balance between flexibility and efficiency.2 FPGAs have enabled accelerator designs for numerous domains, including datacenter computing,3 in which applications are much more complex and change frequently, and multitenancy sharing is a principal way to achieve resource efficiency.

For FPGA-based accelerators to become widely adopted, their cost must remain low. This is an issue both for large-scale datacenters that are optimized for total cost of ownership and for small mobile devices that have strict budgets for power and chip area. Unfortunately, conventional FPGAs realize arbitrary bit-level logic functions using static RAM (SRAM) based lookup tables and configurable interconnects, both of which incur significant area and power overheads. The poor logic density and high power consumption limit the functionality that one can implement within an FPGA. Previous research has used networks of medium-sized FPGAs3 or developed multicontext FPGAs4 to circumvent these limitations, but these approaches come with their own overheads. For details, see the sidebar, "FPGAs in Datacenters and Multicontext Reconfigurable Fabrics."

We developed the DRAM-Based Reconfigurable Acceleration Fabric (DRAF), a reconfigurable fabric that improves logic density and reduces power consumption through the use of dense, commodity DRAM arrays. DRAF is bit-level reconfigurable and has similar flexibility as conventional FPGAs. DRAF

70 Published by the IEEE Computer Society 0272-1732/17/$33.00 c 2017 IEEE FPGAs in Datacenters and Multicontext Reconfigurable Fabrics The advantages of spatial programmability and post-fabrication lookup tables5 or in separate global backup memories.6 Both reconfigurability have made field-programmable gate arrays (FPGAs) approaches consume significant on-chip area for the additional stor- the most successful and widely used reconfigurable fabric for accel- age, greatly reducing the single-context logic capacity. In addition, erator designs in various domains. FPGAs provide bit-level reconfi- loading the configuration from the backup memories can result in gurability through lookup tables, which can implement arbitrary long context switch latency. Because of their large overheads, multi- combinational logic functions by storing the function outputs in small context FPGAs have not been widely adopted by industry. static RAM arrays. The typical lookup table granularity at the moment is 6-bit input and 1-bit output. FPGAs also have flip-flops for data retiming and temporary storage. The lookup tables and flip-flops References are grouped into configurable logic blocks, which are organized into 1. A. Putnam et al., “A Reconfigurable Fabric for Accelerat- a 2D grid layout with other dedicated DSP and block RAM blocks. A ing Large-Scale Datacenter Services,” Proc. 41st Ann. bit-level, statically configurable interconnect supports communica- Int’l Symp. Computer Architecture (ISCA), 2014, pp. tion between these blocks. 13–24. FPGAs have recently been used in datacenters as an accelera- tion fabric for cloud applications.1–3 Datacenter servers often host 2. J. Hauswald et al., “Sirius: An Open End-to-End Voice multiple complex applications. Hence, multiple large FPGA devices and Vision Personal Assistant and Its Implications for are often necessary to provide sufficient resources for multiple Future Warehouse Scale Computers,” Proc. 20th Int’l large accelerators. Unfortunately, the tight power budget and the Conf. Architectural Support for Programing Languages focus on total cost of ownership make it impractical to introduce and Operating Systems (ASPLOS), 2015, pp. 223–238. expensive, power-hungry devices. To counteract these issues, 3. R. Polig et al., “Giving Text Analytics a Boost,” IEEE Microsoft proposed the Catapult design, using medium-sized Micro, vol. 34, no. 4, 2014, pp. 6–14. FPGAs with custom-designed interconnects linked between 4. T.R. Halfhill, “Tabula’s Time Machine,” Microprocessor 1 them. Although it improves performance, this approach increases Report, vol. 131, 2010. the system complexity and design integration cost, while still sup- 5. E. Tau et al., “A First Generation DPGA Implementation,” porting only a single application on the acceleration fabric. Proc. 3rd Canadian Workshop Field-Programmable Devi- Multicontext reconfigurable fabrics4 can support multitenancy ces (FPD), 1995, pp. 138–143. sharing by allowing rapid runtime switch between multiple designs (contexts) that are all mapped onto a single fabric, similar to hard- 6. S. Trimberger et al., “A Time-Multiplexed FPGA,” Proc. ware-supported thread switching in multithreaded processors. Such 5th IEEE Symp. FPGA-Based Custom Computing fabrics store all context configurations on chip, either in specialized Machines (FCCM), 1997, p. 22. includes architectural optimizations, such as area and cost efficiency by using very wide latency overlapping and multicontext sup- inputs (address) and outputs (data). 
Such port with fast context switching, that allow it wide granularity does not match the rela- to transform slow DRAM into a performant tively fine-grained logic functions in most reconfigurable fabric suitable for both data- real-world accelerator designs, resulting in centers and mobile platforms. underutilization of the DRAM-based lookup tables. Simply reducing the I/O width of DRAM arrays would forfeit the density ben- Challenges for DRAM-Based FPGAs efit, because the peripheral logic would now Dense DRAM technology provides a new dominate the lookup table area. Second, approach to realize high-density, low-power DRAM access speed is 30 times slower than reconfigurable fabrics necessary in constrained that of SRAM arrays (2 to 10 ns versus 0.1 environments such as datacenters and mobile to 0.5 ns). Without careful optimization, a devices. However, simply replacing the 30 times slower FPGA would hardly provide SRAM-based lookup tables in FPGAs with any acceleration over programmable process- DRAM-based cell arrays would lead to crit- ors. Third, implementing large and complex ical challenges in logic utilization, perform- logic functions often requires multiple lookup ance, and even operation correctness. First, tables to be chained together, which is prob- DRAM arrays are heavily optimized for lematic with DRAM lookup tables, because ...... MAY/JUNE 2017 71 ...... TOP PICKS

Table 1. Comparison of the DRAM-Based Reconfigurable Acceleration Fabric (DRAF) and a conventional field-programmable gate array (FPGA).

Features                     Conventional FPGA      DRAF
Lookup table technology      Static RAM (SRAM)      DRAM
Lookup table delay           Short (0.1 to 1 ns)    Long (1 to 10 ns)
Lookup table output width    Single bit             Multiple bits
Logic capacity               Moderate               Very high
No. of configurations        Single                 Multiple (4 to 8)
Power consumption            Moderate               Low
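Table 1's "multiple bits" output width and "multiple (4 to 8)" configurations come together in the lookup table organization shown later in Figure 1: a 6-bit row address activates one row of the currently selected context (MAT), and a separate 2-bit column select picks each of the four output bits. The sketch below is a purely behavioral model of that organization, with no DRAM timing, refresh, or destructive-read effects; the class and method names are illustrative assumptions, not part of the DRAF design.

```python
# Behavioral sketch only (no DRAM timing, refresh, or destructive reads):
# a multicontext DRAM-style lookup table with the organization described
# for Figure 1b/1c. The class and method names are illustrative.

class DrafLut:
    ROW_BITS = 6   # row address selects 1 of 64 rows in the active context
    GROUPS = 4     # four output bits, each with its own 2-bit column select

    def __init__(self, num_contexts=8):
        # One configuration array (MAT) per context; each row holds 16 bits
        # (4 groups x 4 bitlines).
        self.contexts = [[0] * (1 << self.ROW_BITS) for _ in range(num_contexts)]
        self.ctx_sel = 0  # context index counter (CTX_SEL in Figure 1b)

    def configure(self, ctx, row, row_bits):
        self.contexts[ctx][row] = row_bits & 0xFFFF

    def switch_context(self, ctx):
        # A context switch is just an index update; no configuration reload.
        self.ctx_sel = ctx

    def read(self, row_addr, col_sels):
        """row_addr: 6-bit row input; col_sels: four 2-bit column selects,
        one per output group. Returns the 4-bit output."""
        assert len(col_sels) == self.GROUPS
        row = self.contexts[self.ctx_sel][row_addr]   # 'activate' one row
        out = 0
        for g, col in enumerate(col_sels):            # per-group column mux
            out |= ((row >> (g * 4 + col)) & 1) << g
        return out

# Example: context 0 stores all-ones in row 5, so any column selection of
# that row returns 0b1111, while untouched rows return 0.
lut = DrafLut()
lut.configure(ctx=0, row=5, row_bits=0xFFFF)
print(bin(lut.read(5, [0, 1, 2, 3])))   # 0b1111
print(bin(lut.read(6, [0, 0, 0, 0])))   # 0b0
```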

the destructive nature of DRAM accesses (CLB) contains lookup tables made with requires explicit activation and precharge DRAM cell arrays and conventional flip- operations with precise timing. Without care- flops. The lookup table supports multiple ful management and coordination between on-chip configurations, each stored in one of lookup tables, the lookup table contents the contexts. The digital signal processing would be destroyed if accessed with an unsta- (DSP) block is used for complex arithmetic ble input. Finally, DRAM requires periodic operations, and the block RAM (BRAM) is refresh operations, which could negatively for on-chip data storage. They are similar to impact system power consumption and appli- those in FPGAs, but implemented in DRAM cation performance. technology, which makes their latency and area worse than the corresponding imple- DRAF Architecture mentation in a logic process. However, as DRAF leverages DRAM technology to imple- we will show, the DRAM-based lookup ment a reconfigurable fabric with higher logic table will also have much higher latency capacity and lower power consumption than than an SRAM-based lookup table; there- conventional FPGAs. Table 1 summarizes the fore, the increased latencies of DSP and keyfeaturesofDRAFascomparedtoacon- BRAM are not critical and do not dominate ventional FPGA. the overall design critical path. In addition, DRAF implements several key architec- the combinational DSP logic does not tural optimizations to overcome the challenges need to be replicated across contexts, thus discussed in the previous section. First, it uses offsetting its area overhead. The DRAM a specialized DRAM lookup table design that array in the BRAM block is similar to the achieves both high density and high utiliza- lookup tables, but with larger capacity, and tion by using a narrower output width and is used for data storage rather than design flexible column logic. Second, it uses a simple configurations. phase-based solution to specify the correct The blocks in DRAF are organized in a timing for each lookup table, and a three-way 2D grid layout similar to that of conventional delay overlapping technique to significantly FPGAs(seeFigure1a).TheDRAFintercon- reduce the impact of DRAM operation nect uses a simple and static time-multiplexing 5 latencies. Third, DRAF coordinates DRAM scheme to support multiple contexts. refresh in the device driver to reduce its power and latency impact. Finally, DRAF provides Configurable efficient multicontext support, which opens Figure 1b shows the structure of the CLB in up the opportunity for sharing the accelera- DRAF. The density advantage of DRAM tion fabric between multiple applications, technology allows a DRAF CLB to provide greatly decreasing the overall system cost. 10 times the logic capacity over an FPGA CLB within the same area. The CLB con- Overview tains a few DRAM-based lookup tables and Similar to an FPGA, DRAF uses three types the associated flip-flops and auxiliary multi- of logic blocks. The configurable logic block plexers. The inputs of the lookup table are ...... 72 IEEE MICRO EN CLB CLB CLB CLB EN EN EN


Figure 1. The DRAM-Based Reconfigurable Acceleration Fabric (DRAF) architecture. (a) The block layout of DRAF, similar to an FPGA. Block sizes and numbers can vary across devices. (b) The configurable logic block (CLB) in DRAF. As a typical example, a CLB contains two DRAM-based lookup tables and associated flip-flops (FFs) organized into eight contexts. Each lookup table has an 8-bit (6 bits for row and 2 bits for column) input and a 4-bit output. (c) Detailed view of one context in a DRAF lookup table with the context enable gate and specialized column logic. split into two parts and connected to the row amortizing the area overheads of the shared and column address ports of the DRAM peripheral logic (such as the row decoder). array, respectively. To support multicontext, To further increase the logic utilization each lookup table is divided into four to eight and flexibility, we apply a specialized column contexts, leveraging the hierarchical structure logic to allow for each output bit to be inde- in modern DRAM chips, in which the array pendently selected from the corresponding set is divided into DRAM MATs. Each context of bitlines. As Figure 1c shows, rather than in DRAF is a modified DRAM MAT (see sharing the same column address for all bits as Figure 1c). The decoded row address will acti- in conventional DRAM, we organize the 16 vate a single local wordline, which connects bitlines into four groups, and provide each the cells in that row to the bitlines. The data group a separate set of 2 bits to select one out- are then transferred to the sense-amplifiers put from the 4 bits. This additional level of and augmented to full-swing signals. multiplexing further reduces the output width The typical MAT width and height in to 4 bits, while allowing each bit to have parti- commodity DRAM devices are 512 to 1,024 ally different input bits, increasing the flexibil- cells. This implies a 9-bit-input, 512-bit- ity of the lookup table functionality. output lookup table, whereas a typical FPGA lookup table has a 6-bit input and 1-bit Multicontext Support output. To bring the DRAF lookup table DRAF seamlessly supports multicontext granularity close to the needs of real-world operations by storing each design configura- applications to increase the logic utilization, tion in one MAT and allowing for single- we make each MAT narrower, reducing its MAT access. Effectively, each MAT forms width to 8 to 16 bits. This offers a good one context across all lookup tables. The tradeoff between utilization and density. multiple contexts in one device can be used The aggregated row size of all contexts is still for different independent logic designs, each in the order of hundreds of bits, sufficiently of which accelerates one application running ...... MAY/JUNE 2017 73 ...... TOP PICKS

each lookup table, the context switch is instant LUT-1 LUT-2 by simply updating a context index counter (CTX SEL in Figure 1b) and the new context Reg LUT-4 Reg is ready to use in the next cycle.

Physical path LUT-3 Timing Management and Optimization DRAM access is destructive. Therefore, modern DRAM array organization introdu- ces a two-step access protocol. First, an entire DRAM row is activated and copied into the User clock sense-amplifiers according to the row address (activation); next, a subset of the sense- LUT-1 PRE ACT RST amplifiers are read or written on the basis of Δ = max(tPRE, tRST, troute) the column address. Because the original LUT-2 PRE ACT RST charge of the cells in the DRAM row is destroyed after the activation, the cells must be recharged or discharged to restore to the LUT-3 PRE ACT RST original values (restoration).7 Finally, we must precharge the bitlines and sense-amplifiers to LUT-4 PRE ACT RST prepare for the next activation (precharging). Phase 0 Phase 1 Phase 2 The explicit activation, restoration, and pre- charging create two major challenges for using DRAM in a reconfigurable fabric. Figure 2. The timing diagram and critical path for a chain of four DRAF First, when multiple DRAM-based lookup lookup tables. Each clock cycle is decomposed into three phases. LUT-1 and tables are chained together for a large logic LUT-3 are in phase 0, LUT-2 is in phase 1, and LUT-4 is in phase 2. D function, we must enforce a specific order for represents the three-way overlapping of restoration, precharging, and each lookup table access and the correspond- routing delays. ing timing constraints, to avoid loss of con- figuration data. DRAF uses a phase-based on the shared system. Alternatively, we can timing solution. We divide the accelerator split a single large and complicated design, design (user) cycle into multiple phases and such as websearch in Microsoft Catapult,3 assign a specific phase for each lookup table and map each part to one context in order to in the design (see Figure 2). By requiring the fit the entire design on a single device instead phase of a lookup table to be greater than the of using a multi-FPGA network. phases of all lookup tables producing its We leverage the hierarchical wordline input signals, we can guarantee the correct structure in DRAM to decouple the accesses access order. We also delay the precharge to each MAT by adding an enable AND gate operation into the next user cycle, ensuring to each local wordline driver,6 as shown in that the lookup table output is valid across Figure 1c. This lets us access only a single different phases (for example, from LUT-2 MAT corresponding to the current selected and LUT-3 to LUT-4 in Figure 2). The context, while disabling the other MATs. A phase assignment can be implemented by a context output multiplexer selects the enabled CAD tool using techniques similar to critical context for the lookup table output port. path finding. The phase information is The multicontext support in DRAF is par- stored in a small local controller per lookup ticularly efficient. First, the area overhead is table. There is no need for global coordina- negligible, because the peripheral logic (for tion at runtime. example, the row decoder) is shared between Second, the restoration and precharging of contexts. Second, the idle contexts (MATs) are DRAM arrays introduce high latency over- not accessed, introducing little dynamic power heads7 that limit the design frequency to no overhead, and they can be further power-gated more than 20 MHz. To hide these overheads, to reduce static power. 
Third, because the DRAF applies a three-way latency overlapping design configurations are stored in-place in without violating the timing constraints. As ...... 74 IEEE MICRO Figure 2 shows, we overlap the charge restora- density, multiple contexts, and low power tion of the source lookup table that produces consumption. These features make DRAF a signal, with the time for precharging the des- devices appropriate for both mobile and server tination lookup table of this signal and the applications in which one wants to introduce time for routing this signal between the two an FPGA device for acceleration without sig- lookup tables. Because routing delay is typi- nificantly impacting existing systems’ power cally the dominant latency component in budget, airflow, and cooling constraints. FPGAs,8 this critical optimization lets DRAF In datacenters that host public and private beonlytwotothreetimesslowerthanan clouds, servers are routinely shared by multi- FPGA and provides reasonable performance ple diverse applications to increase utiliza- speedup over programmable cores. tion. Different applications and different portions of each application (for example, DRAM Refresh RPC communication versus security versus DRAM requires periodic refresh due to cell main algorithm) require different accelera- leakage. We refresh all lookup tables in a tors. The long reconfiguration latency of DRAF chip concurrently using a shared row conventional FPGAs leads to nonnegligible address counter in each CLB and BRAM application downtime,3 decreasing the sys- block. This is easier for DRAF than for com- tem availability and making it expensive to modity DRAM chips in terms of power share the acceleration fabrics. consumption, becausethearraysaremuch In contrast, DRAF provides a shared fabric smaller in DRAF. All utilized contexts are that supports multiple accelerators by using refreshed simultaneously,andunusedcontexts different contexts, which can be viewed as are skipped. The DRAF device driver coordi- multiple independent FPGA instances that nates the refresh by holding on to new need to be used in a time-multiplexed fashion. requests, ignoring output data, and pausing The high logic density ensures that each indi- ongoing operations similar to processor pipe- vidual context has sufficient capacity for the line stalls. The internal states in the DRAF different accelerator designs. The instantane- design are held in the flip-flops and are not ous context switch ensures that the desired affected. The pause period is less than 1 lsper accelerator becomes immediately available to 64 ms, which is negligible even for latency- use when needed, with negligible overhead in critical applications in datacenters that require energy and no application downtime. Being millisecond-level tail latency. able to share the acceleration fabric can greatly reduce the overall system cost while Design Flow still enjoying the benefits of special-purpose The success of a reconfigurable fabric relies acceleration. heavily on the CAD toolchain support. Because DRAF uses the same primitives Evaluation (lookup tables, flip-flops, DSPs, and BRAMs) We evaluate DRAF as a reconfigurable fabric as modern FPGAs, its design flow is similar for datacenter applications using a wide set of to that of FPGAs with some mild tuning. 
accelerator designs for representative compu- First, the tool now needs to pack more logic tation kernels commonly used in large-scale per lookup table to utilize the larger lookup production services, including both latency- tables. Second, the primary optimization critical online services and batch data analytics. goal should be latency, because area is not a We use seven-input, two-output, eight-context scarce resource anymore. Third, the tool lookup tables in DRAF, because they achieve must enforce all timing requirements, includ- a good tradeoff between efficiency and logic ing the phases and DRAM timing constraints. utilization and flexibility. We compare DRAF Finally, the tool should take advantage of the to an FPGA similar to a Virtex-6 device multicontext support. and a programmable processor (Intel Xeon E5-2630 at 2.3 GHz). For a fair comparison, Use of DRAF for Datacenter Accelerators the accelerator designs are synthesized using DRAF trades off some of the potential per- the same open-source CAD tools for both the formance of FPGAs to achieve high logic conventional FPGA and DRAF. The DRAF ...... MAY/JUNE 2017 75 ...... TOP PICKS


Figure 3. Area and power comparison between DRAF and a conventional FPGA. (a, b) Device-level comparison. (c, d) Comparison after real application mapping.

results are conservative compared to the state-of-the-art Virtex-UltraScaleþ FPGAs programmable core baselines, because highly that use a much more recent 16-nm tech- optimized commercial tools are likely to gen- nology. The power consumption advantage is erate more efficient mappings of accelerator also remarkable. Although the FPGA power designs on the DRAF fabric. Our full paper can easily exceed 10 W, DRAF consumes contains a complete description of our only about 1 to 2 W. methodology.9 Figures 3c and 3d further compare DRAF to FPGA using real accelerator designs. We Area and Power map each accelerator to one of the eight avail- Figures 3a and 3b compare the area and peak able contexts in DRAF. The other unused power consumption of DRAF and FPGA contexts still contribute to the area, consume devices with different logic capacities meas- leakage power, and introduce a slight access ured in 6-bit-input lookup table equivalents latency penalty in the DRAF lookup tables. for 45-nm technology. For a fixed logic On average, each accelerator design occupies capacity, an eight-context DRAF provides 19percentlessareaonDRAFthanonthe more than 10 times area improvement and FPGA, roughly matching the 10-times area roughly 50 times power consumption reduc- advantage if we consider the seven additional tion. If we target a cost-effective device size contexts available within the area occupied of 75 mm2, an FPGA can pack roughly in DRAF. DRAF’s area advantage stems pri- 200,000 lookup tables, whereas DRAF can marily from using lookup tables with wider have more than 1.5 million lookup tables, inputs and outputs; these lookup tables a logic capacity comparable to that of the can realize larger functions and also reduce ...... 76 IEEE MICRO pressure on the configurable interconnect. The gmm design uses more area in DRAF CPU 4 CPUs FPGA DRAF than the FPGA, because it requires exponen- 3 tial and logarithmic functions that are not 10 currently supported in our DSP blocks. 102 Regarding power, the FPGA power con- sumption is dominated by the routing fabric, 101 especially for larger designs. DRAF provides a 3.2 times power improvement on average, 100 resulting from both the more efficient Normalized throughput DRAM-based lookup tables and the savings 10–1 aes gmm

on routing due to denser packing. gemm harris stemmer stencil viterbi Performance backprop Finally, we compare the performance of Figure 4. Performance comparison between single-core, multicore, FPGA, accelerator designs mapped onto DRAF and and DRAF using representative datacenter application kernels. Assume FPGA devices to that of optimized software ideal scaling from single-core to multicore platform. running on general-purpose programmable cores. For the programmable cores, we opti- mistically assume ideal linear scaling to four upcoming dense nonvolatile memory tech- cores, owing to the abundant request-level nologies, such as spin-transfer torque RAM, parallelism in datacenter services. The chip exhibit good density scaling and have better 2 size for FPGA and DRAF is fixed at 75 mm . static power characteristics compared to Figure 4 shows that both FPGA and DRAM. An exciting research direction is to DRAF outperform the single-core baseline, extend DRAF to exploit the advantages and on average by 37 and 13 times, respectively. address the shortcomings of new memory When compared to four cores with ideal technologies in order to produce acceleration speedup, DRAF still exhibits significant fabrics with low area and power cost. MICRO speedup of 3.4 times while consuming just 0.63 W, compared to 7 to 10 W of a single ...... core in Xeon-class processors. Overall, these References results establish DRAF as an attractive and 1. M. Horowitz, “Computing’s Energy Prob- flexible acceleration fabric for cost (area) and lem (and What We Can Do About it),” Proc. power constrained environments. IEEE Int’l Solid-State Circuits Conf. (ISSCC), 2014, pp. 10–14. RAF is the first complete design to use 2. R. Tessier, K. Pocek, and A. DeHon, D dense, commodity DRAM technology “Reconfigurable Computing Architectures,” to implement a reconfigurable fabric with Proc. IEEE, vol. 103, no. 3, 2015, pp. 332–354. significant logic density and power improve- 3. A. Putnam et al., “A Reconfigurable Fabric ments over conventional FPGAs. DRAF for Accelerating Large-Scale Datacenter provides a low-cost solution for multicontext Services,” Proc. 41st Ann. Int’l Symp. acceleration fabrics, which are expected to Computer Architecture (ISCA), 2014, pp. become widely used in future multitenant 13–24. cloud and mobile systems. Looking forward, it is important to tune CAD tools and run- 4. S. Trimberger et al., “A Time-Multiplexed time management systems to efficiently map FPGA,” Proc. 5th IEEE Symp. FPGA-Based accelerator designs on DRAF, taking full Custom Computing Machines (FCCM), advantage of its high-density and multicon- 1997, p. 22. text features. 5. B. Van Essen et al., “Static versus Sched- The techniques that allow DRAF to turn uled Interconnect in Coarse-Grained Recon- dense storage technology to cost-effective figurable Arrays,” Proc. Int’l Conf. Field reconfigurable fabrics are also applicable to Programmable Logic and Applications (FPL), memory technologies beyond DRAM. The 2009, pp. 268–275...... MAY/JUNE 2017 77 ...... TOP PICKS

6. A.N. Udipi et al., “Rethinking DRAM Design puter science and engineering from Penn- and Organization for Energy-Constrained sylvania State University. Contact him at Multi-cores,” Proc. 37th Ann. Int’l Symp. Com- [email protected]. puter Architecture (ISCA), 2010, pp. 175–186. 7. Y.H. Son et al., “Reducing Memory Access Krishna T. Malladi is a staff architect in the Latency with Asymmetric DRAM Bank Memory Solutions Lab in the US R&D cen- Organizations,” Proc. 40th Ann. Int’l Symp. ter at Samsung Semiconductor. His research Computer Architecture (ISCA), 2013, pp. interests include next-generation memory 380–391. and storage systems for datacenter platforms. Malladi received a PhD in electrical engi- 8. V. Betz et al., Architecture and CAD for neering from Stanford University. Contact Deep-Submicron FPGAs, Kluwer Academic him at [email protected]. Publishers, 1999. 9. M. Gao et al., “DRAF: A Low-Power DRAM- Hongzhong Zheng is a senior manager in based Reconfigurable Acceleration Fabric,” the Memory Solutions Lab in the US R&D Proc. ACM/IEEE 43rd Ann. Int’l Symp. Com- center at Samsung Semiconductor. His puter Architecture (ISCA), 2016, pp. 506–518. research interests include novel memory sys- tem architecture with DRAM and emerging Mingyu Gao is a PhD student in the Depart- memory technologies, processing in-memory ment of Electrical Engineering at Stanford architecture for machine learning applica- University. His research interests include tions, computer architecture and system energy-efficient computing and memory performance modeling, and energy-efficient systems, specifically on practical and effi- computing system designs. Zheng received a cient near-data processing for data-intensive PhD in electrical and computer engineering analytics applications, high-density and low- from the University of Illinois at Chicago. power reconfigurable architectures for data- He is a member of IEEE and ACM. Contact center services, and scalable accelerators for him at [email protected]. large-scale neural networks. Gao received an MS in electrical engineering from Stanford Bob Brennan is the senior vice president of University. He is a student member of IEEE. the Memory Solutions Lab in the US R&D Contact him at [email protected]. center at Samsung Semiconductor. He has led numerous research projects on memory Christina Delimitrou is an assistant profes- system architecture, SoC architecture, CPU sor in the Departments of Electrical and validation, and low-power system design. Computer Engineering and Computer Sci- BrennanreceivedanMSinelectricalengi- ence at Cornell University, where she works neering from the University of Virginia. Con- on computer architecture and distributed sys- tact him at [email protected]. tems. Her research interests include resource- efficient datacenters, scheduling and resource Christos Kozyrakis is an associate professor management with quality-of-service guaran- in the Departments of Electrical Engineering tees, disaggregated cloud architectures, and and Computer Science at Stanford Univer- cloud security. Delimitrou received a PhD in sity, where he investigates hardware architec- electrical engineering from Stanford Univer- tures, system software, and programming sity. She is a member of IEEE and ACM. models for systems ranging from cell phones Contact her at [email protected]. to warehouse-scale datacenters. 
His research interests include resource-efficient cloud Dimin Niu is a senior memory architect in computing, energy-efficient computing and the Memory Solutions Lab in the US R&D memory systems for emerging workloads, center at Samsung Semiconductor. His and scalable operating systems. Kozyrakis research interests include computer archi- has a PhD in computer science from the tecture, emerging nonvolatile memory tech- University of California, Berkeley. He is a nologies, and processing near-/in-memory fellow of IEEE and ACM. Contact him at architecture. Niu received a PhD in com- [email protected]...... 78 IEEE MICRO

...... AGILE PAGING FOR EFFICIENT MEMORY VIRTUALIZATION

...... VIRTUALIZATION PROVIDES BENEFITS FOR MANY WORKLOADS, BUT THE ASSOCIATED

OVERHEAD IS STILL HIGH. THE COST COMES FROM MANAGING TWO LEVELS OF ADDRESS

TRANSLATION WITH EITHER NESTED OR SHADOW PAGING. THIS ARTICLE INTRODUCES

AGILE PAGING, WHICH COMBINES THE BEST OF BOTH NESTED AND SHADOW PAGING

WITHIN A PAGE WALK TO EXCEED THE PERFORMANCE OF BOTH TECHNIQUES.

...... Two important trends in comput- duced significant performance overhead due Jayneel Gandhi ing are evident. First, computing is becoming to TLB misses causing hardware page walks. more data-centric, wherein low-latency access Even the TLBs in the recent Intel Skylake pro- Mark D. Hill to a very large amount of data is critical. Sec- cessor architecture cover only 9 percent of a ond, virtual machines are playing an increas- 256-Gbyte memory. This mismatch between Michael M. Swift ingly critical role in server consolidation, TLB reach and memory size will keep growing security, and fault tolerance as substantial with time. University of amounts of computing migrate to shared In addition, our experiments show virtu- resources in cloud services. Because software alization increases page-walk latency by two Wisconsin–Madison accesses data using virtual addresses, fast to three times compared to unvirtualized exe- address translation is a prerequisite for effi- cution. The overheads are due to two levels cient data-centric computation and for pro- of page tables: one in the guest virtual viding the benefits of virtualization to a wide machine (VM) and the other in the host vir- range of applications. Unfortunately, the tual machine monitor (VMM). There are growth in physical memory sizes is exceeding two techniques to manage these two levels of the capabilities of the most widely used page tables: nested paging and shadow pag- virtual memory abstraction—paging—which ing. In this article, we explain the tradeoffs has worked for decades. between the two techniques that intrinsically Translation look-aside buffer (TLB) sizes lead to high overheads of virtualizing mem- have not grown in proportion to growth in ory. With current hardware and software, the memory sizes, causing a problem of limited overheads of virtualizing memory are hard to TLB reach: the fraction of physical memory minimize, because a VM exclusively uses one that TLBs can map reduces with each hardware technique or the other. This effect, com- generation. There are two key factors causing bined with limited TLB reach, is detrimental limited TLB reach: first, TLBs are on the crit- for many virtualized applications and makes ical path of accessing the L1 cache and thus virtualization unattractive for big-memory have remained small in size, and second, mem- applications.1 ory sizes and the workload’s memory demands This article addresses the challenge of have increased exponentially. This has intro- high overheads of virtualizing memory in a ......

Table 1. Tradeoffs provided by memory virtualization techniques as compared to base native. Agile paging exceeds the best of both worlds.

Properties                                 Base native        Nested paging           Shadow paging                              Agile paging
Translation look-aside buffer (TLB) hit    Fast (gVA→hPA)     Fast (gVA→hPA)          Fast (gVA→hPA)                             Fast (gVA→hPA)
Memory accesses per TLB miss on average    4                  24                      4                                          Approximately 4 to 5
Page table updates                         Fast: direct       Fast: direct            Slow: mediated by the virtual              Fast: direct
                                                                                      machine monitor (VMM)
Hardware support                           Native page walk   Nested + native         Native page walk                           Nested + native page
                                                              page walk                                                          walk with switching

*gVA→hPA: guest virtual address to host physical address.
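The 24-access figure for nested paging in Table 1 follows from the two-dimensional page walk: each of the four guest page-table levels is read, and each guest-physical pointer it produces (including the final guest physical address) must itself be translated through the four-level host page table. A small sketch of that arithmetic, assuming four-level x86-64 page tables on both sides, is shown below; the function names are illustrative.

```python
# Counting memory references per TLB miss, assuming 4-level page tables
# for both guest and host (x86-64 style). Function names are illustrative.

def native_walk_refs(levels=4):
    # Unvirtualized (or shadow-paging) walk: one reference per level.
    return levels

def nested_walk_refs(guest_levels=4, host_levels=4):
    # Two-dimensional walk: each guest page-table pointer (one per guest
    # level) plus the final guest physical address is translated through
    # the host page table, and each guest level is also read once.
    return (guest_levels + 1) * host_levels + guest_levels

print(native_walk_refs())   # 4  (base native and shadow paging rows)
print(nested_walk_refs())   # 24 (nested paging row; 5 + 5 + 5 + 5 + 4 in Figure 1a)
```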

comprehensive manner. It proposes a hard- a complete translation: one points to the guest ware/software codesign called agile paging page table (gcr3 in x86-64), and the other for fast virtualized address translation to points to the host page table (ncr3). address the needs of many different big- In the best case, the virtualized address memory workloads. Our goal, originally set translation has a hit in the TLB to directly forth in our paper for the 43rd Interna- translate from gVA to hPA with no overheads. tional Symposium on Computer Architec- In the worst case, a TLB miss needs to per- ture,2 is to minimize memory virtualization form a nested page walk that multiplies over- overheads by combining the hardware (nested heads vis-a-vis native (that is, unvirtualized 4- paging) and software (shadow paging) techni- Kbyte pages), because accesses to the guest ques, while exceeding the best performance of page table also require translation by the host both individual techniques. page table. Note that extra hardware is required for nested page walk beyond the base Techniques for Virtualizing Memory native page walk. Figure 1a depicts virtualized A key component enabling virtualization is address translation for x86-64. It shows its support for virtualizing memory with two how page table memory references grow levels of page tables: from a native 4 to a virtualized 24 references. Although page-walk caches can elide some of gVA!gPA: guest virtual address these references,3 TLB misses remain sub- (gVA) to guest physical address trans- stantially more expensive with virtualization. lation (gPA) via a per-process guest Despite the expense, nested paging allows OS page table. fast, direct updates to both page tables without gPA!hPA: guest physical address to any VMM intervention. host physical address (hPA) transla- tion via a per-VM host page table. Shadow Paging Table 1 shows the tradeoffs between nested Shadow paging is a lesser-used software tech- paging and shadow paging, the two techniques nique to virtualize memory. With shadow commonly used to virtualize memory, and paging, the VMM creates on demand a compares them to our agile paging proposal. shadow page table that holds complete trans- lations from gVA!hPA by merging entries Nested Paging from the guest and host tables. Nested paging is a widely used hardware tech- In the best case, as in nested paging, the niquetovirtualizememory.Theprocessorhas virtualized address translation has a hit in the two hardware page-table pointers to perform TLB to directly translate from gVA to hPA ...... MAY/JUNE 2017 81 ...... TOP PICKS

same as a base native walk. For example, x86- gVA 64 requires up to four memory references on a gVA TLB miss for shadow paging (see Figure 1b). gCR3 In addition, as a software technique, there is Guest page table no need for any extra hardware support for page walks beyond base native page walk. gPA hPA Even though TLB misses cost the same as native execution, this technique does not allow Host page table direct updates to the page tables, because the

gPA gPA gPA gPA gPA ncr3 ncr3 ncr3 ncr3 ncr3 shadow page table needs to be kept consistent hPA Memory accesses 5 + 5 + 5 + 5 + 4 = 24 with guest and host page tables.4 These updates (a) occur because of various optimizations, such as page sharing, page migrations, setting accessed gVA and dirty bits, and copy-on-write. Every page

Guest page table update requires a costly VMM interven- table (read only) gVA tion to fix the shadow page table by invalidat- ing or updating its entries, which causes Shadow page sCR3 table significant overheads in many applications.

hPA Host page table Memory accesses = 4 Opportunity Shadow paging reduces overheads of virtual- hPA izing memory to that of native execution if (b) the address space does not change. Our key observation is that empirically page tables are Figure 1. Nested paging has a longer page walk as compared to shadow not modified uniformly: some regions of an paging, but nested paging allows fast, in-place updates whereas shadow address space see far more changes than paging requires slow, mediated updates (guest page tables are read-only). others, and some levels of the page table, (a) Nested paging. (b) Shadow paging. such as the leaves, are updated far more often than the upper-level nodes. For example, code regions might see little change over the life of a process, whereas regions that mem- Guest virtual address space ory-map files might change frequently. Our Fully static address space experiments showed that generally less than Shadow paging preferred 1 percent and up to 5 percent of the address space changes in a 2-second interval of guest Fully dynamic address space Nested paging preferred application execution (see Figure 2).

Only a small fraction of Proposed Agile Paging Design address space is dynamic We propose agile paging as a lightweight sol- Opportunity for agile paging ution to reduce the cost of virtualized address translation. We use the opportunity we just Figure 2. Opportunity that agile paging uses to improve performance. describedtocombinethebestofshadowand Portions in white denote static portions, stripes denote dynamic portions, nested paging by using and solid gray denotes unallocated portions of the guest virtual address space. shadow paging with fast TLB misses for the parts of the guest page table that remain static, and with no overheads. On a TLB miss, the hard- nested paging for fast in-place ware performs a native page walk on the updates for the parts of the guest shadow page table. The native page table page tables that dynamically change. pointer points to the shadow page table (scr3). Thus, the memory references In the following subsections, we describe required for shadow page table walk are the the mechanisms that enable us to use both ...... 82 IEEE MICRO constituent techniques at the same time for a guest process, and we discuss policies gVA used by the VMM to select shadow or nested mode. Shadow page table Guest page table Mechanism: Hardware Support

Agile paging allows both techniques for the sCR3 1 gPA 1 same guest process—even on a single address translation—using modest hardware support Host to switch between the two. Agile paging has page table three hardware architectural page table pointers: one each for shadow, guest, and hPA host page tables. If agile paging is enabled, (a) virtualized page walks start in shadow paging and then switch, in the same page walk, to gVA gVA nested paging if required. To allow fine-grained switching from sCR3 gCR3 shadow paging to nested paging on any address translation at any level of the guest hPA page table, the shadow page table needs to logically support a new switching bit per Switch modes at level 4 of guest page table page table entry. This notifies the hardware gPA ncr3 page table walker to switch from shadow to Memory accesses 1 + 1 + 1 + 5 = 8 nested mode. When the switching bit is set (b) in a shadow page table entry, the shadow page table holds the hPA (pointer) of the Figure 3. Agile paging support. (a) Mechanism for agile paging: when the next guest page table level. Figure 3a depicts switching bit is set, the shadow page table points to the next level of the the use of the switching bit in the shadow guest page table. (b) Example page walk possible with agile paging, wherein page table for agile paging. Figure 3b shows it switches to nested mode at level four of the guest page table. a page walk that is possible with agile paging. The switching is allowed at any level of the page table. Shadow page table (gVA!hPA). For all guest Mechanism: VMM Support processes with agile paging enabled, the VMM Like shadow paging, the VMM for agile pag- creates and maintains a shadow page table. ing manages three page tables: guest, shadow, However, with agile paging, the shadow page and host. Agile paging’s page table manage- table is partial and cannot translate all gVAs ment is closely related to that of shadow pag- fully. The shadow page table entry at each ing, but there are subtle differences. switching point holds the hPA of the next level of the guest page table with the switch- Guest page table (gVA!gPA). With all ing bit set. This enables hardware to perform approaches, the guest page table is created and the page walk correctly with agile paging modified by the guest OS for every guest pro- using both techniques. cess. The VMM in shadow paging, though, controls access to the guest page table by Host page table (gPA!hPA). The VMM marking its pages read-only. With agile pag- manages the host page table to map from ing, we leverage the support for marking guest gPA to hPA for each VM. As with shadow page tables read-only with one subtle change. paging, the VMM merges this page table TheVMMmarksasread-onlyjusttheparts with the guest page table to create a shadow of the guest page table covered by the partial page table. The VMM must update the shadow page table. The rest of the guest page shadow page table on any changes to the host table (handled by nested mode) has full read- page table. The host page table is updated write access. only by the VMM, and during that update, ...... MAY/JUNE 2017 83 ...... TOP PICKS

in shadow mode, requiring VMM interven- Start tions for updates). Our experiments showed Write to page table Shadow Shadow that the updates to a single page of a guest (VMM trap) (1 write) page table are bimodal in a 2-second time interval: only one update or many updates Move Timeoutnon-dirty (for example, 10, 50, 100). Thus, we use a two-update policy to move a page of the Use dirty bits to track table Write to page guest page table from shadow mode to nested writes to guest page table (VMM traps) Nested mode: two successive updates to a page trig- ger a mode change. This allows all subse- Subsequent writes quent updates to frequently changing parts of (no VMM traps) the guest page table to proceed without VMM interventions. Figure 4. Policy to move a page between nested mode and shadow mode in Nested!Shadow mode. Once we move parts agile paging. of the guest page table to the nested mode, all updates to those parts happen without any VMM intervention. Thus, the VMM cannot the shadow page table is kept consistent by track if the parts under the nested mode have invalidating affected entries. stopped changing and thus can be moved back to the shadow mode. So, we use dirty Policy: What Level to Switch? bits on the pages containing the guest page Agile paging provides a mechanism for vir- table as a proxy to find these static parts of tualized address translation that starts in the guest page table after every time interval, shadow mode and switches at some level of and we switch those parts back to the shadow the guest page table to nested mode. The mode. Figure 4 depicts the policy used by purpose of a policy is to determine whether agile paging. to switch from shadow to nested mode for a To summarize, the changes to the hard- single virtualized address translation and at ware and VMM to support agile paging are which level of the guest page table the switch incremental, but they result in a powerful, should be performed. efficient, and robust mechanism. This mech- The ideal policy would determine that anism, when combined with our proposed page table entries are changing rapidly policies, helps the VMM detect changes enough and the cost of corresponding to the page tables and intelligently make a updates to the shadow page table outweighs decision to switch modes and thus reduce the benefit of faster TLB misses in shadow overheads. mode, and so translation should use nested Our original paper has more details on mode. The policy would quickly detect the the agile paging design to integrate page walk dynamically changing parts of the guest caches, perform guest context switches, set page table and switch them to nested mode accessed/dirty bits, and handle small or while keeping the rest of the static parts of short-lived processes. It also describes possi- 2 the guest page table under shadow mode. ble hardware optimizations. To achieve this goal, a policy will move some parts of the guest page table from Methodology shadow to nested mode and vice versa. We To evaluate our proposal, we emulate our assume that the guest process starts in full proposed hardware with and proto- shadow mode, and we propose a simple algo- type our software in Linux KVM.5 We rithm for when to change modes. selected workloads with poor TLB perform- ance from SPEC 2006,6 BioBench,7 Parsec,8 Shadow!Nested mode. 
We start a guest and big-memory workloads.9 We report process in the shadow mode to allow the overheads using a combination of hardware VMM to track all updates to the guest page performance counters from native and vir- table (the guest page table is marked read only tualized application executions, along with ...... 84 IEEE MICRO TLB performance emulation using a modi- fied version of BadgerTrap10 with a linear 90 28% performance model. Our original paper has 80 70% more details on our methodology, results, 70 11% and analysis.2 60 50 2% 19% 40 30% 30 6% Evaluation 18% 2% 20 Figure 5 shows the execution time overheads 10 4% 6% 3% 0

associated with page walks and VMM inter- Execution time overheads (%) BNSA BNSA BNSA BNSA ventions with 4-Kbyte pages and 2-Mbyte graph500 memcached canneal dedup pages (where possible). For each workload, (a) four bars show results for base native paging

(B), nested paging (N), shadow paging (S), 90 and agile paging (A). Each bar is split into 80 68% two segments. The bottom represents the 70 60 overheads associated with page walks, and 50 the top dashed segment represents the over- 40 13% 30 14% heads associated with VMM interventions. 4% 20 2% 5% Agile paging outperforms its constituent 10%14% 2% 10 3% 6% 2% techniques for all workloads and improves 0 performance by 12 percent over the best of Execution time overheads (%) BNSA BNSA BNSA BNSA nested and shadow paging on average, and graph500 memcached canneal dedup performs less than 4 percent slower than (b) unvirtualized native at worst. In our original paper,2 we show that more than 80 percent Figure 5. Execution time overheads for (a) 4-Kbyte pages and (b) 2-Mbyte of TLB misses are covered under full shadow pages (where possible) with base native (B), nested paging (N), shadow mode, thus having four memory accesses paging (S), and agile paging (A) for four representative workloads. All for TLB misses. Overall, the average number virtualized execution bars are in two parts: the bottom solid parts represent of memory accesses for a TLB miss comes page walk overheads, and the top hashed parts represent VMM intervention down from 24 to between 4 and 5 for all overheads. The numbers on top of the bars represent the slowdown with workloads. respect to the base native case.
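The 4-to-5 average reported above follows from most TLB misses being served entirely by the shadow page table. To tie the numbers back to the hardware mechanism, the sketch below counts the memory references of a single walk that starts in the shadow page table and falls back to a nested walk at the guest page-table level where the switching bit is set; it reproduces the four-reference full-shadow case and the eight-reference example of Figure 3b. It is a simplified reference-count model under stated assumptions, not the hardware design, and the function name is illustrative.

```python
# Simplified reference-count model of an agile-paging page walk, assuming
# 4-level guest and host page tables. The function name is illustrative.

def agile_walk_refs(switch_level=None, guest_levels=4, host_levels=4):
    """Memory references for one TLB miss under agile paging.

    switch_level: guest page-table level (1..guest_levels) whose shadow
    entry has the switching bit set, or None if the whole translation is
    covered by the shadow page table.
    """
    if switch_level is None:
        return guest_levels                    # native-speed shadow walk
    shadow_refs = switch_level - 1             # levels walked in the shadow table
    remaining = guest_levels - shadow_refs     # guest levels walked in nested mode
    # Each remaining guest level is read once, and every guest-physical
    # pointer it produces (including the final data address) is translated
    # through the host page table.
    nested_refs = remaining * (1 + host_levels)
    return shadow_refs + nested_refs

print(agile_walk_refs(None))  # 4: full shadow mode (most translations)
print(agile_walk_refs(4))     # 8: switch at level 4, as in Figure 3b
```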

e and others have found that the to 35 memory references, and emerging non- W overheads of virtualizing memory volatile memory technology promises vast can be high. This is true in part because guest physical memories. MICRO processes currently must choose between nesting paging with slow nested page table Acknowledgments walks and shadow paging, in which page This work is supported in part by the US table updates cause costly VMM interven- National Science Foundation (CCF- tions. Ideally, one would want to use nested 1218323, CNS-1302260, CCF-1438992, paging for addresses and page table levels that and CCF-1533885), Google, and the Uni- change and use shadow paging for addresses versity of Wisconsin (John Morgridge chair and page table levels that are relatively static. and named professorship to Hill). Hill and Our proposal—agile paging—approaches Swift have significant financial interests in this ideal. With agile paging, a virtualized AMD and Microsoft, respectively. address translation usually starts in shadow mode and then switches to nested mode only ...... if required to avoid VMM interventions. References Moreover, agile paging’s benefits could be 1. J. Buell et al., “Methodology for Perform- greater in the future, because Intel has ance Analysis of VMware vSphere Under recently added a fifth level to its page table11 Tier-1 Applications,” VMware Technical J., that makes a virtualized nested page walk up vol. 2, no. 1, 2013, pp. 19–28...... MAY/JUNE 2017 85 ...... TOP PICKS

2. J. Gandhi, M.D. Hill, and M.M. Swift, PhD in computer sciences from the Univer- “Agile Paging: Exceeding the Best of sity of Wisconsin–Madison, where he com- Nested and Shadow Paging,” Proc. 43rd pleted the work for this article. He is a mem- Int’l Symp. Computer Architecture, 2016, ber of ACM. Contact him at gandhij@ pp. 707–718. vmware.com. 3. R. Bhargava et al., “Accelerating Two- Dimensional Page Walks for Virtualized Mark D. Hill is the John P. Morgridge Systems,” in Proceedings of the 13th Inter- Professor, Gene M. Amdahl Professor of national Conference on Architectural Sup- Computer Sciences, and Computer Sciences port for Programming Languages and Department Chair at the University of Wis- Operating Systems, 2008, pp. 26–35. consin–Madison, where he also has a cour- tesy appointment in the Department of 4. K. Adams and O. Agesen, “A Comparison Electrical and Computer Engineering. His of Software and Hardware Techniques for research interests include parallel computer x86 Virtualization,” Proc. 12th Int’l Conf. system design, memory system design, and Architectural Support for Programming Lan- computer simulation. Hill has a PhD in guages and Operating Systems, 2006, pp. computer science from the University of 2–13. California, Berkeley. He is a fellow of IEEE 5. A. Kivity et al., “KVM: The Linux Virtual and ACM. He serves as vice chair of the Machine Monitor,” Proc. Linux Symp., vol. Computer Community Consortium. Con- 1, 2007, pp. 225–230. tact him at [email protected]. 6. J.L. Henning, “SPEC CPU2006 Benchmark Descriptions,” SIGARCH Computer Archi- Michael M. Swift is an associate professor tecture News, vol. 34, no. 4, 2006, pp. in the Computer Sciences Department at 1–17. the University of Wisconsin–Madison. His 7. K. Albayraktaroglu et al., “BioBench: A research interests include operating system Benchmark Suite of Bioinformatics reliability, the interaction of architecture Applications,” Proc. IEEE Int’l Symp. Per- and operating systems, and device driver formance Analysis of Systems and Soft- architecture. Swift has a PhD in computer ware, 2005, pp. 2–9. science from the University of Washington. He is a member of ACM. Contact him at 8. C. Bienia et al., “The Parsec Benchmark [email protected]. Suite: Characterization and Architectural Implications,” Proc. 17th Int’l Conf. Parallel Architectures and Compilation Techniques, 2008, pp. 72–81. 9. A. Basu et al., “Efficient Virtual Memory for Big Memory Servers,” Proc. 40th Ann. Int’l Symp. Computer Architecture, 2013, pp. 237–248. 10. J. Gandhi et al., “BadgerTrap: A Tool to Instrument x86-64 TLB Misses,” SIGARCH Computer Architecture News, vol. 42, no. 2, 2014, pp. 20–23. 11. 5-Level Paging and 5-Level EPT, white paper, Intel, Dec. 2016.

Jayneel Gandhi is a research scientist at VMware Research. His research interests include computer architecture, operating systems, memory system design, virtual memory, and virtualization. Gandhi has a

...... TRANSISTENCY MODELS: MEMORY ORDERING AT THE HARDWARE–OS INTERFACE

......

THIS ARTICLE INTRODUCES THE TRANSISTENCY MODEL, A SET OF MEMORY ORDERING

RULES AT THE INTERSECTION OF VIRTUAL-TO-PHYSICAL ADDRESS TRANSLATION AND

MEMORY CONSISTENCY MODELS. USING THEIR COATCHECK TOOL, THE AUTHORS SHOW

HOW TO RIGOROUSLY MODEL, ANALYZE, AND VERIFY THE CORRECTNESS OF A GIVEN

SYSTEM’S MICROARCHITECTURE AND SOFTWARE STACK WITH RESPECT TO ITS TRANSISTENCY MODEL SPECIFICATION.

Daniel Lustig
Princeton University

Geet Sethi ...... Modern computer systems con- the hardware and the operating system sist of heterogeneous processing elements (OS) and require careful coordination Abhishek Bhattacharjee (CPUs, GPUs, accelerators) running multi- between the two. Although MCMs at the ple distinct layers of software (user code, instruction set architecture (ISA) and pro- Rutgers University libraries, operating systems, hypervisors) on gramming language levels are becoming top of many distributed caches and memo- increasingly well understood,1–5 a key veri- Margaret Martonosi ries. Fortunately, most of this complexity is fication challenge is that events within sys- hidden away underneath the virtual memory tem layers can behave differently than the Princeton University (VM) abstraction presented to the user code. “normal” accesses described by the ISA or However, one aspect of that complexity does programming language MCM. For exam- pierce through: a typical memory subsystem ple, on the x86-64 architecture, which will buffer, reorder, or coalesce memory implements the relatively strong total store requests in often unintuitive ways for the ordering (TSO) memory model,5 page table sake of performance. This results in essen- walks are automatically issued by hardware, tially all real-world hardware today exposing can happen at any time, and often are not aweakmemoryconsistencymodel(MCM) ordered even with respect to fences. Even to concurrent code that communicates worse is that while an ISA by design tends to through shared VM. remain stable across processor generations, The responsibilities for maintaining the microarchitectural phenomena often change VM abstraction and for enforcing the mem- dramatically from one generation to the next. ory consistency model are shared between For example, CPUs today are experimenting ......

For example, CPUs today are experimenting with features such as concurrent page table walkers and translation lookaside buffer (TLB) coalescing that improve performance at the cost of adding significant complexity.6 Consequently, VM and MCM specifications and implementations tend to be bug-prone and are only becoming more complex as systems become increasingly heterogeneous and distributed.

Bogdan Romanescu and colleagues were the first to distinguish between MCMs meant for virtual addresses (VAMC) and those for physical addresses (PAMC).7 They considered hardware to be responsible for the latter, and a combination of hardware and OS for the former. However, as we show in this article, not even VAMC and PAMC capture the full intersection of address translation and memory ordering. Even machines that implemented the strictest model they considered—virtual address sequential consistency (SC-for-VAMC)—may be prone to surprising ordering bugs related to the checking of metadata at a different virtual and physical address from the data being accessed. We therefore coin the term memory transistency model to refer to any set of memory ordering rules that explicitly account for these broader virtual-to-physical address translation issues.

To enable rigorous analysis of transistency models and their implementations, we developed a tool called COATCheck for verifying memory ordering enforcement in the context of virtual-to-physical address translation. (COAT stands for consistency ordering and address translation.) COATCheck lets users reason about the ordering implications of system calls, interrupts, microcode, and so on at the microarchitecture, architecture, and OS levels. System models are built in COATCheck using a domain-specific language (DSL) called lspec (pronounced "mu-spec"), within which each component in a system (for example, each pipeline stage, each cache, and each TLB) can independently specify its own contribution to memory ordering using the languages of first-order logic and microarchitecture-level happens-before (lhb) graphs.8,9 This allows COATCheck verification to be modular and flexible enough to adapt to the fast-changing world of modern heterogeneous systems.

Our contributions are as follows. First, we developed a comprehensive methodology for specifying and statically verifying memory ordering enforcement at the hardware–OS interface. Second, we built a fast and general-purpose constraint solver that automates the analysis of lspec microarchitecture specifications. Third, as a case study, we built a sophisticated model of an Intel Sandy-Bridge-like processor running a Linux-like OS, and using that model we analyzed various translation-related memory ordering scenarios of interest. Finally, we identified cases in which transistency goes beyond the traditional scope of consistency: where even SC-for-VAMC7 is insufficient. Overall, our work offers a rigorous yet practical framework for memory ordering verification, and it broadens the very scope of memory ordering as a field. The full toolset is open source.10

Enhanced Litmus Tests
Litmus tests are small stylized programs testing some aspect of a memory model. Each test proposes an outcome: the value returned by each load plus the final value at each memory location, or some relevant subset thereof. The rules of a memory model determine whether an outcome is permitted or forbidden. Consider Figure 1a: as written, x and y appear to be distinct addresses. Under that assumption, the proposed outcome is observable even under a model as strict as sequential consistency (SC),11 because the event interleaving shown in Figure 1b produces that outcome. If instead x and y are actually synonyms (both map to the same physical address), as in Figure 1c, the test is forbidden by SC, because then no interleaving of the threads produces the proposed outcome. While simple, this example highlights how memory ordering verification is fundamentally incomplete unless it explicitly accounts for address translation.

(a) Initially: [x] = 0, [y] = 0
    Thread 0: St [x] ← 1;  Ld [y] → r1
    Thread 1: St [y] ← 2;  Ld [x] → r2
    Proposed outcome: r1 = 2, r2 = 1

(b) Initially: [x] = 0, [y] = 0
    Thread 0: St PA1 ← 1;  Ld PA2 → r1
    Thread 1: St PA2 ← 2;  Ld PA1 → r2
    Outcome r1 = 2, r2 = 1 permitted

(c) Initially: [x] = 0, [y] = 0
    Thread 0: St PA1 ← 1;  Ld PA1 → r1
    Thread 1: St PA1 ← 2;  Ld PA1 → r2
    Outcome r1 = 2, r2 = 1 forbidden

Figure 1. A litmus test showing how virtual memory interacts with memory ordering. (a) Litmus test code. (b) A possible execution showing how the proposed outcome is observable if x and y point to different physical addresses. (c) The outcome is forbidden if x and y point to the same physical address (only one possible interleaving among many is shown).
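To make the role of translation concrete, here is a minimal sketch (ours, written in Python purely for illustration; it is not part of the COATCheck toolset) that enumerates every sequentially consistent interleaving of the Figure 1 test and checks whether the proposed outcome r1 = 2, r2 = 1 is reachable, first with x and y mapped to distinct physical addresses and then with both mapped to the same one.

    from itertools import permutations

    # Each instruction: (thread, op, virtual_addr, value_or_register)
    PROGRAM = [
        (0, "st", "x", 1), (0, "ld", "y", "r1"),
        (1, "st", "y", 2), (1, "ld", "x", "r2"),
    ]

    def sc_outcomes(va_to_pa):
        """Return the set of (r1, r2) results reachable under SC,
        given a virtual-to-physical mapping."""
        results = set()
        for order in permutations(range(len(PROGRAM))):
            # Keep only interleavings that preserve each thread's program order.
            last_seen = {}
            ok = True
            for idx in order:
                t = PROGRAM[idx][0]
                if last_seen.get(t, -1) > idx:
                    ok = False
                    break
                last_seen[t] = idx
            if not ok:
                continue
            mem = {}      # physical address -> value (initially 0)
            regs = {}
            for idx in order:
                _, op, va, arg = PROGRAM[idx]
                pa = va_to_pa[va]
                if op == "st":
                    mem[pa] = arg
                else:
                    regs[arg] = mem.get(pa, 0)
            results.add((regs["r1"], regs["r2"]))
        return results

    distinct = {"x": "PA1", "y": "PA2"}   # as in Figure 1b
    synonym  = {"x": "PA1", "y": "PA1"}   # as in Figure 1c
    print((2, 1) in sc_outcomes(distinct))  # True: outcome permitted
    print((2, 1) in sc_outcomes(synonym))   # False: outcome forbidden

Running it prints True for the distinct mapping and False for the synonym mapping, matching Figures 1b and 1c.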

The basic unit of testing in COATCheck is the enhanced litmus test (ELT). ELTs extend traditional litmus tests by adding address translation, memory (re)mappings, interrupts, and other system-level operations relevant to memory ordering. In addition, just as a traditional litmus test outcome specifies the values returned by loads, ELTs also consider the physical addresses used by each VM access to be part of the outcome. Finally, ELTs include "ghost instructions" that model lower-than-ISA operations (such as microcode and page table walks) executed by hardware, even if these instructions are not fetched, decoded, or issued as part of the normal ISA-level instruction stream. These features give ELTs sufficient expressive power to test all aspects of memory ordering enforcement as it relates to address translation.

The COATCheck toolflow provides automated methods to create ELTs from user-provided litmus tests plus other system-level annotation guidance. We describe this conversion process below.

OS Synopses
OS activities such as TLB shootdowns and memory (re)mappings are captured within ELTs as sequences of loads, stores, system calls, and/or interrupts. An OS synopsis specifies a mapping from each system call into a sequence of micro-ops that capture the effects of that system call on ordering and address translation. When the system call contains an interprocessor interrupt (IPI), the OS synopsis also instantiates predefined interrupt handler threads on interrupt-receiving cores.

For example, an OS synopsis might expand the mprotect call of Figure 2a into the shaded instructions of Figure 2b. The call itself expands into four instructions: one updates the page table entry, one invalidates the local TLB, one sends an IPI, and one waits for the IPI to be acknowledged. The OS synopsis also produces the interrupt handler (Thread 1b), which performs its own local TLB invalidation before responding to the initiating thread.

Microarchitecture Synopses
As with the OS synopses, microarchitecture synopses map each instruction onto a microcode sequence that includes ghost instructions such as page table walks. Not every instruction actually triggers a page table walk, so these ghost instructions are instantiated only as needed during the analysis.

For example, Figure 2b is transformed into the ELT of Figure 2c by the addition of the gray-shaded ghost instructions. In this example, Thread 0's store to [x] requires a page table walk, because the TLB entry for that virtual address would have been invalidated by the preceding invlpg instruction. Furthermore, because the page was originally clean, ghost instructions also model how hardware marks the page dirty at that point. Finally, the microarchitecture synopsis adds to Thread 1b a microcode preamble containing ghost instructions to receive the interrupt, save state, and disable nested interrupts. In this example, hardware is responsible for saving state, but software is responsible for restoring it. This again highlights the degree of collective responsibility between hardware and OS for ensuring ordering correctness.
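As a rough illustration only, the following Python sketch mimics what an OS synopsis produces for mprotect; the micro-op names and the function itself are our own shorthand for the expansion described above, not COATCheck's actual input format.

    # Hypothetical, simplified rendering of an OS synopsis: each system call
    # maps to a sequence of micro-ops, and IPIs instantiate handler threads.

    def mprotect_synopsis(vaddr, receiving_cores):
        """Expand mprotect(vaddr) as in Figure 2b: update the PTE, invalidate
        the local TLB entry, send an IPI, and wait for acknowledgments."""
        initiator = [
            ("st",       f"PTE({vaddr})", "R/W"),  # update the page table entry
            ("invlpg",   vaddr),                   # invalidate the local TLB entry
            ("send_ipi", receiving_cores),         # trigger remote TLB shootdowns
            ("wait_acks", receiving_cores),        # wait for every handler's ack
        ]
        # One predefined interrupt handler thread per interrupt-receiving core,
        # mirroring Thread 1b: invalidate locally, acknowledge, return.
        handlers = {
            core: [("invlpg", vaddr), ("send_ack", core), ("iret",)]
            for core in receiving_cores
        }
        return initiator, handlers

    initiator_ops, handler_threads = mprotect_synopsis("x", receiving_cores=[1])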
Although pre- (a) vious work derived lhb graphs using hard- Initially: [x] = 0, [y] = 0 8 9 coded notions of pipelines and caches, Core 0/Thread 0 Core 1/Thread 1a Core 1/Thread 1b lspec models provide a completely general- St [z/PTE (x)] ← R/W St [y] ← 1 invlpg [x] purpose language for drawing lhb graphs invlpg [x] Ld [x] → 0 Send Ack tailored to any arbitrary system design. We Send IPI iret Wait for Acks provide a detailed example of the lspec syn- St [x] ← 1 tax in the next section. Ld [y] → 0 Hardware memory models today tend to Depicted outcome permitted be either axiomatic, where an outcome is (b) permitted if and only if it simultaneously sat- Initially: [x] = 0, [y] = 0 isfies all of the axioms of the model, or opera- Core 0/Thread 0 Core 1/Thread 1a Core 1/Thread 1b tional, where an outcome is permitted only St [z/PTE (x)] ← R/W St [y] ← 1 IPI Receive if it matches the outcome of some series of invlpg [x] Ld [x] → 0 Save state Send IPI Ld PML4E (x) disable ints execution steps on an abstract “golden hard- Wait for Acks Ld PDPTE (x) invlpg [x] ware model.” lspec models are axiomatic: a Ld PML4E (x) Ld PDE (x) Send ACK lhb graph represents a legal test execution if Ld PDPTE (x) Ld PTE (x) iret Ld PDE (x) and only if it is acyclic and satisfies all of the LdAtomic PTE (x) → clean constraints in the model. Each hardware or StAtomic PTE (x) ← dirty software component designer provides an St [x] ← 1 → independent set of lhb graph axioms which Ld [y] 0 that component guarantees to maintain. The Depicted outcome permitted conjunction of these axioms forms the overall (c) lspec model. This modularity means that components can be added, changed, or Figure 2. Traditional litmus tests are expanded into enhanced litmus tests removed as necessary without affecting any (ELTs). (a) A traditional litmus test with an mprotect system call added. of the other components. (b) The userþkernel version of the litmus test. On core 1, threads 1a and 1b Although they are inherently axiomatic, will be interleaved dynamically. “R/W” indicates that the page table entry lspec models capture the best of the opera- (PTE) R/W bit will be set. (c) The ELT. Page table accesses for [y], tional approach as well. A total ordering of accessed bit updates, and so forth are not depicted but will be included in the nodes in an acyclic lhb graph is also anal- the analysis. ogous to the sequence of execution steps in an operational model. This analogy lets lhb graphs retain much of the intuitiveness of then checked against the architecture-level 1 operational models while simultaneously specification to ensure correctness. retaining the scalability of axiomatic models. As such, lhb graphs are useful not only for System Model Case Study transistency models but also more generally In this section, we present an in-depth case for software and hardware memory models. study of how hardware and software design- The COATCheck constraint solver is ers can use COATCheck and lspec to model inspired by SAT and SMT solvers. It searches a high-performance out-of-order processor to find any lhb graph that satisfies all of the and OS. Our case study has three parts. The constraints of a given lspec model applied to first is a lspec model called SandyBridge that some ELT. If one can be found, the proposed describes an out-of-order processor based on ELT outcome is observable. If not, the pro- public documentation of and patents relating posed outcome is forbidden. This result is to Intel’s Sandy Bridge microarchitecture...... 

System Model Case Study
In this section, we present an in-depth case study of how hardware and software designers can use COATCheck and lspec to model a high-performance out-of-order processor and OS. Our case study has three parts. The first is a lspec model called SandyBridge that describes an out-of-order processor based on public documentation of and patents relating to Intel's Sandy Bridge microarchitecture. The second is the microarchitecture synopsis, which specifies how ghost instructions such as page table walks behave on SandyBridge. The third is an OS synopsis inspired by Linux's implementations of system calls and interrupt handlers. We offer in-depth model highlights in this article; see our full paper for additional detail.12

Memory Dependency Prediction and Disambiguation
SandyBridge uses a sophisticated, high-performance virtually and physically addressed store buffer (SB). This decision was intentional: a virtual-only SB would be unable to detect virtual address synonyms, whereas a physical-only SB would place the TLB onto the critical path for SB forwarding. The SandyBridge SB instead splits the forwarding process into two parts: a prediction stage tries to preemptively anticipate physical address matches, and a disambiguation stage later ensures that all predictions were correct. This pairing keeps the TLB off the critical path without giving up the ability to detect synonyms.

The mechanism works as follows. All stores write their virtual address and data into the SB in parallel with accessing the TLB. Once the TLB provides it, the physical address is written into the SB as well. Each load, in parallel with accessing the TLB, writes the lower 12 bits (the "index bits") of its virtual address into a CAM-based load buffer storing uncommitted loads. The load then compares its index bits against those of all older stores in the SB. If an index match is found, the load then compares its virtual tag, and potentially its physical tag, against the stores. If there is a tag match, the youngest matching store forwards its data to the load. If no match is found, the load proceeds to the cache. If there is an empty slot because the load executed out of order before an earlier store, then the load predicts that there is no dependency. This prediction is later checked during disambiguation: before each store commits, it checks the load buffer to see if any younger loads matching the same physical address have speculatively executed before it. If so, it squashes and replays those mispredicted loads.
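A compact way to summarize the prediction-stage decision is as a search over older stores, index bits first, then virtual tag, then physical tag. The Python sketch below is our paraphrase of the mechanism described above (the field names and example values are illustrative, not taken from vendor documentation); disambiguation would later re-check physical addresses before each store commits.

    PAGE_OFFSET_BITS = 12   # the low 12 "index bits" survive translation

    def index_bits(addr):
        return addr & ((1 << PAGE_OFFSET_BITS) - 1)

    def vtag(addr):
        return addr >> PAGE_OFFSET_BITS

    def sb_forward(load_va, load_pa, store_buffer):
        """Prediction-stage forwarding decision for one load.
        store_buffer holds older stores, oldest first, as dicts with keys
        'va', 'pa' (None if the TLB has not answered yet), and 'data'.
        Returns the forwarded data, or None if the load goes to the cache."""
        for store in reversed(store_buffer):        # youngest older store first
            if index_bits(store["va"]) != index_bits(load_va):
                continue                            # index miss: no conflict here
            if vtag(store["va"]) == vtag(load_va):
                return store["data"]                # virtual tag hit
            ptag_match = (store["pa"] is not None and
                          store["pa"] >> PAGE_OFFSET_BITS ==
                          load_pa >> PAGE_OFFSET_BITS)
            if ptag_match:
                return store["data"]                # synonym caught by physical tag
        return None   # no match found (or no dependency predicted); the
                      # disambiguation stage re-checks physical addresses later

    older_stores = [{"va": 0x7F001234, "pa": 0x1234, "data": 0xAB}]
    print(sb_forward(load_va=0x7F001234, load_pa=0x1234, store_buffer=older_stores))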

The following lspec snippet shows a portion of the SandyBridge lspec model capturing a case in which a load has an index match, a virtual tag miss, and a physical tag match with a previous store.

    DefineMacro "StoreBufferForwardPTag":
    exists microop "w", (
      SameCore w i /\ IsAnyWrite w /\ ProgramOrder w i /\
      SameIndex w i /\ ~(SameVirtualTag w i) /\
      SamePhysicalTag w i /\ SameData w i /\
      EdgesExist [
        ((w, SB-VTag/Index/Data), (i, LB-SB-IndexCompare),   "SBEntryIndexPresent");
        ((w, SB-PTag),            (i, LB-SB-PTagCompare),    "SBEntryPTagPresent");
        ((i, SB-LB-DataForward),  (w, (0, MemoryHierarchy)), "BeforeSBEntryLeaves");
        ((i, LB-SB-IndexCompare), (i, LB-SB-VTagCompare),    "path");
        ((i, LB-SB-VTagCompare),  (i, LB-SB-PTagCompare),    "path");
        ((i, LB-PTag),            (i, LB-SB-PTagCompare),    "path");
        ((i, LB-SB-PTagCompare),  (i, SB-LB-DataForward),    "path");
        ((i, SB-LB-DataForward),  (i, WriteBack),            "path")
      ] /\
      ExpandMacro STBNoOtherMatchesBetweenSrcAndRead
    ).

The first set of predicates narrows the axiom down to apply to the scenario we described earlier. The edges listed in the EdgesExist predicate then describe the associated memory ordering constraints. The first three ensure that write w is still in the SB when load i searches for it, and the rest describe the path that i itself takes through the microarchitecture. Finally, the axiom also checks (using a macro defined elsewhere) that the store is in fact the youngest matching store in the SB.

Figure 3. A lhb graph for the litmus test in Figure 1 (minus the mprotect), executing on a simple five-stage out-of-order pipeline. Because the graph is acyclic, the execution is observable.

Other Model Details
A second component of our SandyBridge model reflects the functionality of system calls and interrupts as they relate to memory mapping and remapping functions. Although x86 TLB lookups and page table walks are performed by the hardware, x86 TLB coherence is OS-managed. To support this, x86 provides the privileged invlpg instruction, which invalidates the local TLB entry at a given address, along with support for interprocessor interrupts (IPIs). As a serializing instruction, invlpg forces all previous instructions to commit and drains the SB before fetching the following instruction. invlpg also ensures that the next access to the virtual page invalidated will be a TLB miss, thus forcing the latest version of the corresponding page table entry to be brought into the TLB.

To capture IPIs and invlpg instructions, our Linux OS synopsis expands the system call mprotect into code snippets that update the page table, invalidate the now-stale TLB entry on the current core, and send TLB shootdowns to other cores via IPIs and interrupt handlers that execute invlpg operations on the remote cores. The SandyBridge microarchitecture synopsis captures interrupts by adding ghost instructions that represent the reception of the interrupt and the hardware disabling of nested interrupts before each interrupt handler. All possible interleavings of the interrupt handlers and the threads' code are considered. Figures 2b and 2c depict the effects of both of these synopses.

To model TLB occupancy, the SandyBridge lspec model adds two nodes to the lhb graph to represent TLB entry creation and invalidation, respectively. These are then constrained following the value-in-cache-line (ViCL) mechanism.9 All loads and stores (including ghost instructions) are constrained by the model to access the TLB within the lifetime of some matching TLB entry.

Page table walks are also instantiated by the microarchitecture synopsis as a set of ghost instruction loads of the page table entry. Because these are generated by dedicated hardware, the SandyBridge lspec model does not draw nodes such as Fetch and Dispatch for these instructions, because they do not pass through the pipeline. Furthermore, because the page table walk loads are not TSO-ordered, they do not search the load buffer. They are, however, ordered with respect to invlpg.

Our SandyBridge model also captures the accessed and dirty bits present in the page table and TLB. When an accessed or dirty bit needs to be updated, the pipeline waits until the triggering instruction reaches the head of the reorder buffer. At that point, the processor injects microcode (modeled via ghost LOCKed read-modify-write [RMW] instructions) implementing the update. The ghost instructions in a status bit update do traverse the Dispatch, Issue, and Commit stages, unlike the ghost page table walks, because the status bit updates do propagate through most of the pipeline and affect architectural state. The model also uses lhb edges to ensure that the update is ordered against all other instructions.
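One way to picture the ViCL-style occupancy rule is as an interval check: every access that uses a translation must fall within the lifetime of some matching TLB entry, delimited by its creation and invalidation events. The sketch below is our own Python illustration of that check, not lspec syntax.

    def tlb_access_satisfied(access_time, vpage, tlb_entries):
        """tlb_entries: list of dicts with 'vpage', 'create', 'invalidate'
        (event positions in some total order consistent with the lhb graph).
        The access is legal only if it lands inside the lifetime of a matching
        entry, mirroring the TLBEntryCreate/TLBEntryInvalidate node pair."""
        return any(e["vpage"] == vpage and e["create"] < access_time < e["invalidate"]
                   for e in tlb_entries)

    entries = [{"vpage": "x", "create": 3, "invalidate": 9}]
    print(tlb_access_satisfied(5, "x", entries))   # True: within the entry's lifetime
    print(tlb_access_satisfied(12, "x", entries))  # False: entry already invalidated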

situation on SandyBridge. Figure 4b shows i0.0i0.1i1.0 i1.1 i2.0 i2.1 i3.0 i3.1 the code itself. If load (i3.0) executes out Fetch of order, it finds that the SB contains no Dispatch previous entries with the same index; this is Issue captured by a lhb edge between (i3.0, AGU LB-SB-IndexCompare) and (i2.0, AccessTLB SB-VTag/Index/Data). However, when TLBEntryCreate the store (i2.0) does eventually execute, it TLBEntryInvalidate will squash (i3.0) unless the load buffer has SB-VTag/Index/Data i3.0 LB-Index no index matches—that is, if ( ) has not LB-SB-IndexCompare yet entered the load buffer. The lhb edge LB-SB-VTagCompare from (i2.0, LBSearch)backto(i3.0, SB-PTag LB-Index) completes the cycle, which rules LB-PTag out the execution. LB-SB-PTagCompare SB-LB-DataForward Page Remappings AccessCache Figure 5 reproduces and extends the key CacheLineInvalidated example studied by Bogdan Romanescu and WriteBack colleagues:7 thread 0 changes the mapping LBSearch for x (i0.0), triggers a TLB shootdown Commit (i2.0), and sends a message to thread 1 LeaveStoreBuffer (i4.0). Thread 1 receives the message MemoryHierarchy (i7.0) and is hypothesized to write to x (a) (i8.0) using the old, stale mapping (a situa- Initially: [x] = 0, [y] = 0 tion COATCheck should be expected to rule VA x → PA a (R/W, accessed, dirty) out). Thread 1 (i9.0) sends a message back VA y → PA a (R/W, accessed, dirty) to thread 0 (i5.0), which checks (i6.0) Core 0/Thread 0 Core 1/Thread 1 that the value at x (according to the new map- (i0.0) St [x/a] ← 1 (i2.0) St [y/a] ← 2 ping) was not overwritten by the thread 1 (i0.1) Ld PTE [x] (i2.1) Ld PTE (y) store (i8.0),whichusedtheoldmapping. (i1.0) Ld [y/a] → r1 (i3.0) Ld [x/a] → r2 The lhb graph generated for this scenario (i1.1) Ld PTE (y) (i3.1) Ld PTE [x] (Figure 5a) is also cyclic, showing how COAT- Outcome r1 = 2, r2 = 1 forbidden Check does in fact rule out the execution of (b) Figure 5b. The graph also simultaneously demonstrates many COATCheck features, Figure 4. Analyzing litmus test n5 with COATCheck. (a) The lhb graph, with such as IPIs, handlers, microcode, and fences, the cycle shown with thicker edges. (b) The ELT code. and it shows COATCheck’s ability to scale up to large and highly nontrivial problems. Analysis and Verification Examples Transistency versus Consistency In this section, we present three test cases for Our third example focuses on status bits and our SandyBridge model. synonyms. Status bits are tracked per virtual- to-physical mapping rather than per physical Store Buffer Forwarding page, and so the OS is responsible for track- Test n5 (see Figure 4) checks the SB’s ability ing the status of synonyms. In this example, to detect synonyms. If a synonym is misde- suppose the OS intends to swap out to disk a tected, one of the loads (i1.0 or i3.0) clean page that is a synonym of some dirty might be allowed to bypass the store (i0.0 page. If it fails to check the status bits for that or i2.0) before it, leading to an illegal out- synonym, it might think that the page is come. Also pictured are the TLB access ghost clean and hence that it can be safely swapped instructions associated with each ISA-level out without being first written back. instruction. Figure 4a shows one of the lhb Notably, in this example, the bug may be graphs COATCheck uses to rule out such a observable even when there is no reordering ...... 
94 IEEE MICRO i0.0 i0.1i1.0 i2.0i2.1i3.0 i4.0 i5.0 i6.0 i6.1 i7.0 i8.0 i8.1i9.0 i10.0i11.0 i12.0 i13.0 i14.0 i15.0 Fetch Dispatch Issue AGU AccessTLB TLBEntryCreate TLBEntryInvlidate SB-VTag/Index/Data LB-Index LB-SB-IndexCompare LB-SB-VTagCompare SB-PTag LB-PTag LB-SB-PTagCompare SB-LB-DataForward AccessCache CacheLineInvalidated WriteBack LBSearch Commit LeaveStoreBuffer MemoryHierarchy (a) Initially: [x] = 0, VA x → PA a (R/W, accessed, dirty) (other initial mapping not shown) Core 0 Core 1 Thread 0 Thread 1a (i0.0) St [z/PTE (x)] ← (i7.0) Ld [y/c] → 2 (VA x → PA b) (i8.0) St [x/a] ← 3 (i0.1) Ld PTE [x] (i8.1) Ld PTE [x] → TLB (i1.0) invlpg [x] (i9.0) St [y/c] ← 4 (i2.0) St [w/APIC] ← mrf Thread 1b (i2.1) Ld PTE(w) → TLB (i10.0) Ld [w/APIC] → mrf (i3.0) Ld [v/d] → ack (i11.0) Ld EFLAGS → (IF) (i4.0) St [y/c] ← 2 (i12.0) St EFLAGS ← (!IF) (i5.0) Ld [y/c] → 4 (i13.0) invlpg [x] (i6.0) Ld [x/b] → 1 (i14.0) St [v/d] ← ack (i6.1) Ld PTE [x] → TLB (i15.0) iret Depicted outcome forbidden (b)

Figure 5. Litmus test ipi8.7 (a) Because the graph is cyclic (thick edges), the outcome is forbidden. In this case, the cycle was found before the PTEs for y were even enumerated. (b) The ELT code.

of any kind taking place, even under virtual- SandyBridge model (including the case stud- and/or physical-address sequential consis- ies discussed earlier). On an Intel Xeon E5- tency.7 Because the checks of the two syno- 2667-v3 CPU, all 118 tests completed in nym page mappings are to different virtual fewer than 100 seconds, and many were even and physical addresses, the necessary ordering faster. Although these lhb graphs are often cannot even be described by VAMC. This an order of magnitude larger than those example highlights a key way in which tran- studied by prior tools analyzing lhb sistency models are inherently broader in graphs,8,9 the runtimes are similar. This dem- scope than consistency models. onstrates the benefits of combining the lspec We tested COATCheck on 118 litmus DSL with an efficient dedicated solver. It also tests, many of which come from Intel and points to the feasibility of providing transis- AMD manuals and from prior work,1 and tency verification fast enough to support others that are handwritten to stress the interactive design and debugging...... MAY/JUNE 2017 95 ...... TOP PICKS

ith COATCheck, we were able to Proc. 29th ACM SIGPLAN Conf. Program- W successfully identify, model, and ver- ming Language Design and Implementa- ify a number of interesting scenarios at the tion, 2008, pp. 68–78. intersection of memory consistency models 4. S. Sarkar et al., “Understanding POWER and address translation. However, many Multiprocessors,” Proc. 32nd ACM SIG- important challenges remain; COATCheck PLAN Conf. Programming Language Design only scratches the surface of the complete set and Implementation, 2011, pp. 175–186. of phenomena that can arise at the OS and 5. P. Sewell et al., “x86-TSO: A Rigorous and microarchitecture layers. For example, a nat- Usable Programmer’s Model for x86 Multi- ural next step might be to extend COAT- processors,” Comm. ACM, vol. 53, no. 7, Check to model virtual machines and 2010, pp. 89–97. hypervisors of arbitrary depth. Generally, we 6. M. Clark, “A New, High Performance x86 hope and expect that future work in the area Core Design from AMD,” Hot Chips 28 will build on top of COATCheck to create Symp., 2016; www.hotchips.org/archives more complete and more rigorous transistency /2010s/hc28. models that can capture an ever-growing set of system-level behaviors and bugs. 7. B. Romanescu, A. Lebeck, and D.J. Sorin, We also envision COATCheck becoming “Address Translation Aware Memory Con- more integrated with top-to-bottom memory sistency,” IEEE Micro, vol. 31, no. 1, 2011, ordering verification tools. We hope that one pp. 109–118. day verification tools will cohesively span the 8. D. Lustig, M. Pellauer, and M. Martonosi, full computing stack, from programming “Verifying Correct Microarchitectural Enforce- languages all the way down to register trans- ment of Memory Consistency Models,” IEEE fer level, thereby giving programmers and Micro, vol. 35, no. 3, 2015, pp. 72–82. architects much more confidence in the cor- 9. Y. Manerkar et al., “CCICheck: Using lhb rectness of their code and systems. These Graphs to Verify the Coherence-Consis- goals will only become more challenging as tency Interface,” Proc. 48th Int’l Symp. systems grow more heterogeneous and more Microarchitecture, 2015, pp. 26–37. complex over time. However, COATCheck 10. Check Verification Tool Suite; http://check provides a rigorous and scalable roadmap for .cs.princeton.edu. understanding how such systems can be 11. L. Lamport, “How to Make a Multiprocessor understood rigorously, and as such we hope Computer That Correctly Executes Multiproc- that future work finds COATCheck and its ess Programs,” IEEE Trans. Computers,vol. lspec modeling language to be useful build- 28, no. 9, 1979, pp. 690–691. ing blocks for continued research into the area. MICRO 12. D. Lustig et al., “COATCheck: Verifying Memory Ordering at the Hardware-OS ...... Interface,” Proc. 21st Int’l Conf. Architec- References tural Support for Programming Languages 1. J. Alglave, L. Maranget, and M. Tautschnig, and Operating Systems, 2016, pp. 233–247. “Herding Cats: Modelling, Simulation, Test- ing, and Data Mining for Weak Memory,” Daniel Lustig is a research scientist at Nvi- ACM Trans. Programming Languages and dia. His research interests include computer Systems, vol. 36, no. 2, 2014; doi:10.1145 architecture and memory consistency mod- /2627752. els. Lustig received a PhD in electrical engi- 2. M. Batty et al., “Clarifying and Compiling C/ neering from Princeton University, where he Cþþ Concurrency: From Cþþ 11 to performed the work for this article. He is a POWER,” Proc. 39th Ann. 
ACM SIGPLAN- member of IEEE and ACM. Contact him at SIGACT Symp. Principles of Programming [email protected]. Languages, 2012, pp. 509–520. 3. H.-J. Boehm and S.V. Adve, “Foundations Geet Sethi is a PhD student in the Depart- of the Cþþ Concurrency Memory Model,” ment of Computer Science at Stanford ...... 96 IEEE MICRO University. His research interests include mobile computing, with an emphasis on serverless computing, machine learning, and power-efficient heterogeneous systems. computer architecture. Sethi received a BS Martonosi has a PhD in electrical engineer- in computer science and mathematics from ing from Stanford University. She is a Fellow Rutgers University, where he performed the of IEEE and ACM. Contact her at mrm@ work for this article. He is a student member princeton.edu. of IEEE and ACM. Contact him at geet@ cs.stanford.edu.

Abhishek Bhattacharjee is an associate professor in the Department of Computer Science at Rutgers University. His research interests span the hardware–software interface. Bhattacharjee received a PhD in electrical engineering from Princeton University. He is a member of IEEE and ACM. Contact him at [email protected].

Margaret Martonosi is the Hugh Trumbull Adams '35 Professor of Computer Science at Princeton University. Her research interests include computer architecture and

TOWARD A DNA-BASED ARCHIVAL STORAGE SYSTEM


STORING DATA IN DNA MOLECULES OFFERS EXTREME DENSITY AND DURABILITY ADVANTAGES THAT CAN MITIGATE EXPONENTIAL GROWTH IN DATA STORAGE NEEDS. THIS ARTICLE PRESENTS A DNA-BASED ARCHIVAL STORAGE SYSTEM, PERFORMS WET LAB EXPERIMENTS TO SHOW ITS FEASIBILITY, AND IDENTIFIES TECHNOLOGY TRENDS THAT POINT TO INCREASING PRACTICALITY.

James Bornholt, University of Washington
Randolph Lopez, University of Washington
Douglas M. Carmean, Microsoft
Luis Ceze, University of Washington
Georg Seelig, University of Washington
Karin Strauss, Microsoft Research

The "digital universe" (all digital data worldwide) is forecast to grow to more than 16 zettabytes in 2017.1 Alarmingly, this exponential growth rate easily exceeds our ability to store it, even when accounting for forecast improvements in storage technologies such as tape (185 terabytes2) and optical media (1 petabyte3). Although not all data requires long-term storage, a significant fraction does: Facebook recently built a datacenter dedicated to 1 exabyte of cold storage.4

Synthetic (manufactured) DNA sequences have long been considered a potential medium for digital data storage because of their density and durability.5–7 DNA molecules offer a theoretical density of 1 exabyte per cubic millimeter (eight orders of magnitude denser than tape) and half-life durability of more than 500 years.8 DNA-based storage also has the benefit of eternal relevance: as long as there is DNA-based life, there will be strong reasons to read and manipulate DNA.

Our paper for the 2016 Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) proposed an architecture for a DNA-based archival storage system.9 Both reading and writing a synthetic DNA storage medium involve established biotechnology practices. The write process encodes digital data into DNA nucleotide sequences (a nucleotide is the basic building block of DNA), synthesizes (manufactures) the corresponding DNA molecules, and stores them away. Reading the data involves sequencing (reading) the DNA molecules and decoding the information back to the original digital data (see Figure 1).

Progress in DNA storage has been rapid: in our ASPLOS paper, we successfully stored and recovered 42 Kbytes of data; since publication, our team has scaled our process to store and recover more than 200 Mbytes of data.10,11 Constant improvement in the scale of DNA storage—at least two times per year—is fueled by exponential reduction in synthesis and sequencing cost and latency; growth in sequencing productivity eclipses even Moore's law.12 Further growth in the biotechnology industry portends orders of magnitude cost reductions and efficiency improvements.

We think the time is ripe to seriously consider DNA-based storage and explore system designs and architectural implications. Our ASPLOS paper was the first to address two fundamental challenges in building a viable DNA-based storage system. First, how should such a storage medium be organized? We demonstrate the tradeoffs between density, reliability, and performance by envisioning DNA storage as a key-value store. Multiple key-value pairs are stored in the same pool, and multiple such pools are physically arranged into a library. Second, how can data be recovered efficiently from a DNA storage system? We show for the first time that random access to DNA-based storage pools is feasible by using a polymerase chain reaction (PCR) to amplify selected molecules for sequencing. Our wet lab experiments validate our approach and point to the long-term viability of DNA as an archival storage medium.

Figure 1. Using DNA for digital data storage. Writes to DNA first encode digital data as nucleotide sequences and then synthesize (manufacture) molecules. Reads from DNA first sequence (read) the molecules and then decode back to digital data.

System Design
A DNA storage system (see Figure 2) takes data as input, synthesizes DNA molecules to represent that data, and stores them in a library of pools. To read data back, the system selects molecules from the pool, amplifies them with PCR (a standard process from biotechnology), and sequences them back to digital data. We model the DNA storage system as a key-value store, in which input data is associated with a key, and read operations identify the key they wish to recover.

Figure 2. Overview of a DNA storage system. Stored molecules are arranged in a library of pools.

Writing to DNA storage involves encoding binary data as DNA nucleotides and synthesizing the corresponding molecules. This process involves two non-trivial steps. First, although there are four DNA nucleotides (A, C, G, T) and so a conversion from binary appears trivial, we instead convert binary data to base 3 and employ a rotating encoding from ternary digits to nucleotides.7 This encoding avoids homopolymers—repetitions of the same nucleotide—that significantly increase the chance of errors.

Second, DNA synthesis technology effectively manufactures molecules one nucleotide at a time, and cannot synthesize molecules of arbitrary length without error. A reasonably efficient strand length for DNA synthesis is 120 to 150 nucleotides, which gives a maximum of 237 bits of data in a single molecule using this ternary encoding. The write process therefore fragments input data into small blocks that correspond to separate DNA molecules. This blocking approach also enables added redundancy. Previous work overlapped multiple small blocks,7 but our experimental and simulation results show this approach to sacrifice too much density for little gain. Our ASPLOS experiments instead used an XOR encoding, in which each consecutive pair of blocks is XORed together to form a third redundancy block. Although this encoding is simple, we showed that it achieves similar redundancy properties to existing approaches with much less density overhead. Since publishing this paper, our team has been exploring more sophisticated encodings, such as Reed-Solomon codes.
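A minimal sketch of such a rotating ternary code, assuming a Goldman-style rotation7 in which each ternary digit picks one of the three bases that differ from the previously emitted base, is shown below in Python; the exact tables and per-byte conversion used in the ASPLOS system may differ.

    BASES = "ACGT"

    def to_ternary(data: bytes):
        """Convert bytes to a big-endian stream of base-3 digits (for illustration,
        the whole input is treated as one integer)."""
        n = int.from_bytes(data, "big")
        digits = []
        while n:
            n, d = divmod(n, 3)
            digits.append(d)
        return digits[::-1] or [0]

    def ternary_to_nucleotides(digits, prev="A"):
        """Rotating encoding: each digit selects one of the three bases that
        differ from the previous base, so no homopolymer runs can occur."""
        out = []
        for d in digits:
            choices = [b for b in BASES if b != prev]   # three candidates
            prev = choices[d]
            out.append(prev)
        return "".join(out)

    strand = ternary_to_nucleotides(to_ternary(b"hi"))
    assert all(a != b for a, b in zip(strand, strand[1:]))   # no repeated bases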

Random Access
Reading from DNA storage involves sequencing molecules and decoding their data back to binary (using the inverse of the encoding discussed earlier). In existing work on DNA storage, recovering data meant sequencing all synthesized molecules and decoding all data at once. However, a realistic storage system must offer random access—the ability to select individual files for reading—if it is to be practical at large capacities.

Because DNA molecules do not offer spatial organization like traditional storage media, we must explicitly include addressing information in the synthesized molecules. Figure 3 shows the layout of an individual DNA strand in our system. Each strand contains a payload, which is a substring of the input data to encode. An address includes both a key identifier and an index into the input data (to allow data longer than one strand). At each end of the strand, special primer sequences—which correspond to the key identifier—allow for efficient sequencing during read operations. Finally, two sense nucleotides ("S") help determine the direction and complementarity of the strand during sequencing.

Figure 3. Layout of individual DNA strands. Each strand must carry an explicit copy of its address, because DNA molecules do not offer the spatial organization of traditional storage media.
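To make the addressing scheme concrete, the following Python sketch fragments a value into fixed-size payload blocks, attaches the key identifier and block index to each, tags the strands with a key-derived primer target, and adds an XOR redundancy block for each consecutive pair of blocks. The block size and the primer-derivation function are illustrative assumptions, not the published strand format.

    import hashlib

    BLOCK_BYTES = 16   # illustrative payload size, not the real strand capacity

    def primer_for(key: str) -> str:
        # Stand-in for the key-to-primer mapping ("analogous to a hash function").
        return hashlib.sha1(key.encode()).hexdigest()[:8]

    def make_strands(key: str, value: bytes):
        """Split value into addressed blocks plus XOR redundancy blocks."""
        blocks = [value[i:i + BLOCK_BYTES].ljust(BLOCK_BYTES, b"\0")
                  for i in range(0, len(value), BLOCK_BYTES)]
        primer = primer_for(key)
        strands = [{"primer": primer, "key": key, "index": i, "payload": p}
                   for i, p in enumerate(blocks)]
        # XOR each consecutive pair of blocks into a redundancy block, as in the
        # ASPLOS encoding; any one block of a pair can be rebuilt from the other two.
        for i in range(0, len(blocks) - 1, 2):
            parity = bytes(a ^ b for a, b in zip(blocks[i], blocks[i + 1]))
            strands.append({"primer": primer, "key": key,
                            "index": ("xor", i, i + 1), "payload": parity})
        return strands

    strands = make_strands("foo.jpg", b"example archival payload")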

Our design allows for random access by using PCR, shown in Figure 4. The read process first determines the primers for the given key (analogous to a hash function) and synthesizes them into new DNA molecules. Then, rather than applying sequencing to the entire pool of stored molecules, we first apply PCR to the pool using these primers. PCR amplifies the strands in the pool whose primers match the given ones, creating many copies of those strands. To recover the file, we now take a sample of the product pool, which contains a large number of copies of all the relevant strands but only a few other irrelevant strands. Sequencing this sample therefore returns the data for the relevant key rather than all data in the system.

Figure 4. Polymerase chain reaction (PCR) amplifies selected strands to provide efficient random access. The resulting pool after sampling contains primarily the strands of interest.

Although PCR-based random access is a viable implementation, we don't believe it is practical to put all data in a single pool. We instead envision a library of pools offering spatial isolation. We estimate each pool to contain about 100 Tbytes of data. An address then maps to both a pool location and a PCR primer. Figure 5 shows how the random access described earlier fits in a system with a library of DNA pools. This design is analogous to a magnetic-tape storage library, in which robotic arms are used to retrieve tapes. In our proposed DNA-based storage system, DNA pools could be manipulated and necessary reactions could be automated by fluidics systems.

Wet Lab Experiments
To demonstrate the feasibility of DNA storage with random access, we encoded and had DNA molecules synthesized for four image files totaling 151 Kbytes. We then selectively recovered 42 Kbytes of this image data using our random access scheme. We used both an existing encoding7 and our XOR encoding. We were able to recover files encoded with XOR with no errors. Using the previously existing encoding resulted in a 1-byte error. In total, the encoded files required 16,994 DNA strands, and sequencing produced a total of 20.8 million reads of those strands (with an average of 1,223 reads per DNA strand, or depth of 1,223).

To explore the impact of lower sequencing depth on our results, we performed an


Figure 5. Putting it all together: random access with a pool library for physical isolation. The key data (here, foo.jpg) is used with a hash function to identify the relevant pool within the library. experiment in which we discarded much of the sequencing data (see Figure 6). Lower 100 depth per DNA sequence frees up additional sequencing bandwidth for other DNA sequences, but could omit some strands 75 entirely if they are not sequenced at all. Encoding Despite such omissions, the results show that 50 Goldman we can successfully recover all data using as XOR few as 1 percent of the sequencing results, indicating we could have recovered 100 25 times more data with the same sequencing Per−base accuracy (%) technology. Future sequencing technology is 0 likely to continue increasing this amount. 0.01 0.1 1 10 To inform our coding-scheme design, Reads used (%) we assessed errors in DNA synthesis and sequencing by comparing the sequencing Figure 6. Decoding accuracy as a function of sequencing depth. We output of two sets of DNA sequences with successfully recover all data using as little as 1 percent of the sequencing the original reference data. The first set results, suggesting current sequencing technology can recover up to 100 includes the sequences we used to encode times more data. data, which were synthesized for our storage experiments by a supplier using an array array synthesis (the difference between the method. Errors in these sequencing results two sets). Our results indicate that overall could be caused either by sequencing or syn- errors per base are a little more than 1 percent thesis (or both). The second set includes and that sequencing accounts for most of the DNA that was synthesized by a different sup- error (see Figure 7). plier using a process that’s much more accu- rate (virtually no errors), but also much costlier. Errors in these sequencing results are Technology Trends essentially caused only by the sequencing With demand for storage growing faster process. By comparing the two sets of results, than even optimistic projections of current we can determine the error rate of both technologies, it is important to develop new sequencing (results from the second set) and sustainable storage solutions. A significant ...... MAY/JUNE 2017 101 ...... TOP PICKS

than room temperature). As we showed in ACTGCCT our work, DNA can also support random access, allowing most data to remain at rest until needed. Array synthesis Column synthesis 1.0 Synthesis - Cheap - Expensive error Current DNA technologies do not yet - High error - Zero error offer the throughput necessary to support a practical system—in our experiments, Sequencing 0.5 throughput was on the order of kilobytes per Sequencing error week. But a key reason for choosing DNA Error analysis as storage media, rather than some other bio-

Errors due to per base (%) error Average molecule, is that there is already significant Errors due only to synthesis and sequencing 0 momentum behind improvements to DNA sequencing Array Column manipulation technology. The Carlson curves in Figure 8 compare progress in DNA manip- Figure 7. Analysis of error from synthesis and sequencing. Overall errors per ulation technology (both sequencing and base are little more than 1 percent and are mostly attributable to synthesis) to improvements in transistor den- sequencing. sity.12 Sequencing continues to keep up with, and sometimes outpace, Moore’s law. New technologies such as nanopore sequencing promisetocontinuethisrateofimprovement Transistors on chip 13 1010 in the future. Reading DNA Writing DNA 108 Future Directions Using DNA for data storage opens many 106 research opportunities. In the short term,

Productivity because DNA manipulation is relatively noisy, 104 it requires coding-theoretic techniques to offer reliable behavior with unreliable components. Our team has been working on adopting 2 10 more sophisticated encoding schemes and 1970 1980 1990 2000 2010 better calibrating them to the stochastic Year behavior of molecular storage. DNA storage also involves much higher access latency Figure 8. Carlson curves compare trends in DNA synthesis and sequencing 12 than digital storage media, suggesting new to Moore’s law. Recent growth in sequencing technology outpaces research opportunities in latency hiding and Moore’s law. (Data provided by Robert Carlson.) caching. Finally, the compactness of DNA- based storage, together with the necessity for fraction of the world’s data can be stored in wet access to molecules, could open new archival form. For archival purposes, as long datacenter-level organizations and automation as there is enough bandwidth to write and opportunities for biological manipulation. read data, latency can be high, as is the case In the long term, a last layer of the storage for DNA data storage systems. hierarchy with unprecedented density and Archival storage should be dense to durability opens up the possibility of storing occupy as little space as possible, be very all kinds of records for extended periods of durable to avoid continuous rewriting opera- time. Figure 9 illustrates a possible hierarchy tions, and have low power consumption at with the properties of each layer. Data that rest because it is meant to be kept for long could be preserved for a long time include periods of time. DNA fulfills all these criteria, both system records, such as search and because it is ultra-dense (1 exabyte per cubic security logs, as well as human records, such inch for a practical system), is very durable as health and historical data in textual, audio, (millennia scale), and has low power require- and video formats. Besides its obvious uses in ments (keep it dark, dry, and slightly cooler disaster recovery, this opportunity could one ...... 102 IEEE MICRO Access time Durability Capacity Flash µs–ms ~5 yrs Tbytes HDD 10s of ms ~5 yrs 100s of Tbytes Tape Minutes ~15–30 yrs Pbytes DNA storage Hours Centuries Zbytes

Figure 9. A possible storage system hierarchy. DNA storage is a promising new bottom layer, offering higher density and durability at the cost of latency. day be a great contributor to the field of digi- this work. This material is based on work tal archeology, the study of human history supported by the National Science Founda- through “ancient” digital data. tion under grant numbers 1064497 and 1409831, by gifts from Microsoft Research, he success of the initial project, pub- and by the David Notkin Endowed Gradu- T lished in our ASPLOS paper, motivated ate Fellowship. us to significantly expand our efforts to explore DNA-based data storage. We formed ...... the Molecular Information Systems Lab References (MISL), with members from the University 1. “Where in the World Is Storage: Byte of Washington and Microsoft Research. Density Across the Globe,” IDC, 2013; MISL has worked with Twist Bioscience to www.idc.com/downloads/where is storage synthesize a 200-Mbyte DNA pool,11 more infographic 243338.pdf. than three orders of magnitude larger than 2. “Sony Develops Magnetic Tape Technology our ASPLOS results, and an order of magni- with the World’s Highest Recording Density,” 14 tude larger than the prior state of the art. press release, Sony, 30 Apr. 2014; www.sony Some of its more recent efforts include new .net/SonyInfo/News/Press/201404/14-044E. coding schemes, sequencing with nanopore- 3. J. Plafke, “New Optical Laser Can Increase based techniques, and fluidics automation. DVD Storage Up to One Petabyte,” blog, Given the impending limits of silicon 20 June 2013; www.extremetech.com technology, we believe that hybrid silicon /computing/159245-new-optical-laser-can and biochemical systems are worth serious -increase-dvd-storage-up-to-one-petabyte. consideration. Now is the time for architects 4. R. Miller, “Facebook Builds Exabyte Data to consider incorporating biomolecules as an Centers for Cold Storage,” blog, 18 Jan. 2013; integral part of computer design. DNA-based www.datacenterknowledge.com/archives storage is one clear, practical example of this /2013/01/18/facebook-builds-new-data-centers direction. Biotechnology has benefited tre- -for-cold-storage. mendously from progress in silicon technol- ogy developed by the computer industry; 5. G.M. Church, Y. Gao, and S. Kosuri, “Next- perhaps now is the time for the computer Generation Digital Information Storage in industry to borrow back from the biotechnol- DNA,” Science, vol. 337, no. 6102, 2012, ogy industry to advance the state of the art in pp. 1628–1629. computer systems. MICRO 6. C.T. Clelland, V. Risca, and C. Bancroft, “Hiding Messages in DNA Microdots,” Nature, vol. 399, 1999, pp. 533–534. Acknowledgments We thank the members of the Molecular 7. N. Goldman et al., “Towards Practical, High- Information Systems Laboratory for their con- Capacity, Low-Maintenance Information tinuing support of this work. We thank Bran- Storage in Synthesized DNA,” Nature, don Holt, Emina Torlak, Xi Wang, the Sampa vol. 494, 2013, pp. 77–80. group at the University of Washington, and 8. M.E. Allentoft et al., “The Half-Life of DNA the anonymous reviewers for feedback on in Bone: Measuring Decay Kinetics in 158 ...... MAY/JUNE 2017 103 ...... TOP PICKS

Dated Fossils,” Proc. Royal Society of University. Contact him at dcarmean@ London B: Biological Sciences, vol. 279, no. microsoft.com. 1748, 2012, pp. 4724–4733. 9. J. Bornholt et al., “A DNA-Based Archival Luis Ceze is the Torode Family Associate Storage System,” Proc. 21st Int’l Conf. Professor in the Paul G. Allen School of Architectural Support for Programming Computer Science and Engineering at the Languages and Operating Systems University of Washington. His research (ASPLOS), 2016, pp. 637–649. interests include the intersection between computer architecture, programming lan- 10. M. Brunker, “Microsoft and University of guages, and biology. Ceze received a PhD in Washington Researchers Set Record for computer science from the University of Illi- DNA Storage,” blog, 7 July 2016; http:// nois at Urbana–Champaign. Contact him at blogs.microsoft.com/next/2016/07/07 [email protected]. /microsoft-university-washington -researchers-set-record-dna-storage. Georg Seelig is an associate professor in the 11. L Organick et al., “Scaling Up DNA Data Department of Electrical Engineering and Storage and Random Access Retrieval,” the Paul G. Allen School of Computer Sci- bioRxiv, 2017; doi:10.1101/114553. ence and Engineering at the University of 12. R. Carlson, “Time for New DNA Synthesis Washington. His research interests include and Sequencing Cost Curves,” blog, 12 understanding how biological organisms Feb. 2014; www.synthesis.cc/2014/02 process information using complex biochemi- /time-for-new-cost-curves-2014.html. cal networks and how such networks can be 13. “Oxford Nanopore Technologies,” http:// engineered to program cellular behavior. See- nanoporetech.com. lig received a PhD in physics from the Uni- versity of Geneva. Contact him at gseelig@ 14. M. Blawata et al., “Forward Error Correction uw.edu. for DNA Data Storage,” Procedia Computer Science, vol. 80, 2016, pp. 1011–1022. Karin Strauss is a senior researcher at Microsoft Research and an affiliate faculty at James Bornholt is a PhD student in the the University of Washington. Her research Paul G. Allen School of Computer Science interests include studying the application of and Engineering at the University of Wash- biological mechanisms and other emerging ington. His research interests include pro- technologies to storage and computation, gramming languages and formal methods, and building systems that are efficient and focusing on program synthesis. Bornholt reliable with them. Strauss received a PhD received an MS in computer science from in computer science from the University of the University of Washington. Contact him Illinois at Urbana–Champaign. Contact her at [email protected]. at [email protected].

Randolph Lopez is a graduate student in bioengineering at the University of Washington. His research interests include the intersection of synthetic biology, DNA nanotechnology, and molecular diagnostics. Lopez received a BS in bioengineering from the University of California, San Diego. Contact him at [email protected].

Douglas M. Carmean is a partner architect at Microsoft. His research interests include new architectures on future device technology. Carmean received a BS in electrical and electronics engineering from Oregon State


TI-STATES: POWER MANAGEMENT IN ACTIVE TIMING MARGIN PROCESSORS


TEMPERATURE INVERSION IS A TRANSISTOR-LEVEL EFFECT THAT IMPROVES PERFORMANCE WHEN TEMPERATURE INCREASES. THIS ARTICLE PRESENTS A COMPREHENSIVE MEASUREMENT-BASED ANALYSIS OF ITS IMPLICATIONS FOR ARCHITECTURE DESIGN AND POWER MANAGEMENT USING THE AMD A10-8700P PROCESSOR. THE AUTHORS PROPOSE TEMPERATURE-INVERSION STATES (TI-STATES) TO HARNESS THE OPPORTUNITIES PROMISED BY TEMPERATURE INVERSION. THEY EXPECT TI-STATES TO BE ABLE TO IMPROVE THE POWER EFFICIENCY OF MANY PROCESSORS MANUFACTURED IN FUTURE CMOS TECHNOLOGIES.

Yazhou Zu, University of Texas at Austin
Wei Huang
Indrani Paul
Vijay Janapa Reddi, University of Texas at Austin

Temperature inversion refers to the phenomenon that transistors switch faster at a higher temperature when operating under certain regions. To harness temperature inversion's performance benefits, we introduce Ti-states, or temperature-inversion states, for active timing-margin management in emerging processors. Ti-states are frequency, temperature, and voltage triples that enable processor timing-margin adjustments through runtime supply voltage changes. Similar to P-states' frequency-voltage table lookup mechanism, Ti-states operate by indexing into a temperature-voltage table that resembles a series of power states determined by transistors' temperature-inversion effect. Ti-states push greater efficiency out of the underlying processor, specifically in active timing-margin-based processors.

Ti-states are the desired evolution of classical power-management mechanisms, such as P-states and C-states. This evolution is enabled by the growing manifestation of the transistor's temperature-inversion effect as device feature size scales down.

When temperature increases, transistor performance is affected by two factors: a decrease in both carrier mobility and threshold voltage. Reduced carrier mobility causes devices to slow down, whereas reduced threshold voltage causes devices to speed up. When supply voltage is low enough, transistor speed is sensitive to minute threshold voltage changes, which makes the second factor (threshold voltage reduction) dominate. In this situation, temperature inversion occurs.1

In the past, designers have safely discounted temperature inversion because it does not occur under a processor's normal operation. However, as transistor feature size scales down, today's processors are operating close to the temperature inversion's voltage region.
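As a rough illustration of the table-lookup mechanism, the Python sketch below indexes a temperature-voltage table to pick a supply voltage at a fixed frequency; the temperatures and voltages are invented for illustration and are not measured Ti-state values.

    import bisect

    # Hypothetical Ti-state table for one frequency: at higher temperatures,
    # temperature inversion adds timing margin, so a lower voltage suffices.
    TI_STATE_TABLE = [   # (temperature_C, supply_voltage_V), illustrative only
        (20, 0.760),
        (40, 0.745),
        (60, 0.730),
        (80, 0.715),
    ]

    def ti_state_voltage(temperature_c: float) -> float:
        """Index the temperature-voltage table, much like a P-state lookup
        keyed on temperature. Below the first entry, stay at the coldest
        (most conservative) entry's voltage."""
        temps = [t for t, _ in TI_STATE_TABLE]
        pos = bisect.bisect_right(temps, temperature_c) - 1
        if pos < 0:
            return TI_STATE_TABLE[0][1]
        return TI_STATE_TABLE[pos][1]

    print(ti_state_voltage(72))   # -> 0.730 under this made-up table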

Therefore, the speedup benefit of temperature inversion deserves more attention from architects and system operators.

Figure 1a provides a device simulation analysis based on predictive technology models.2,3 We use inflection voltage to denote the crossover point for temperature inversion to occur. Below the inflection voltage is the temperature-inversion region, in which circuits speed up at high temperature. Above the inflection voltage is the noninversion region, in which circuits slow down at high temperature. From 90 nm to 22 nm, the inflection voltage keeps increasing and approaches the processor's nominal voltage. This means temperature inversion is becoming more likely to occur in recent smaller technologies.

Figure 1. Temperature inversion is having more impact on processor performance as technology scales. (a) Temperature inversion was projected to be more common in smaller technologies as its inflection voltage keeps increasing and approaches nominal supply. (b) High temperature increases performance under low voltage due to temperature inversion, compared to conventional wisdom under high voltage.

Our silicon measurement corroborates and strengthens this projection. The measured 28-nm AMD A10-8700P processor's inflection voltage falls within the range of the processor's different P-states. Figure 1b further illustrates temperature inversion by contrasting circuit performance in the inversion and noninversion regions. At 1.1 V, the circuit is slightly slower at a higher temperature while safely meeting the specified frequency, as expected from conventional wisdom. At 0.7 V, however, this circuit becomes faster by more than 15 percent at 80°C as a result of temperature inversion.

Ti-states harness temperature inversion's speedup effect by actively undervolting to save power. Ti-states exploit the fact that the faster circuits offered by temperature inversion add extra margin to the processor's clock cycle. They then calculate the precise amount of voltage that can be safely reduced to reclaim the extra margin. The undervolting decision for each temperature is stored in a table for runtime lookup.

Ti-states are instrumental because they can apply to almost all processors manufactured with today's technologies that manifest a strong temperature-inversion effect, including bulk CMOS, fin field-effect transistor (FinFET), and fully depleted silicon on insulator (FD-SOI). The comprehensive characterization we present in this article is based on rigorous hardware measurement, and it can spawn future work that exploits the temperature-inversion effect.

Measuring Temperature Inversion
We measure temperature inversion on a 28-nm AMD A10-8700P accelerated processing unit (APU).4 The APU integrates two CPU core pairs, eight GPU cores, and other system components. We conduct our study on both the CPU and GPU and present measurements at the GPU's lowest P-state of 0.7 V and 300 MHz, because it has strong temperature inversion. The temperature-inversion effect we study depends on supply voltage but not on the architecture. Thus, we expect the analysis on the AMD-integrated GPU to naturally extend to the CPU and other architectures as well for all processor vendors.

We leverage the APU's power supply monitors (PSMs) to accurately measure circuit speed changes under different conditions.5 Figure 2 illustrates a PSM's structure. A PSM is a time-to-digital converter that reflects circuit time delay in numeric form. Its core component is a ring oscillator that counts the number of inverters an "edge" has traveled through in each clock cycle. When the circuit is faster, an edge can pass more inverters, and a PSM will produce a higher count output. We use a PSM as a means to characterize circuit performance under temperature variation. We normalize the PSM reading to a reference value measured under 0.7 V, 300 MHz, 0°C, and idle chip conditions.

Figure 2. A power supply monitor (PSM) is a ring of inverters inserted between two pipeline latches. It counts the number of inverters an "edge" travels through in one cycle to measure circuit speed.

To characterize the effect of temperature inversion on performance and power under different operating conditions, we carefully regulate the processor's on-die temperature using a temperature feedback control system (see Figure 3). The feedback control checks die temperature measured via an on-chip thermal diode and adjusts the thermal head temperature every 10 ms to set the chip temperature to a user-specified value. Physically, the thermal head's temperature is controlled via a water pipe and a heater that set its surface temperature.

Figure 3. Temperature control setup. The thermal head's temperature is controlled via a water pipe and a heater. The water pipe is connected to an external chiller to offer low temperatures, while the heater increases temperature to reach the desired temperature setting.
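To make that feedback loop concrete, the following is a minimal sketch of such a controller. The sensor and actuator functions, the gains, and the toy first-order thermal model are assumptions for illustration only; they are not the actual test setup's software.

  #include <stdio.h>

  /* Stand-ins for the rig's sensor and actuator (assumed, simulation only). */
  static double head_c = 25.0, die_c = 25.0;
  static double read_thermal_diode_c(void) { return die_c; }
  static void set_thermal_head_c(double t) { head_c = t; die_c += 0.5 * (head_c - die_c); }

  /* One control iteration runs every 10 ms in the real setup. */
  static void regulate_die_temperature(double target_c, int steps) {
      double setpoint = target_c;
      for (int i = 0; i < steps; i++) {
          double err = target_c - read_thermal_diode_c();
          setpoint += 0.2 * err;              /* integrate the temperature error      */
          set_thermal_head_c(setpoint);       /* heater and water pipe move the head  */
      }
  }

  int main(void) {
      regulate_die_temperature(80.0, 200);    /* hold the die near 80 C               */
      printf("die temperature settled at %.1f C\n", die_c);
      return 0;
  }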

The Temperature-Inversion Effect
Temperature inversion primarily affects circuit performance. We first explain temperature inversion's performance impact with respect to supply voltage and temperature. We then extrapolate the power optimization potential offered by temperature inversion. Through our measurement, we make two observations: temperature inversion's speedup effects become stronger with lower voltage, and the speedup can be turned into more than 5 percent undervolting benefits.

Inversion versus Noninversion
We contrast temperature inversion and noninversion effects by sweeping across a wide operating voltage range. Figure 4 shows the circuit speed change under different supply voltages and die temperatures. Speed is reflected by the PSM's normalized output—a higher value implies a faster circuit. We keep the chip idle to avoid any workload disturbance, such as the di/dt effect.

Figure 4 illustrates the insight that temperature's impact on circuit performance depends on the supply voltage. In the high supply-voltage region around 1.1 V, the PSM's reading becomes progressively smaller as the temperature rises from 0°C to 100°C. The circuit operates slower at a higher temperature, which aligns with conventional belief. The reason for this circuit performance degradation is that the transistor's carrier mobility decreases at a higher temperature, leading to smaller switch-on current (Ion) and longer switch time.

Under a lower supply voltage, the PSM's reading increases with higher temperature, which means the circuit switches faster (that is, the temperature-inversion phenomenon). The reason is that the transistor's threshold voltage (Vth) decreases linearly as temperature increases. For the same supply voltage, a lower Vth provides more drive current (Ion), which makes the circuit switch faster. The speedup effect is more dominant when supply voltage is low, because then the supply voltage is closer to Vth, and any minute change of Vth can affect transistor performance.

An "inflection voltage" exists that balances high temperature's speedup and slowdown effects. On the processor we tested, the inflection voltage is between 0.9 V and 1 V. Around this point, temperature does not have a notable impact on circuit performance. Technology evolution has made more chip operating states fall below the inflection voltage (that is, in the temperature-inversion region). For the APU we tested, half of the GPU's P-states, ranging from 0.75 to 1.1 V, operate in the temperature-inversion region. Therefore, we must carefully inspect temperature inversion and take advantage of its speedup effect.
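A small sketch of how such PSM readings might be post-processed follows; the helper names and all numbers are invented for illustration and do not come from the measured APU.

  #include <stdio.h>

  static double psm_reference = 1000.0;   /* raw PSM count at 0.7 V, 300 MHz, 0 C, idle (made up) */

  static double normalize_psm(double raw_count) {
      return raw_count / psm_reference;    /* higher value = faster circuit */
  }

  /* A voltage point is in the temperature-inversion region if the circuit is
     faster (larger normalized PSM) when hot than when cold. */
  static int in_inversion_region(double psm_hot_raw, double psm_cold_raw) {
      return normalize_psm(psm_hot_raw) > normalize_psm(psm_cold_raw);
  }

  int main(void) {
      /* Illustrative readings in the spirit of Figure 4: hot > cold at 0.7 V,
         hot < cold at 1.1 V. */
      printf("0.7 V: %s\n", in_inversion_region(1150.0, 1000.0) ? "inversion" : "noninversion");
      printf("1.1 V: %s\n", in_inversion_region(1540.0, 1600.0) ? "inversion" : "noninversion");
      return 0;
  }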

Active Timing Margin's Undervolting Potential
We propose to harness temperature inversion's speedup effect by reclaiming the extra pipeline timing margin provided by the faster circuitry. Specifically, we propose to actively undervolt to shrink the extra timing margin, an approach similar in spirit to prior active-timing-margin management schemes.6 To explore the optimization space, we first estimate the undervolting potential using PSM measurement.

Figure 4. Circuit speed changes under different supply voltages and die temperatures. Temperature inversion happens below 0.9 V and is progressively stronger when voltage scales down.

Figure 5 illustrates the estimation process. The x-axis zooms into the low-voltage region between 0.6 and 0.86 V in Figure 4 to give a closer look at the margin-reduction opportunities. Temperature inversion's performance benefit becomes stronger at lower voltages, as reflected by the widening gap between 100°C and 0°C. At 0.7 V, the PSM difference between 100°C and 0°C represents the extra timing margin in units of inverter delays. In other words, it reflects how much faster the circuits run at a higher temperature by counting how many more inverters the faster circuit can switch successively in one cycle. To bring the faster circuit back to its original speed, supply voltage needs to be reduced such that, under a higher temperature, the PSM reads the same value. We estimate the voltage reduction potential with linear extrapolation. At 0.7 V, the extra margin translates to a 46-mV voltage reduction, equivalent to 5 percent undervolting potential. See our original paper for more complete extrapolation results.7

Figure 5. Temperature inversion happens below 0.9 V. It speeds up circuits, as reflected by larger PSM values under higher temperatures, and becomes stronger when voltage scales down.
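Written as code, the extrapolation amounts to dividing the hot-versus-cold PSM gap by the local PSM-versus-voltage slope. The function below is a sketch under that reading of Figure 5; the slope and readings are illustrative stand-ins, chosen only so the example lands near the 46-mV value quoted above.

  #include <stdio.h>

  /* Estimate how far VDD can drop while keeping the hot PSM reading equal to
     the cold reference at the same frequency (first-order extrapolation). */
  static double undervolt_potential_v(double psm_hot, double psm_cold,
                                      double dpsm_dv /* counts per volt */) {
      double extra_margin = psm_hot - psm_cold;   /* slack, in inverter delays */
      return extra_margin / dpsm_dv;
  }

  int main(void) {
      /* Illustrative values: a 2.3-count gap and a 50 counts/V slope give 46 mV. */
      double dv = undervolt_potential_v(12.3, 10.0, 50.0);
      printf("estimated undervolt at 0.7 V: %.0f mV\n", dv * 1000.0);
      return 0;
  }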

Temperature-Inversion States
Based on our temperature-inversion characterization, we propose Ti-states to construct a safe undervolting control loop that reclaims the extra timing margin provided by temperature inversion. In doing this, we must not introduce additional pipeline timing threats for reliability purposes, such as overly reducing timing margins or enlarging voltage droops caused by workload di/dt effects.

To guarantee timing safety, we use the timing margin measured at 0°C as the "golden" reference. We choose 0°C as the reference because it represents the worst-case operating condition under temperature inversion. Workloads that run safely at 0°C are guaranteed to pass under higher temperatures, because temperature inversion can make circuits run faster. Although 0°C rarely occurs in desktop, mobile, and datacenter applications, during the early design stage, timing margins should be set to tolerate these worst-case conditions. In industry, 0°C or below is used as a standard circuit design guideline.8 In critical scenarios, an even more conservative reference of -25°C is adopted.

Ti-states' undervolting goal is to maintain the same timing margin as 0°C when a chip is operating at a higher temperature. In other words, the voltage a Ti-state sets should always make the timing margins measured by the PSM match 0°C. Under this constraint, Ti-states undervolt to maximize power saving.

Algorithm 1 summarizes the methodology to construct Ti-states:

  procedure
   1: GET REFERENCE MARGIN
   2: set voltage and temperature to reference
   3: for each training workload do
   4:   workloadMargin <- PSM measurement
   5:   push RefMarginArr, workloadMargin
      return RefMarginArr

   6: procedure EXPLORE UNDERVOLT
   7: initVDD <- idle PSM extrapolation
   8: candidateVDDArr <- voltages around initVDD
   9: minErr <- MaxInt
  10: set exploration temperature
  11: for each VDD in candidateVDDArr do
  12:   set voltage to VDD
  13:   for each training workload do
  14:     workloadMargin <- PSM measurement
  15:     push TrainMarginArr, workloadMargin
  16:   err <- diff(RefMarginArr, TrainMarginArr)
  17:   if err < minErr then
  18:     minErr <- err
  19:     TiStateVDD <- VDD
      return TiStateVDD

The algorithm repeatedly stress tests the processor under different temperature-voltage environments with a set of workloads and produces a temperature-voltage table that can be stored in system firmware.9 At runtime, the system can index into this table to actively set the supply voltage according to runtime temperature measurement.

Algorithm 1 uses a set of workloads as the training set to first get a tentative temperature-voltage mapping. We then validate this mapping with another set of test workloads. During the training stage, Algorithm 1 first measures each workload's golden reference timing margin at 0°C using PSMs. The timing margin is recorded as the worst-case margin during the entire program run. Then, at each target temperature, Algorithm 1 selects four candidate voltages around the extrapolated voltage value as in Figure 5. The four candidate voltages are stepped through, and each workload's timing margin is recorded using PSMs. Finally, the timing margins at different candidate voltages are compared against the 0°C reference, and the voltage with the minimum PSM difference is taken as the target temperature's Ti-state voltage.

Table 1 shows the PSM difference compared with the 0°C reference across different candidate voltages for 20°C, 40°C, 60°C, and 80°C. The selected Ti-state voltages with the smallest difference are marked in the table. For instance, at 80°C, 0.6625 V is the Ti-state, which provides around 5 percent voltage reduction benefits.

We observed from executing Algorithm 1 that a Ti-state's undervolting decision is independent of the workloads. It achieves the same margin-reduction effects across all programs. This makes sense, because temperature inversion is a transistor-level effect and does not depend on other architecture or program behavior. This observation is good for Ti-states, because it justifies the applicability of the undervolting decision made from a small set of test programs to a wide range of future unknown workloads.
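For readers who prefer code, the exploration step of Algorithm 1 might look like the sketch below. The PSM and operating-point interfaces are assumptions (replaced here by a toy model so the example runs); only the structure—reference margins at 0°C, candidate voltages stepped through, smallest margin difference wins—follows the algorithm.

  #include <float.h>
  #include <math.h>
  #include <stdio.h>

  #define N_WORKLOADS 3

  /* Toy stand-in for hardware measurement: the worst-case PSM margin grows
     with temperature (inversion) and shrinks as VDD is lowered. */
  static double worst_psm(int workload, double vdd, double temp_c) {
      return 10.0 + 20.0 * (vdd - 0.70) + 0.0094 * temp_c - 0.1 * workload;
  }

  int main(void) {
      double ref[N_WORKLOADS];
      /* GET REFERENCE MARGIN: 0.7 V at the 0 C golden reference. */
      for (int w = 0; w < N_WORKLOADS; w++)
          ref[w] = worst_psm(w, 0.70, 0.0);

      /* EXPLORE UNDERVOLT at 80 C: candidate voltages around the extrapolated
         value; keep the one whose margins best match the 0 C reference. */
      const double candidates[] = {0.67500, 0.66875, 0.66250, 0.65625};
      double best_vdd = 0.70, min_err = DBL_MAX;
      for (int c = 0; c < 4; c++) {
          double err = 0.0;
          for (int w = 0; w < N_WORKLOADS; w++)
              err += fabs(worst_psm(w, candidates[c], 80.0) - ref[w]);
          if (err < min_err) { min_err = err; best_vdd = candidates[c]; }
      }
      printf("Ti-state voltage for 80 C: %.5f V\n", best_vdd);
      return 0;
  }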

Table 1. PSM difference compared with the 0°C reference across candidate voltage and temperature configurations.

  Candidate voltage (mV)   20°C     40°C     60°C     80°C     100°C
  693.75                   3.7%     —        —        —        —
  687.50                   2.2%*    —        —        —        —
  681.25                   8.4%     2.3%*    —        —        —
  675.00                   13.9%    5.3%     4.9%     —        —
  668.75                   —        9.5%     2.5%*    —        —
  662.50                   —        13.5%    6.5%     1.9%*    —
  656.25                   —        —        12.2%    5.6%     9.9%
  650.00                   —        —        —        9.3%     5.1%*

*Smallest PSM difference at each temperature; these are the selected Ti-state voltages.
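At runtime, the Ti-state mechanism is just the table lookup and regulator command described earlier. The sketch below illustrates that step using the voltages selected in Table 1; the sensor and regulator hooks are assumptions, stubbed here so the example is self-contained.

  #include <stdio.h>

  struct ti_state { double temp_c; double vdd; };

  /* Ti-state table for the 0.7 V / 300 MHz operating point (values from Table 1). */
  static const struct ti_state ti_table[] = {
      {  0.0, 0.70000 },   /* golden reference: no undervolting */
      { 20.0, 0.68750 },
      { 40.0, 0.68125 },
      { 60.0, 0.66875 },
      { 80.0, 0.66250 },
  };

  /* Assumed platform hooks, stubbed for illustration. */
  static double read_die_temp_c(void) { return 57.0; }
  static void vrm_set_voltage(double vdd) { printf("VRM set to %.5f V\n", vdd); }

  static void ti_state_update(void) {
      double t = read_die_temp_c();
      double vdd = ti_table[0].vdd;
      for (unsigned i = 0; i < sizeof ti_table / sizeof ti_table[0]; i++)
          if (t >= ti_table[i].temp_c)
              vdd = ti_table[i].vdd;   /* floor to the entry below t: the conservative choice */
      vrm_set_voltage(vdd);
  }

  int main(void) { ti_state_update(); return 0; }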

Figure 6 illustrates our observation. Going from 0.7 V to the 0.6625-V Ti-state at 80°C, the worst-case PSM readings closely track the baseline for all workloads. Overall, Ti-states can achieve 6 to 12 percent power savings on our measured chip across different temperatures.

Figure 6. Temperature inversion's speedup effect offers extra timing margin at 80°C, as reflected by the elevated workload worst-case PSM. Ti-state precisely reduces voltage to have the same timing margin as under 0°C and nominal voltage, which achieves better efficiency and guarantees reliability.

Long-Term Impact
As CMOS technology scales to its end, it is important to extract as much efficiency improvement opportunity as possible from the underlying transistors. Ti-state achieves this goal with active timing-margin management. Exploiting slack in timing margins to improve processor efficiency will be ubiquitous, just as P-states and C-states have helped reduce redundant power in the past. We believe the simplicity of Ti-states and the insights behind them render a wide range of applicability. Our work brings temperature inversion's value from device level to architects and system managers, and opens doors for other ideas to improve processor efficiency.

Wide Range of Applicability
Ti-state is purely based on the transistor's temperature-inversion effect and is independent of other factors. Temperature inversion is an opportunity offered by technology scaling, which makes it a free meal for computer architects. Therefore, Ti-state is applicable to chips made with today's advanced technologies, including bulk CMOS, FD-SOI, and FinFET (as we show in our original paper7). Many, if not all, processor architectures can benefit from it, whether they're CPUs, GPUs, FPGAs, or other accelerators.

Ti-state's design is succinct. Its main components are on-chip timing margin sensors, temperature sensors, and system firmware that stores Ti-state tables. A Ti-state's runtime overhead is a table lookup and a voltage regulator module's set command, which are minimal. Because chip temperature changes over the course of several seconds, a Ti-state's feedback loop has no strict latency requirement, which makes it easy to design, implement, and test.

Implications at Circuit, Architecture, and System Level
Our study conducted on an AMD A10-8700P processor focuses on a single chip made in planar CMOS technology. Going beyond current technology and across system scale, Ti-states will have a bigger impact in the future.

Figure 7. Ti-state temperature and voltage control: two loops work in synergy to minimize power. Loop 1 is a fast control loop that uses a Ti-state table to keep adjusting voltage in response to silicon temperature variation. Loop 2 is a slow control loop that sets the optimal temperature based on workload steady-state dynamic power profile. (Test-time steps shown in the figure: 1. per-part PSM characterization at different (V, T) points; 2. undervolt validation at different temperatures; 3. fusing the (V, F, T) table into firmware/OS.)
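As a sketch of what loop 2 optimizes, the toy program below sweeps the temperature setpoint and keeps whichever minimizes combined chip and fan power; the power models are invented placeholders, not measured data.

  #include <stdio.h>

  /* Invented power models: undervolting at higher temperature lowers chip power
     until leakage pushes back, while a hotter setpoint lets the fan slow down. */
  static double chip_power(double t) { return 10.0 - 0.03 * t + 0.00073 * t * t; }
  static double fan_power(double t)  { return 6.0 - 0.05 * t; }

  int main(void) {
      double best_t = 40.0, best_p = 1e30;
      for (double t = 40.0; t <= 80.0; t += 5.0) {   /* candidate setpoints */
          double p = chip_power(t) + fan_power(t);
          if (p < best_p) { best_p = p; best_t = t; }
      }
      /* Loop 1 (not shown) keeps applying the Ti-state voltage for whatever
         temperature this setpoint produces, as in the earlier lookup sketch. */
      printf("chosen setpoint: %.0f C (total %.2f arbitrary power units)\n", best_t, best_p);
      return 0;
  }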

Significance for FinFET and FD-SOI. FinFET and FD-SOI are projected to have stronger and more common temperature-inversion effects.10,11 In these technologies, Ti-states have broader applicability and more benefits. Furthermore, the low-leakage characteristics of these technologies promise other opportunities for a tradeoff between temperature and power.

In our original paper, we provide a detailed FinFET and FD-SOI projection analysis based on measurements taken at 28-nm bulk CMOS. We find the 10-times leakage reduction capabilities make these technologies enjoy a higher operating temperature, because Ti-states reduce more VDD under higher temperatures. The optimal temperature for power is usually between 40°C and 60°C, depending on workloads and device type. Thus, Ti-states not only reduce chip power itself for FinFET and FD-SOI but also relieve the burden of the cooling system.

System-level thermal management. Datacenters and supercomputers strive to make room temperature low at the cost of very high power consumption and cost. A tradeoff between cooling power and chip leakage power exists in this setting.12 Ti-states add a new perspective to this problem. First, we find that high temperature does not worsen timing margins, but actually preserves processor timing reliability because of temperature inversion. Second, Ti-states reduce power under higher temperatures, mitigating processor power cost. For FinFET and FD-SOI, the processor might prefer high temperatures around 60°C to save power, which further provides room for cooling power reduction.

Figure 7 shows a control mechanism that we conceived to synergistically reduce chip and cooling power. The test-time procedure and loop 1 are what the Ti-state achieves. In addition, loop 2 takes cooling system power into consideration and jointly optimizes fan and chip power together. Overall, temperature inversion and Ti-states enable an optimization space involving cooling power, chip power, and chip reliability.

Opportunity for near-threshold computing. Our measurement on a real chip shows that temperature inversion is stronger at lower voltages, reaching up to 10 percent VDD reduction potential for a Ti-state at 0.6 V on our 28-nm chip. In near-threshold conditions as low as 0.4 V, temperature inversion will have a much stronger effect and will offer much larger benefits. In addition to power reduction, a Ti-state can be employed to boost the performance of near-threshold chips by overclocking directly to exploit the extra margin. Extrapolation similar to Figure 4 shows overclocking potential is between 20 and 50 percent with the help of techniques that mitigate di/dt effects.5

Temperature inversion offers a new avenue for improving processor efficiency. On the basis of detailed measurements, our article presents a comprehensive analysis of how temperature inversion can alter the way we do power management today. Through the introduction of Ti-states, we show that active timing-margin management can be successfully applied to exploit temperature inversion. Applying such optimizations in the future will likely become even more important as technology scaling continues. We envision future work that draws on Ti-states to enhance computing systems across the stack and at a larger scale. MICRO

References
1. C. Park et al., "Reversal of Temperature Dependence of Integrated Circuits Operating at Very Low Voltages," Proc. Int'l Electron Devices Meeting (IEDM), 1995, pp. 71–74.
2. D. Wolpert and P. Ampadu, "Temperature Effects in Semiconductors," Managing Temperature Effects in Nanoscale Adaptive Systems, Springer, 2012, pp. 15–33.
3. W. Zhao and Y. Cao, "New Generation of Predictive Technology Model for Sub-45 nm Early Design Exploration," IEEE Trans. Electron Devices, vol. 53, no. 11, 2006, pp. 2816–2823.
4. B. Munger et al., "Carrizo: A High Performance, Energy Efficient 28 nm APU," J. Solid-State Circuits (JSSC), vol. 51, no. 1, 2016, pp. 1–12.
5. A. Grenat et al., "Adaptive Clocking System for Improved Power Efficiency in a 28nm x86-64 Microprocessor," Proc. IEEE Int'l Solid-State Circuits Conf. (ISSCC), 2014, pp. 106–107.
6. C.R. Lefurgy et al., "Active Management of Timing Guardband to Save Energy in POWER7," Proc. 44th IEEE/ACM Int'l Symp. Microarchitecture (MICRO), 2011, pp. 1–11.
7. Y. Zu, "Ti-states: Processor Power Management in the Temperature Inversion Region," Proc. 49th Ann. IEEE/ACM Int'l Symp. Microarchitecture (MICRO), 2016; doi:10.1109/MICRO.2016.7783758.
8. Guaranteeing Silicon Performance with FPGA Timing Models, white paper WP-01139-1.0, Altera, Aug. 2010.
9. S. Sundaram et al., "Adaptive Voltage Frequency Scaling Using Critical Path Accumulator Implemented in 28nm CPU," Proc. 29th Int'l Conf. VLSI Design and 15th Int'l Conf. Embedded Systems (VLSID), 2016, pp. 565–566.
10. W. Lee et al., "Dynamic Thermal Management for FinFET-Based Circuits Exploiting the Temperature Effect Inversion Phenomenon," Proc. Int'l Symp. Low Power Electronics and Design (ISLPED), 2014, pp. 105–110.
11. E. Cai and D. Marculescu, "TEI-Turbo: Temperature Effect Inversion-Aware Turbo Boost for FinFET-Based Multi-core Systems," Proc. IEEE/ACM Int'l Conf. Computer-Aided Design (ICCAD), 2015, pp. 500–507.
12. W. Huang et al., "TAPO: Thermal-Aware Power Optimization Techniques for Servers and Data Centers," Proc. Int'l Green Computing Conf. and Workshops (IGCC), 2011; doi:10.1109/IGCC.2011.6008610.

Yazhou Zu is a PhD student in the Department of Electrical and Computer Engineering at the University of Texas at Austin. His research interests include resilient and energy-efficient processor design and management. Zu received a BS in microelectronics from Shanghai Jiao Tong University of China. Contact him at [email protected].

Wei Huang is a staff researcher at Advanced Micro Devices Research, where he works on energy-efficient processors and systems. Huang received a PhD in electrical and computer engineering from the University of Virginia. He is a member of IEEE. Contact him at [email protected].

Indrani Paul is a principal member of the technical staff at Advanced Micro Devices, where she leads the Advanced Power Management group, which focuses on innovating future power and thermal management approaches, system-level power modeling, and APIs. Paul received a PhD in electrical and computer engineering from the Georgia Institute of Technology. Contact her at [email protected].

Vijay Janapa Reddi is an assistant professor in the Department of Electrical and Computer Engineering at the University of Texas at Austin. His research interests span the definition of computer architecture, including software design and optimization, to enhance mobile quality of experience and improve the energy efficiency of high-performance computing systems. Janapa Reddi received a PhD in computer science from Harvard University. Contact him at [email protected].


AN ENERGY-AWARE DEBUGGER FOR INTERMITTENTLY POWERED SYSTEMS

Alexei Colin, Carnegie Mellon University
Graham Harvey, Disney Research Pittsburgh
Alanson P. Sample, Disney Research Pittsburgh
Brandon Lucia, Carnegie Mellon University

DEVELOPMENT AND DEBUGGING SUPPORT IS A PREREQUISITE FOR THE ADOPTION OF INTERMITTENTLY OPERATING ENERGY-HARVESTING COMPUTERS. THIS WORK IDENTIFIES AND CHARACTERIZES INTERMITTENCE-SPECIFIC DEBUGGING CHALLENGES THAT ARE UNADDRESSED BY EXISTING DEBUGGING SOLUTIONS. THIS WORK ADDRESSES THESE CHALLENGES WITH THE ENERGY-INTERFERENCE-FREE DEBUGGER (EDB), THE FIRST DEBUGGING SOLUTION FOR INTERMITTENT SYSTEMS. THIS ARTICLE DESCRIBES EDB'S CODESIGNED HARDWARE AND SOFTWARE IMPLEMENTATION AND SHOWS ITS VALUE IN SEVERAL DEBUGGING TASKS ON A REAL RF-POWERED ENERGY-HARVESTING DEVICE.

Energy-harvesting devices are embedded computing systems that eschew tethered power and batteries by harvesting energy from radio waves,1,2 motion,3 temperature gradients, or light in the environment. Small form factors, resilience to harsh environments, and low-maintenance operation make energy-harvesting computers well-suited for next-generation medical, industrial, and scientific sensing and computing applications.4

The power system of an energy-harvesting computer collects energy into a storage element (that is, a capacitor) until the buffered energy is sufficient to power the device. Once powered, the device can operate until energy is depleted and power fails. After the failure, the cycle of charging begins again. These charge–discharge cycles power the system intermittently, and consequently, software that runs on an energy-harvesting device executes intermittently.5 In the intermittent execution model, programs are frequently, repeatedly interrupted by power failures, in contrast to the traditional continuously powered execution model, in which programs are assumed to run to completion. Every reboot induced by a power failure clears volatile state (such as registers and memory), retains nonvolatile state (such as ferroelectric RAM), and transfers control to some earlier point in the program.

Intermittence makes software difficult to write and understand. Unlike traditional systems, the power supply of an energy-harvesting computer changes high-level software behavior, such as control flow and memory consistency.5,6 Reboots complicate a program's possible behavior, because they are implicit discontinuities in the program's control flow that are not expressed anywhere in the code. A reboot can happen at any point in a program and cause control to flow unintuitively back to a previous point in the execution.
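To make the charge–discharge cycle concrete, here is a minimal simulation sketch; the thresholds, energy amounts, and function names are invented for illustration and do not model any particular harvester.

  #include <stdio.h>

  #define V_ON  2.4   /* assumed turn-on threshold of the storage capacitor (V) */
  #define V_OFF 1.8   /* assumed brown-out threshold (V) */

  static double vcap = 0.0;                       /* storage-capacitor voltage */
  static double harvest(void) { return 0.02; }    /* trickle in per time step  */
  static double consume(void) { return 0.15; }    /* drawn while computing     */
  static void program_step(void) { /* one unit of application work */ }

  int main(void) {
      int reboots = 0;
      for (int t = 0; t < 2000; t++) {
          vcap += harvest();
          if (vcap < V_ON) continue;              /* still charging: device off */
          while (vcap > V_OFF) {                  /* powered: run to brown-out  */
              program_step();
              vcap += harvest() - consume();
          }
          reboots++;                              /* volatile state lost here   */
      }
      printf("completed %d charge-discharge cycles\n", reboots);
      return 0;
  }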

116 Published by the IEEE Computer Society 0272-1732/17/$33.00 c 2017 IEEE The previous point could be the beginning of the program, a previous checkpoint,5,7 or ataskboundary.6 Today, devices that execute intermittently are a mixture of conventional, volatile microcontroller architectures and non- volatile structures. In the future, alternative architectures based on nonvolatile structures may simplify some aspects of the execution model, albeit with lower performance and energy efficiency.8 Intermittence can cause correct software to misbehave. Intermittence-induced jumps back to a prior point in an execution inhibit forward progress and could repeatedly execute code that should not be repeated. Intermit- (a) Energy monitoring tence can also leave memory in an inconsis- and manipulation tent state that is impossible in a continuously Monitor Analog buffer powered execution.5 These intermittence- Charge Capacitor discharge Diode RF In related failure modes are avoidable with care- Energy

Interrupt harvester fully written code or specialized system sup- MCU Code markers MCU 5–7,9,10 \ port. Unaddressed, these failure modes n Monitor Rx/Tx I/O device #1 represent a new class of intermittence bugs … … Digital that manifest only when executing on an level shifter DC Monitor Rx/Tx I/O device #d + intermittent power source. – EH device

To debug an intermittently operating EDB (b) Program event monitoring program, a programmer needs the ability to I/O monitoring monitor system behavior, observe failures, and examine internal program state. With Turn on Brownout Checkpoint Wild pointer the goal of supporting this methodology, prior work on debugging for continuously Threshold for powered devices has recognized the need to voltage operation Harvested minimize resources required for tracing11 Time and reduce perturbation to the program (c) Intermittent computation under test.12 A key difference on energy- harvesting platforms is that interference Figure 1. The Energy-Interference-Free Debugger (EDB) is an energy- with a device’s energy level could perturb its interference-free system for monitoring and debugging energy-harvesting intermittent execution behavior. Unfortu- devices. (a) Photo. (b) Architecture diagram. (c) The charge–discharge cycle nately, existing tools, such as Joint Test makes computation intermittent. Action Group (JTAG) debuggers, require a device to be powered, which hides intermit- continuously powered devices are not effec- tence bugs. Programmers face an unsatisfying tive for energy-harvesting devices, because dilemma:touseadebuggertomonitorthe they interfere with the target’s power supply. system and never observe a failure, or to run Our first contribution is a hardware device without a debugger and observe the failure, that connects to a target energy-harvesting but without the means to probe the system to device with the ability to monitor and manip- understand the bug. ulate its energy level, but without permitting This article identifies the key debugging any significant current to flow between the functionality necessary to debug intermittent debugger and the target. programs on energy-harvesting platforms and Second, we observe that basic debugging presents the Energy-Interference-Free Debug- techniques, such as assertions, printf trac- ger (EDB), a hardware–software platform ing, and LED toggling, are not usable on that provides that functionality (see Figure 1). intermittently powered devices without sys- First, we observe that debuggers designed for tem support. Our second contribution is the ...... MAY/JUNE 2017 117 ...... TOP PICKS

Figure 2 contrasts a continuous and an intermittent execution of the following example source, which maintains a doubly linked list in nonvolatile memory:

  __NV list_t list
  main() {
    init_list(list)
    while (true) {
      __NV elem e
      select(e)
      remove(list, e)
      update(e)
      append(list, e)
    }
  }

  append(list, e) {
    e->next = NULL
    e->prev = list->tail
    list->tail->next = e
    list->tail = e
  }

  remove(list, e) {
    e->prev->next = e->next
    if (e == list->tail) {
      tail = e->prev
    } else {
      e->next->prev = e->prev
    }
  }

In the figure's traces, the continuous execution always runs append to completion, whereas in the intermittent execution a checkpoint is taken at the top of the while loop and power fails inside append—after e->next is set to NULL but before list->tail is updated—so execution reboots back to the checkpoint.

The example program on the left illustrates how intermittence perturbs a program's execution. The code manipulates a linked list in nonvolatile memory using append and remove functions. The continuous execution completes the code sequentially. The intermittent execution, however, is not sequential. Instead, the code captures a checkpoint at the top of the while loop, then proceeds until power fails at the indicated point. After the reboot, execution resumes from the checkpoint. In some cases, an execution resumed from the checkpoint mimics a continuously powered execution. However, intermittence can also cause misbehavior stemming from an intermittence bug in the code.

The illustrated intermittent execution of the example code exhibits incorrect behavior that is impossible in a continuous execution. The execution violates the precondition assumed by remove that only the tail's next pointer should be NULL. The reboot interrupts append before it can make node e the list's new tail, but after its next pointer is set to NULL. When execution resumes at the checkpoint, it attempts to remove node e again. The conditional in remove confirms that e is not the tail, then dereferences its next pointer (which is NULL). The NULL next pointer makes e->next->prev a wild pointer that, when written, leads to undefined behavior.

Figure 2. An intermittence bug. The software executes correctly with continuous power, but incorrectly in an intermittent execution.
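The following standalone sketch mimics the checkpoint-and-reboot behavior in Figure 2, using setjmp/longjmp as a stand-in for a checkpointing runtime: statements after the checkpoint are re-executed after each simulated power failure, so nonvolatile side effects can repeat. It illustrates the execution model only; it is not EDB code or the paper's system.

  #include <setjmp.h>
  #include <stdio.h>

  static jmp_buf checkpoint;     /* stand-in for a persisted checkpoint image   */
  static int nv_appended = 0;    /* "__NV": survives a simulated power failure  */
  static int reboots = 0;

  static void append_once(void) { nv_appended++; }    /* nonvolatile side effect */

  int main(void) {
      setjmp(checkpoint);                 /* checkpoint at the top of the loop   */
      for (int i = 0; i < 3; i++) {       /* volatile loop counter: reset on reboot */
          append_once();
          if (i == 1 && reboots < 2) {    /* power fails mid-iteration           */
              reboots++;
              longjmp(checkpoint, 1);     /* "reboot": resume from the checkpoint */
          }
      }
      printf("appends performed: %d (a continuous run would perform 3)\n", nv_appended);
      return 0;
  }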

EDB software system, which was codesigned because the behavior of software on an inter- with EDB’s hardware to make debugging mittent system is closely linked to its power primitives that are useful for intermittently supply. Figure 2 illustrates the undesirable powered devices, including energy breakpoints consequences of disregarding this link and keep-alive assertions. EDB addresses between the software and the power system. debugging needs unique to energy-harvesting The code has an intermittence bug that leads devices, with novel primitives for selectively to memory corruption only when the device powering spans of code and for tracing the runs on harvested energy. device’s energy level, code events, and fully Debugging intermittence bugs using exist- decoded I/O events. The whole of EDB’s ing tools is virtually impossible due to energy capabilities is greater than the sum of capabil- interference from these tools. JTAG debug- ities of existing tools, such as a JTAG debugger gers supply power to the device under test and an oscilloscope. Moreover, EDB is sim- (DUT), which precludes observation of a pler to use and far less expensive. We apply realistically intermittent execution, such as EDB’s capabilities to diagnose problems on the execution on the left in Figure 2. Even real energy-harvesting hardware in a series of JTAG, with a power rail isolator completely case studies in our evaluation. masks intermittent behavior, because the protocol requires the target to be powered Intermittence Bugs and Energy Interference throughout the debugging session. An oscillo- An intermittent power source complicates scope can directly observe and trace a DUT’s understanding and debugging of a system, power system and lines, but a scope cannot ...... 118 IEEE MICRO for(…){ sense(&s) ok=check(s) if(ok){ i++ data[i]=s Capabilities Manipulate Measure Trace program Trace I/O (a) energy level energy level events events

(b) Debugging primitives (the figure groups them under passive and active modes): energy logging, event logging, I/O logging, assertions, energy guards/instrumentation, code breakpoints, energy breakpoints, code/energy breakpoints, and interactive debugging.

(c) libEDB API: assert(expr); break_point(id) and watch_point(id); energy_guard(begin|end); printf(fmt, ...). Debug console commands: charge|discharge energy_level; break|watch en|dis id [energy_level]; trace {energy, iobus, rfid, watch_points}; read|write address [value].

Figure 3. EDB’s features support debugging tasks and developer interfaces. (a) Hardware and software capabilities. (b) Debugging primitives. (c) API and debug console commands. observe internal software state, which limits ging approaches. In this section, we describe its value for debugging. EDB’s functionality and its implementation Debugging code added to trace and react in codesigned hardware and software. to certain program events—such as toggling Figure 3 provides an overview of EDB. LEDs, streaming events to a universal asyn- The capabilities of EDB’s hardware and chronous receiver/transmitter (UART), or in- software (Figure 3a) support EDB’s debug- memory logging—has a high energy cost, and ging primitives (Figure 3b). The hardware such instrumentation can change program electrically isolates the debugger from the behavior. For example, activating an LED to target. EDB has two modes of operation: indicate when the Wireless Identification and passive mode and active mode. In passive Sensing Platform (WISP)2 is actively executing mode, the target’s energy level, program increases its total current draw by five times, events, and I/O can be monitored unobtru- from around 1 mA to more than 5 mA. Fur- sively. In active mode, the target’s energy level thermore, in-code instrumentation is limited and internal program state (such as memory) by scarcity of resources, such as nonvolatile can be manipulated. We combine passive- and memory to store the log, and an analog-to- active-mode operation to implement energy- digital converter (ADC) channel for measure- interference-free debugging primitives, includ- ments of the device’s energy level. ing energy and event tracing, intermittence- Energy interference and the lack of visibil- aware breakpoints, energy guards for instru- ity into intermittent executions make prior mentation, and interactive debugging. approaches to debugging inadequate for intermittently powered devices. Passive-Mode Operation EDB’s passive mode of operation lets devel- Energy-Interference-Free Debugging opers stream debugging information to a EDB is an energy-interference-free debug- host workstation continuously in real time, ging platform for intermittent devices that relying on the three rightmost components in addresses the shortcomings of existing debug- Figure 3a. Important debugging streams that ...... MAY/JUNE 2017 119 ...... TOP PICKS

are available through EDB are the device’s EDB’s energy compensation mechanism energy level, I/O events on wired buses, is implemented using two GPIO pins con- decoded messages sent via RFID, and program nected to the target capacitor, an ADC, and events marked in application code. A major a software control loop. To prevent energy advantage of EDB is its ability to gather many interference by these components during pas- debugging streams concurrently, allowing the sive mode, the circuit includes a low-leakage developer to correlate streams (for example, keeper diode and sets the GPIO pins to high- relating changes in I/O or program behavior impedance mode. To charge the target to a with energy changes). Correlating debugging desired voltage level, EDB keeps the source streamsisessential,butdoingsoisdifficultor pin high until EDB’s ADC indicates that the impossible using existing techniques. Another target’s capacitor voltage is at the desired key advantage of EDB is that data is collected level. To discharge, the drain pin is kept low externally without burdening the target or per- to open a path to ground through a resistor, turbing its intermittent behavior. until the target’s capacitor voltage reaches the Monitoring signals in the target’s circuit desired level. Several of the debugging primi- requires electrical connections between the tives presented in the next section are built debugger and the target, and EDB ensures using this energy-compensation mechanism. that these connections do not allow signifi- cant current exchange, which could interfere EDB Primitives with the target’s behavior. To measure the tar- Using the capabilities described so far, EDB get energy level, EDB samples the analog creates a toolbox of energy-interference-free voltage from the target’s capacitor through an debugging primitives. EDB brings to intermit- operational amplifier buffer. To monitor digi- tent platforms familiar debugging techniques tal communication and program events with- that are currently confined to continuously out energy interference, EDB connects to powered platforms, such as assertions and wired buses—including Inter-Integrated Cir- printf tracing. New intermittence-aware cuit (I2C), Serial Peripheral Interface (SPI), primitives, such as energy guards, energy RF front-end general-purpose I/Os (GPIOs), breakpoints, and watch points, are introduced and watch point GPIOs—through a digital to handle debugging tasks that arise only on level-shifter. As an external monitor, EDB intermittently powered platforms. Each prim- can collect and decode raw I/O events, even itive is accessible to the end user through two if the target violates the I/O protocol due to complimentary interfaces: the API linked into an intermittence bug. the application and the console commands on a workstation computer (see Figure 3c). Active-Mode Operation EDB’s active mode frees debugging tasks Code and energy breakpoints. EDB imple- from the constraint of the target device’s small ments three types of breakpoints. A code energy store by compensating for energy con- breakpoint is a conventional breakpoint that sumed during debugging. In active mode, the triggers at a certain code point. An energy programmer can perform debugging tasks breakpoint triggers when the target’s energy that require more energy than a target could level is at or below a specified threshold. A ever harvest and store—for example, complex combined breakpoint triggers when a certain invariant checks or user interactions. 
EDB code point executes and the target device’s has an energy compensation mechanism that energy level is at or below a specified thresh- measures and records the energy level on the old. Breakpoints conditioned on energy level target device before entering active mode. can initiate an interactive debugging session While the debugging task executes, EDB sup- precisely when code is likely to misbehave plies power to the target. After performing due to energy conditions—for example, just the task, EDB restores the energy level to the as the device is about to brownout. level recorded earlier. Energy compensation permits costly, arbitrary instrumentation, Keep-alive assertions. EDB provides support while ensuring that the target has the behav- for assertions on intermittent platforms. ior of an unaltered, intermittent execution. When an assertion fails, EDB immediately ...... 120 IEEE MICRO tethers the target to a continuous power sup- our prototype is available (http://intermittent ply to prevent it from losing state by exhaust- .systems). The purpose of our evaluation is ing its energy supply. This keep-alive feature twofold. First, we characterize potential sour- turns what would have to be a post-mortem ces of energy interference and show that EDB reconstruction of events into an investigation is free of energy interference. Second, we use a on a live device. The ensuing interactive series of case studies conducted on a real debugging session for a failed assert includes energy-harvesting system to show that EDB the entire live target address space and I/O supports monitoring and debugging tasks that busestoperipherals.IncontrasttoEDB’s are difficult or impossible without EDB. keep-alive assertions, traditional assertions are Our target device is a WISP2 powered by ineffective in intermittent executions. After a radio waves from an RFID reader. The WISP traditional assertion fails, the device would has a 47 lF energy-storage capacitor and an pause briefly, until energy was exhausted, then active current of approximately 0.5 mA at 4 restart, losing the valuable debugging infor- MHz. We evaluated EDB using several test mation in the live device’s state. applications, including the official WISP 5 RFID tag firmware and a machine-learning- Energy guards. Using its energy compensa- based activity-recognition application used in tion mechanism, EDB can hide the energy prior work.5,6 cost of arbitrary code enclosed within an energy guard. Code within an energy guard Energy Interference executes on tethered power. Code before and EDB’s edge over existing debugging tools is after an energy-guarded region executes as its ability to remain isolated from an inter- though no energy was consumed by the mittently operating target in passive mode energy-guarded region. Without energy cost, and its ability to create an illusion of an instrumentation code becomes nondisruptive untouched target energy reservoir in active and therefore useful on intermittent plat- mode. Our first experiment concretely dem- forms. Two especially valuable forms of instru- onstrates the energy interference of a tradi- mentation that are impossible without EDB tional debugging technique, when applied to are complex data structure invariant checks an intermittently operating system. The and event tracing. 
EDB’s energy guards allow measurements in Table 1 demonstrate the code to check data invariants or report appli- impact on program behavior of execution cation events via I/O (such as printf), the tracing using printf over UART without high energy cost of which would normally EDB. Without EDB, the energy cost of the deplete the target’s energy supply and prevent print statement significantly changes the iter- forward progress. ation success rate—that is, the fraction of iterations that complete without a power fail- Interactive debugging. An interactive debug- ure. Next, we show with data that EDB is ging session with EDB can be initiated by a effectively free of energy interference both in breakpoint, an assertion, or a user interrupt, passive- and active-mode operation. and allows observation and manipulation of In passive mode, current flow between the target’s memory state and energy level. EDB and the target through the connections Using charge–discharge commands, the devel- in Figure 1 can inadvertently charge or dis- oper can intermittently execute any part of a charge the target’s capacitor. We measured the program starting from any energy level, assess- maximum possible current flow over each ing the behavior of each charge–discharge connection by driving it with a source meter cycle. During passive-mode debugging, the and found that the aggregate current cannot EDB console delivers traces of energy state, exceed 0.85 lA in the worst case, representing watch points, I/O events, and printf output. just 0.2 percent of the target microcontroller’s typical active-mode current. Evaluation In active mode, energy compensation We built a prototype of EDB, including the requires EDB to save and restore the voltage of circuit board in Figure 1 and software that the target’s storage capacitor, and any discrep- implements EDB’s functionality. A release of ancy between the saved and restored voltage ...... MAY/JUNE 2017 121 ...... TOP PICKS

Table 1. Cost of debug output and its impact on the activity-recognition application’s behavior

Instrumentation Iteration success Iteration cost method rate (%) Energy (%*) Time (ms)

No print 87 3.0 1.1 UART printf 74 5.3 2.1 EDB...... printf 82 3.4 4.7

*Energy cost as percentage of 47 lF storage capacity.

represents energy interference. Using an oscil- Figure 4a. A conventional assertion is loscope, we measured the discrepancy between unhelpful: after the assertion fails, the the target capacitor voltage saved and restored target drains its energy supply and by EDB. Over 50 trials, the average voltage the program restarts, losing the con- discrepancy was just 4 percent of the target’s text of the failure. In contrast, EDB’s energy-storage capacity, with most error stem- intermittence-aware, keep-alive assert ming from our limited-precision software halts the program immediately when control loop. the list invariant is violated, powers the target, and opens an interactive Debugging Capabilities debugging session. We now illustrate the new capabilities that EDB brings to the development of intermit- Interactive inspection of target memory tent software by applying EDB in case studies using EDB’s commands reveals that the tail to debugging tasks that are difficult to resolve pointer points to the penultimate element, without EDB. not the actual tail. The inconsistency arose when a power failure interrupted append. Detecting memory corruption early. We eval- In the absence of the keep-alive assert, the uated how well EDB’s keep-alive assertions program would proceed to read this inconsis- help diagnose memory corruption that is not tent state, dereference a null pointer, and reproducible in a conventional debugger. write to a wild pointer.  Application. The code in Figure 4a Instrumenting code with consistency checks. maintains a doubly linked list in non- On intermittently powered platforms, the volatile memory. On each iteration of energy overhead of instrumentation code can the main loop, a node is appended to render an application nonfunctional by pre- the list if the list is empty; otherwise, venting it from making any forward progress. a node is removed from the list. The In this case study, we demonstrate how an node is initialized with a pointer to a application can be instrumented with an buffer in volatile memory that is later invariant check of arbitrary energy cost using overwritten. EDB’s energy guards.  Symptoms. After running on harvested energy for some time, the GPIO pin  Application. The code in Figure 4b indicating main loop progress stops generates the Fibonacci sequence toggling. After the main loop stops, numbers and appends each to a non- normal behavior never resumes, even volatile, doubly linked list. Each iter- after a reboot; thus, the device must ation of the main loop toggles a be decommissioned, reprogrammed, GPIO pin to track progress. The pro- and redeployed. gram begins with a consistency check  Diagnosis. To debug the list, we assert that traverses the list and asserts that that the list’s tail pointer must point the pointers and the Fibonacci value to the list’s last element, as shown in in each node are consistent...... 122 IEEE MICRO  Symptoms. Without the invariant check, the application silently produces an Application code Debug console

inconsistent list. With the invariant 1: init_list(list) > run tail node check, the main loop stops executing 2: while (1) Interrupted: after the list grows large. The oscillo- 3: node = list->head ASSERT line: 8 scope trace in Figure 4c shows an early 4: while (node->next != NULL) Vcap = 1.9749 5: node = node->next 0xAA 0xBB *> print node charge cycle when the main loop exe- 6: assert(list->tail == node) 0xAA10: 0x00BB cutes (left) and a later one when it does 7: if (node == list->head) *> print list->tail not (right). 8: init_node(new_node) 0xAA20: 0x00AA 9: append(list, new_node)  Diagnosis. The main loop stops execut- *> print list->tail.next 10: else ing because once the list is too long, the 0xAA30: 0x00BB 11: remove(list, node, &bufptr) consistency check consumes all of the 12: memset(bufpter, 0x42, BUFSZ) target’s available energy. Once reached, this hung state persists indefinitely. An (a) EDB energy guard allows the inclusion 1: main() of the consistency check without break- 2: energy_guard_begin() ing the application’s functionality (see 3: for (node in list) 4: assert(node->prev->next == node ==->next->prev) Figure 4b). The effect of the energy 5: assert(node->prev->fib + node->fib == node->next->fib) guard on target energy state is shown in 6: assert(list->tail == node) 7: energy_guard_end() Figure 4d. The energy guard provides 8: while(1) tethered power for the consistency 9: append_fibonacci_node(list) check, and the main loop gets the same (b) amount of energy in early charge– 3.0 discharge cycles when the list is short (left) and in later ones when the list is 2.5 V cap long (right). On intermittent power, 2.0 we observed invariant violations in V brownout

several experimental trials. (V) Voltage Check 1.5 Main loop Instrumentation and consistency check- ing are an essential part of building a reliable 1.0 application. These techniques are inaccessi- 20 30 40 50 60 70 80 +20 +30 +40 +50 +60 +70 +80 ble to today’s intermittent systems because (c) Time (ms) the cost of runtime checking and analysis is 3.0 arbitrary and often high. EDB brings instru- Tethered Tethered power power mentation and consistency checking to inter- 2.5 V cap mittent devices. 2.0 V brownout

Voltage (V) Voltage Check Tracing program events and RFID messages. 1.5 Extracting intermediate results and events Main loop 12 3 4 from the executing program using JTAG or 1.0 UART is valuable, but it often interferes with 20 30 40 50 60 70 80 +20 +30 +40 +50 +60 +70 +80 a target’s energy level and changes application (d) Time (ms) behavior. Moreover, communication stacks on energy-harvesting devices are difficult to Figure 4. Debugging intermittence bugs with EDB. (a) An application with a debug without simultaneous visibility into memory-corrupting intermittence bug, diagnosed using EDB’s intermittence- the device’s sent and received packet stream aware assert (left) and interactive console (right). (b) An application and energy state. instrumented with a consistency check of arbitrary energy cost using EDB’s In Table 1, we traced the activity recogni- energy guards. Oscilloscope trace of execution (c) without the energy guard tion application using EDB’s energy-interfer- and (d) with the energy guard. Without the energy guard, the check and main ence-free printf and watch points. In this loop both execute at first, but only the check executes in later discharge section, we trace messages in an RFID com- cycles. With an energy guard, the check executes on tethered power from munication stack using EDB’s I/O tracer. We instant 1 to 2 and 3 to 4, and the main loop always executes...... MAY/JUNE 2017 123 ...... TOP PICKS

used EDB to collect RFID message identifiers We created EDB because we found from the WISP RFID tag firmware, along energy-harvesting devices to be among the with target energy readings. From the collected least accessible platforms for research, requir- trace, we found that in our lab setup the appli- ing each researcher to reinvent ad hoc techni- cation responded 86 percent of the time for an ques for troubleshooting each device. EDB average of 13 replies per second. To produce makes intermittently powered platforms such a mixed trace of I/O and energy using accessible to a wider research audience and existing equipment, the target would have to helps establish a new research area surround- be burdened with logging duties that exceed ing intermittent computing. MICRO the computational resources, given the already high cost of message decoding and response...... nergy-harvesting technology extends the References E reachofembeddeddevicesbeyondtra- 1. S. Gollakota et al., “The Emergence of RF- ditional sensor network nodes by eliminating Powered Computing,” Computer, vol. 47, the constraints imposed by batteries and wires. no. 1, 2014, pp. 32–39. However, developing software for energy- 2. A.P. Sample et al., “Design of an RFID- harvesting devices is more difficult than tra- Based Battery-Free Programmable Sensing ditional embedded development, because of Platform,” IEEE Trans. Instrumentation and surprising behavior that arises when software Measurement, vol. 57, no. 11, 2008, pp. executes intermittently. Debugging intermit- 2608–2615. tently executing software is particularly 3. P. Mitcheson et al., “Energy Harvesting challenging because of a new class of inter- From Human and Machine Motion for Wire- mittence bugs that are immune to existing less Electronic Devices,” Proc. IEEE, vol. debugging approaches. Without effective 96, no. 9, 2008, pp. 1457–1486. debugging tools, energy-harvesting devices are accessible only to a small community of 4. J.A. Paradiso and T. Starner, “Energy Scav- systems experts instead of a wide community enging for Mobile and Wireless Electro- of application-domain experts. nics,” IEEE Pervasive Computing, vol. 4, no. We identified energy interference as the 1, 2005 pp. 18–27. fundamental shortcoming of available de- 5. B. Lucia and B. Ransford, “A Simpler, Safer bugging tools. We designed EDB, the first Programming and Execution Model for energy-interference-free debugging system Intermittent Systems,” Proc. 36th ACM that supports debugging primitives for energy- SIGPLAN Conf. Programming Language harvesting devices, such as energy guards, keep- Design and Implementation (PLDI), 2015, alive assertions, energy watch points, and energy pp. 575–585. breakpoints. Students in our lab and at a grow- 6. A. Colin and B. Lucia, “Chain: Tasks and ing list of other academic institutions have Channels for Reliable Intermittent Pro- successfully used EDB to debug and profile grams,” Proc. ACM SIGPLAN Int’l Conf. applications in scenarios similar to the case Object-Oriented Programming, Systems, Lan- studies we evaluated. guages, and Applications, 2016, pp. 514–530. EDB’s low-cost, compact hardware design makes it suitable for incorporation into next- 7. B. Ransford, J. Sorber, and K. Fu, “Mementos: generation debugging tools and for field System Support for Long-Running Computa- deployment with a target device. In the field, a tion on RFID-Scale Devices,” Proc. 16th Int’l future automatic diagnostic system could lev- Conf. 
Architectural Support for Programming erage EDB to catch rare bugs and automati- Languages and Operating Systems, 2011, pp. cally log memory states from the target device. 159–170. In the lab, EDB can serve research projects 8. K. Ma et al., “Architecture Exploration for that require data on energy consumption and Ambient Energy Harvesting Nonvolatile Pro- program execution on an energy-harvesting cessors,” Proc. IEEE 21st Int’l Symp. High platform, such as an intermittence-aware com- Performance Computer Architecture (HPCA), piler analysis. 2015, pp. 526–537...... 124 IEEE MICRO 9. D. Balsamo et al., “Hibernus: Sustaining Sample received a PhD in electrical engineer- Computation During Intermittent Supply for ing from the University of Washington. He is Energy-Harvesting Systems,” IEEE Embedded amemberofIEEEandACM.Contacthim Systems Letters, vol. 7, no. 1, 2015, pp. at [email protected]. 15–18. 10. M. Buettner, B. Greenstein, and D. Wetherall, Brandon Lucia is an assistant professor in “Dewdrop: An Energy-Aware Task Sched- the Department of Electrical and Computer uler for Computational RFID,” Proc. 8th Engineering at Carnegie Mellon University. USENIX Conf. Networked Systems Design His research interests include the boundaries and Implementation (NSDI), 2011, pp. between computer architecture, compilers, 197–210. system software, and programming lan- guages, applied to emerging, intermittently 11. V. Sundaram et al., “Diagnostic Tracing for powered systems and efficient parallel sys- Wireless Sensor Networks,” ACM Trans. tems. Lucia received a PhD in computer sci- Sensor Networks, vol. 9, no. 4, 2013, pp. ence and engineering from the University of 38:1–38:41. Washington. He is a member of IEEE and 12. J. Yang et al., “Clairvoyant: A Comprehen- ACM. Contact him at [email protected] sive Source-Level Debugger for Wireless or http://brandonlucia.com. Sensor Networks,” Proc. 5th Int’l Conf. Embedded Networked Sensor Systems (SenSys 07), 2007, pp. 189–203.

Alexei Colin is a graduate student in the Department of Electrical and Computer Engineering at Carnegie Mellon University. His research interests include reliability, programmability, and efficiency of software on energy-harvesting devices. Colin received an MSc in electrical and computer engineering from Carnegie Mellon University. He is a student member of ACM. Contact him at [email protected].

Graham Harvey is an associate show electronic engineer at Walt Disney Imagineering. His research interests include real-world applications of wireless technologies to enhance guest experiences in themed environments. Harvey received a BS in electrical and computer engineering from Carnegie Mellon University. He completed the work for this article while interning at Disney Research Pittsburgh. Contact him at [email protected].

Alanson P. Sample is an associate lab director and principal research scientist at Disney Research Pittsburgh, where he leads the Wireless Systems group. His research interests include enabling new guest experiences and sensing and computing devices by applying novel approaches to electromagnetics, RF and analog circuits, and embedded systems. Sample received a PhD in electrical engineering from the University of Washington. He is a member of IEEE and ACM. Contact him at [email protected].

Brandon Lucia is an assistant professor in the Department of Electrical and Computer Engineering at Carnegie Mellon University. His research interests include the boundaries between computer architecture, compilers, system software, and programming languages, applied to emerging, intermittently powered systems and efficient parallel systems. Lucia received a PhD in computer science and engineering from the University of Washington. He is a member of IEEE and ACM. Contact him at [email protected] or http://brandonlucia.com.

Awards

Insights from the 2016 Eckert-Mauchly Award Recipient

URI WEISER Technion–Israel Institute of Technology

I appreciate the opportunity to share with you the insights I presented in my Eckert-Mauchly Award acceptance speech at the 43rd International Symposium on Computer Architecture (ISCA), held in Seoul, South Korea, in June 2016. I would like to thank the Editor in Chief of IEEE Micro, Lieven Eeckhout, for this opportunity.

I am humbled and honored to have received the ACM-IEEE Computer Society Eckert-Mauchly Award. During my nearly 40 years in the field of computer architecture, I have had the privilege of working with many architects, professors, and students at the University of Utah, National Semiconductor, Intel, and the Technion–Israel Institute of Technology, and to collaborate with many colleagues in industry and academia around the world. I see this award as recognition of the computer architecture researchers I have worked with in Israel and abroad.

I was fortunate to work on several state-of-the-art concepts in research and development that impacted the industry and academia alike. Computerization is one of the most rapidly developing trends in human history, influencing almost every aspect of our lives, as it will continue to do for a long time to come. Thus, in this field, it will always be the right time to innovate. In these exciting times, I was lucky to be involved in developing new computer architecture concepts and products that have changed the way we use computers in our daily lives. To be at just the right time and place may not be pure luck. If it happens again and again, it means that you keep trying to make a difference.

The Passion Path
I was born in Tel Aviv, Israel. My parents were German Jews who fled Germany before the Holocaust in 1933. The culture I was exposed to during my childhood was heavily influenced by the necessity of constantly being in survival mode. The main theme was to do your utmost to excel, move forward, and survive.

The values I was nourished on were to take the road less traveled by, to look for new directions, and to challenge the status quo even when the target seems unobtainable: the obligation to innovate in order to find new paths, to debate constructively on any solution, to crystallize the proposed solution, and to be passionate about whatever you do.

Education
Soon after graduating with a BSc in electrical engineering from the Technion and completing my MSc degree while working at the Israeli DoD, I made an audacious decision to pursue my PhD studies in computer science abroad. I had a few options and ultimately chose the University of Utah. At Utah, with Professor Al Davis as my advisor (and, I may also say, my friend), I was exposed to computer architecture and helped pave the way (together with others) toward new graphical and analytical approaches to systolic arrays. This exposure to industry and academia outside of Israel set me on my technical path.

Reading List
• U. Weiser and A.L. Davis, "Wavefront Notation Tool for VLSI Array Design," VLSI Systems and Computations, H.T. Kung, R.F. Sproull, and G.T. Steele, eds., Computer Science Press, 1981, pp. 226–234.
• L. Johnson et al., "Towards a Formal Treatment of VLSI Arrays," Proc. Caltech Conf. VLSI, 1981; http://authors.library.caltech.edu/27041/1/4191 TR 81.pdf.
• U. Weiser et al., "Design of the NS32532 MicroProcessor," Proc. IEEE Int'l Conf. Computer Design: VLSI in Computers and Processors, 1987, pp. 177–180.
• A. Peleg and U. Weiser, Dynamic Flow Instruction Cache Memory Organized Around Trace Segments Independent of Virtual Address Line, US patent 5,381,533, to Intel, Patent and Trademark Office, 1995.
• A. Peleg, S. Wilkie, and U. Weiser, "Intel MMX for Multimedia PCs," Comm. ACM, vol. 40, no. 1, 1997, pp. 25–38.
• A. Peleg et al., The Complete Guide to MMX, McGraw-Hill, 1997.
• T.Y. Morad, U.C. Weiser, and A. Kolodny, ACCMP—Asymmetric Cluster Chip Multiprocessing, tech. report 488, Dept. Electrical Eng., Technion, 2004.
• T. Morad et al., "Performance, Power Efficiency and Scalability of Asymmetric Cluster Chip MultiProcessors," IEEE Computer Architecture Letters, vol. 5, no. 1, 2006, pp. 14–17.
• T. Morad, A. Kolodny, and U.C. Weiser, Multiple Multithreaded Applications on Asymmetric and Symmetric Chip MultiProcessors, tech. report 701, Dept. Electrical Eng., Technion, 2008.
• Z. Guz et al., "Utilizing Shared Data in Chip Multiprocessors with the Nahalal Architecture," Proc. 20th Ann. Symp. Parallelism in Algorithms and Architectures, 2008; doi:10.1145/1378533.1378535.
• Z. Guz et al., "Multi-core vs. Multi-thread Machines: Stay Away from the Valley," IEEE Computer Architecture Letters, vol. 8, no. 1, 2009, pp. 25–28.
• T. Zidenberg, I. Keslassy, and U. Weiser, "Optimal Resource Allocation with MultiAmdahl," Computer, vol. 46, no. 7, 2013, pp. 70–77.
• L. Peled et al., "Semantic Locality and Context-Based Prefetching Using Reinforcement Learning," Proc. ACM/IEEE 42nd Ann. Int'l Symp. Computer Architecture (ISCA), 2015; doi:10.1145/2749469.2749473.
• T. Morad et al., "Optimizing Read-Once Data Flow in Big-Data Applications," IEEE Computer Architecture Letters, 2016; doi:10.1109/LCA.2016.2520927.

Industry
Thereafter, at National Semiconductor (in the mid-1980s), I was lucky to lead the design of the CISC NS32532 processor, the best microprocessor at that time. I learned there how a small team of excited professionals could achieve the impossible (OS run on first silicon). The product was a huge technical success, but unfortunately, the market had already shifted to the "other" CISC processors (68000, MIPS, PowerPC, and X86).

NS32532 insight: Technology is important; having the market behind you is a must.

With this strong insight, I moved to Intel in the late 1980s. Intel's market for the X86 was huge compared to the market for any other microprocessor. As Nick Tredennick said in his 1988 talk, "More 386s are produced at Intel between coffee break and lunch than the number of RISC chips produced all year by RISC vendors." However, Intel management's belief in the X86 product path was not strong enough.

Intel's processors (the X86 family) were based on the "old" complex-instruction-set computer (CISC) architecture, while a few years before, IBM (with its 801 processor) and Berkeley had initiated a new direction—the reduced-instruction-set computer (RISC) processor. A debate emerged within the computing community as to whether the RISC design would eclipse the old CISC design. Intel was contemplating whether to design a new X86 processor using the CISC concept or abandon the program and shift from the company's central product toward a new RISC architecture–based microprocessor (the i860 family). Moving to a new architecture meant losing SW compatibility—that is, writing new software for the entire application base.

At that time, together with a few other architects, I passionately tried to convince Intel executives to continue developing a new generation of CISC-based X86 processors. We did this by showing how, with the addition of new microarchitecture features such as superscalar execution, branch prediction, and split instruction and data caches, the X86 processors could be made to perform competitively against the RISC-based processors. Part of this process included a superb one-page technical document titled "Do Not Kill the Golden Goose," which was sent to Intel's then-CEO, Dr. Andy Grove, and his staff. The debate within Intel took several months, and finally the decision was made to design the next-generation microprocessor based on the old X86 CISC family. The architecture enhancements we proposed laid the foundation for Intel's first Pentium processor.

Pentium insight: Understand the environment; do not follow the trend; be innovative, passionate, and involved. Do not give up; prove that your way is the right way.

Thereafter, I was lucky to be invited to lead Intel's Platform Architecture Center in Santa Clara, California. There, I led a group of researchers and strategists who formulated the first PCI definition, defined Intel's CPU roadmap, and proposed a systems solution for the Pentium processor. This group formed the foundation of the Intel Research Laboratories, established a few years later.

Shortly after enhancing Intel's line of CISC-based processors in the early 1990s, I co-invented and led the development of the MMX architecture. This was the first time (after the i386) that Intel added a full set of instructions to its X86 architecture. The set of 57 instructions was based on a 64-bit single-instruction, multiple-data (SIMD) instruction set that improved the performance of digital signal processing, media and graphics processing, speech recognition, and video encoding/decoding. The new MMX-based processor (the P55C, designed in Israel) was a huge success in the market.
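As a purely illustrative aside (not an example from the article), the short C program below shows what that 64-bit SIMD model looks like to a programmer: a single MMX intrinsic performs eight saturating unsigned 8-bit additions at once, the kind of operation that image blending and other media kernels use heavily. It should build on an x86 machine with, for example, gcc -mmmx.

    #include <mmintrin.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        unsigned char a[8] = {250, 100, 30, 200, 10, 60, 90, 255};
        unsigned char b[8] = { 20,  50, 40, 100, 10, 60, 90,   1};
        unsigned char out[8];
        __m64 va, vb, vsum;

        memcpy(&va, a, 8);
        memcpy(&vb, b, 8);
        vsum = _mm_adds_pu8(va, vb);  /* eight unsigned byte adds, saturating at 255, in one instruction */
        memcpy(out, &vsum, 8);
        _mm_empty();                  /* clear MMX state (EMMS) before any x87 floating-point use */

        for (int i = 0; i < 8; i++)
            printf("%u ", out[i]);    /* prints 255 150 70 255 20 120 180 255 */
        printf("\n");
        return 0;
    }

The saturating behavior (sums clamp at 255 instead of wrapping) is exactly what pixel arithmetic wants, which is one reason media kernels mapped so naturally onto MMX.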

MMX insight: Marketing has a tremendous impact on your success.

Later, Intel invited me to co-lead the foundations of a new design center in Austin, Texas: the Texas Development Center (TDC). At Intel, management usually provides the vision, whereas the strategy is defined bottom up. We had to define our product path and convince Intel to adopt our strategy. Establishing a new Intel design center is a challenging task: hiring and building a team, defining a local culture, defining a new product path, and striving for recognition inside Intel.

Establishing a new design center insight: This challenging task required me to cover the varied domains of architecture, establishing a local culture, hiring, and building a leading team.

After returning to Israel, I realized that processors were reaching the performance wall when operating under a limited power envelope. By narrowing the application range, accelerators can achieve better performance and better performance/power than general-purpose processors. I realized that an on-die accelerator can provide a better, more comprehensive solution. I formulated a new concept called a Streaming Processor (StP). We defined the concept (an X86-based media processor), architecture, SW model, and application range. The main purpose was an on-die X86-based media (streaming) coprocessor. Intel had to choose between two options: an X86 graphics processor or an X86 media processor. Intel management chose the graphics path (Larrabee).

Streaming Processor insight: When you dare, sometimes you fail.

Academia
Along with my industrial work, I kept my ties with the academic world. I continued publishing papers, taught and advised graduate students, and participated in professional conferences. In 1990, together with one of my students at the Technion, Alex Peleg, I invented the Trace Cache, a microarchitecture concept that increases performance and reduces power consumption by storing in-cache the dynamic trace flow of instructions that have already been fetched and decoded. This innovation brought about a fundamental change in the design principles of high-performance processors. The trace cache concept was incorporated into more than 500 million Intel Pentium 4 processors that Intel sold. Digital's EV8 used this architecture enhancement, too.

Trace Cache insight: Always continue to look for new research avenues. Not all will be successful, but some may be.

The limitation of general-purpose processor performance under a limited power envelope became a major performance hurdle. This drove me to strive for better performance/power architecture and led me to pursue the concept of heterogeneous computing in general and asymmetric computing in particular. Initially conceived (as mentioned) for speeding up high-throughput media applications, the concept of heterogeneous computing later served as a means to improve performance and efficiency by using "big cores" for serial code and "small cores" for parallel code and low-energy consumption.

Together with a colleague and a student, I investigated the fundamentals of heterogeneous computing. This research led to new insights such as the MultiAmdahl concept, an analytically based optimal resource division in heterogeneous systems, and the Multi-Core vs. Multithread concept, which avoids the performance valley in multiple-core environments. Additional research activities included new architecture paradigms such as Nahalal, a specialized cache architecture for multicore systems, and Continuous Flow Multithreading, which uses memristors to allow fine-grained switch-on-event multithreading.
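For readers who have not seen this style of analysis, here is a rough sketch in illustrative shorthand rather than the exact formulation of the MultiAmdahl papers in the Reading List. Classic Amdahl's law says that accelerating a fraction $p$ of a workload by a factor $s$ yields an overall speedup of

    $\mathrm{Speedup} = \frac{1}{(1 - p) + p/s}$.

A MultiAmdahl-style question then asks how to divide a fixed resource budget $X$ (chip area or power, say) among $n$ heterogeneous units, where unit $i$ serves workload fraction $f_i$ with a speedup $s_i(x_i)$ that grows with its resource share $x_i$. The runtime to minimize is

    $T(x_1, \ldots, x_n) = \sum_{i=1}^{n} \frac{f_i}{s_i(x_i)}$, subject to $\sum_{i=1}^{n} x_i \le X$.

Introducing a Lagrange multiplier $\lambda$ for the budget constraint gives the balance condition $f_i\, s_i'(x_i) / s_i(x_i)^2 = \lambda$ for every unit $i$: at the optimum, the marginal reduction in runtime per additional unit of resource is the same across all the units.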
Heterogeneous insight: Technology changes over time. Be ready to take advantage of the changes that may lead to new avenues.

The introduction of the new Big Data environment calls for re-evaluation of our existing solutions. We started to direct our research toward a more effective solution for the new environment and came up with two new concepts: the Funnel, a computing element whose input bandwidth is much wider than its output bandwidth, and Non-Temporal Locality Memory Access, exhibited by some Big Data applications.

Big Data insight: Watch for a change in the environment and the validity of current computing solutions. We often need to tune, change, and/or adapt our computing structure to accept the new requirements.

My professional path from industry to academia was, in a way, a calculated decision. Its purpose was to prolong my technical career. Academia provides a limitless professional trail (as long as you are productive) that is not always available in industry. In addition, academia keeps you in young, vibrant surroundings in which the research targets are to look forward, innovate, and open new technological avenues for the industry to follow.

I believe that the current slowdown in the process technology trend, combined with technological limitations on energy and power, will place the burden of revitalizing computing technology on researchers in the computer architecture field. Thus, I believe that we are on the verge of new architectural findings. The performance/application/capability baton is being handed to you, the architects. Take it, and go do wonderful things!

I have enjoyed being part of a group of architects that made big changes in computer architecture, and I continue to enjoy the interactions, the passion, and the unforgettable ride. MICRO

Uri Weiser is an emeritus professor in the Electrical Engineering Department at the Technion–Israel Institute of Technology (IIT). Contact him at [email protected].

Micro Economics

Two Sides to Scale

SHANE GREENSTEIN Harvard Business School

It used to be that only AT&T, oil companies, and Soviet enterprises could aspire to monstrous size. Technology firms entered that club only in rare circumstances, and when IBM, Intel, and Microsoft did so, they each found their own path to headlines.

We live in a different era today. The largest organizations on the planet are leading technology firms. These firms aspire to sell tens of billions of dollars of products and services, employ hundreds of thousands of workers, and lure investors to value their organizations at hundreds of billions of dollars. They deploy worldwide distribution, complemented by worldwide supply of inputs, growing brand names recognized in Canada, the Kalahari Desert, and the Caribbean.

Each of the big four—Alphabet, Amazon, Apple, and Facebook—has achieved this unprecedented scale. Large and valuable? Check. Hundreds of thousands of employees? At last count. Global in aspirations and operations? You bet. Endless opportunities in front of them? So it seems. A few others—such as Microsoft, Intel, IBM, SAP, and Oracle—could round out a top 10 list. A few more young firms—such as Uber, Airbnb, and Alibaba—aspire to be the tenth tomorrow.

Mainstream economics regards this scale with either praise or alarm. One view marvels at the spread of such startling efficiency, low prices, and wide variety. A contrasting view worries about the distortions from concentrating decision making in a single large firm. Let's compare and contrast those views.

Scale Is a Moving Target
Scale cannot be achieved without operations that produce and deliver many services or products whose price exceeds their cost. Accordingly, one advantage of scale is replication of operations. Take Alphabet's search engine, Google. Their engineers learned to parse the web in one language and extended the approach to other languages. Software algorithms that worked in one language can work in others. User-oriented processes that helped build loyalty in one language build it in another—for example, "Did you mean to misspell that word, or would you prefer an alternative?"

That does not happen by itself. Processes must be well documented, and the knowledge about them must pass between employees. Replication then yields gains the second and third and eighteenth time. To say it another way, Google faces a lower cost supporting search in another language than anyone supporting just one.

At one level, this is not new. Others have benefited from the economics of replication in the past. Technology firms brought it to new heights owing to the rise of the worldwide Internet and the ease of moving software to new locations.

Consider Amazon, which started its life as an electronic retailer and never let up on relentless expansion. It started in books, and now has expanded to every conceivable product category. In the process, it developed an operation to support the worldwide sale and distribution of its products, achieving a scale never before seen in any retailer other than Walmart.

Here is the remarkable part. Walmart does not rent out its warehouses and trucks and computing and order-fulfillment staff to anyone else. It does not rent its staff's insights about how to secure its IT, nor its management's insights about how to fulfill customer demand. It regards these as foundational trade secrets.

Amazon's management, in contrast, took these services and developed, refined, and standardized their use for others. In their retail operations, they both resell for others and perform order fulfillment for others. They also developed a range of back-end services to sell to others, layered on top of additional options for software applications and a range of needs. It is called Amazon Web Services (AWS). Both of these are available to other firms, even some of Amazon's ostensible competitors in retail services.

I cannot think of any other comparable historical example where a large firm has developed such scale and also grown by making its processes available to others. (If you can think of one, feel free to suggest it. I would love to hear from you.)

To be sure, there is more than economics behind this achievement. After all, the malleability and mobility of software also contribute to the scale seen in these two examples. So too does the legal system, as writers such as Tom Friedman have noted. This worldwide scale takes advantage of all the efforts to coordinate global markets over the last half century. Diplomats went to great effort decades ago to standardize processes for imports and exports and remove frictions in the settlement of accounts across international boundaries.

It is still funny when you think about it. These frictions were reduced to benefit the prior generation's global companies, such as McDonald's and IBM, not to mention Coca-Cola, Boeing, Caterpillar, and Goldman Sachs. Today these same rules benefit several firms selling services that Ray Kroc and Thomas Watson Jr. never could have imagined.

Decision Making
A less appealing attribute accompanies scale: concentration of decision making. To begin, let's recognize that popular discussion often gets this one wrong. Hollywood likes dystopian conspiracy theories in which quasi-evil executives manipulate society for selfish reasons. However, the problems are usually less sinister than that. Even with the best-intentioned executives, the biggest firms make decisions that can have enormous consequences, many unintended.

Facebook's recent travails are a good illustration of one type of problem. Recall that, despite multiple complaints about the manipulation of its algorithm and advertising program during the election, Facebook refused to intervene in policing fake news stories, many of which were invented out of whole cloth for the purposes of making some ad revenue from a hyped electorate. After the election, Mark Zuckerberg cloaked his firm's behavior in the language of free speech and user choice.

What a tin ear from Zuck. Irrespective of your short-term political outlook, invented news is plainly not good for democratic societies. And, more narrowly, Facebook's long-term fortunes depended on the credibility of the material being shared. A platform polluted with lies does not attract participation.

The point is this: all firms mess up. When large firms mess up, more of society pays the cost.

More to the point, scale makes competitive discipline more difficult to apply when firms mess up. For example, many years ago Apple had a series of policies for its new smartphone that prevented developers from spreading porn and gambling apps, and that made a lot of sense. But Apple kept expanding its requirements, eventually angering many programmers with rules about owning data. That gave an opening to an alternative platform with less restrictive rules, and Android took advantage of that opportunity. In short, that is competitive discipline incarnate: when a big firm messes up, competitors gain.

Ah, therein lies the problem. Scale can sometimes provide almost impenetrable insulation from competitive discipline. As noted earlier, for example, in many languages nobody can challenge Google, so, effectively, nobody can discipline them when they mess up. And in the earlier example, who stepped in when Facebook messed up? What were the alternatives? In other words, the absence of competitive discipline arises occasionally, almost by definition, whenever large-scale firms are involved.

Perhaps more awkwardly, today's platforms support large ecosystems, in which the leading firms coordinate many actions in that ecosystem. Occasionally, a leading firm bullies a smaller one in the ecosystem, but the more common issue might be called "uncomfortable dependence."

Let's pick on Apple again for an illustration of dependence. Years ago, and for a number of reasons, Steve Jobs refused to let Flash deploy on Apple products. Whether you think those motives were justified or selfish (which typically gets the attention in a conversation about this topic), let's focus on the more mundane fact that nobody questions: one person held enough authority to kill a part of the ecosystem, which until then had been a thriving software language. It does not matter why he did it. Jobs' decision devalued a set of skills held by many programmers, reducing the return on investments those programmers made in refining and developing those skills.

That is not the only form dependence takes. Once again, Alphabet provides a good illustration in its Google News service. Whether you like it or not, Google News has consistently interpreted copyright law online in a way that permits them to show parts of another party's content. For understandable reasons, and like almost all news organizations worldwide, Spanish newspapers were among the complainers. But they took one more step, and had their country's legislature pass a law requiring Google to pay for the content, even small slices. Long story short, Google refused to pay, and it shut down Google News in Spain. In no time, those newspapers saw their traffic drop, and they were begging to get it back. Now that is dependence for you.

I am not going to argue about who was right or wrong. Rather, my point is this: such dependence tends to arise in virtually every setting in which scale concentrates authority. And so the self-interested strategic decisions of one set of executives have consequences for so many others. We have already seen that when firms mess up, many pay a cost. Even when leading firms don't mess up, their intended decisions can impose worry on others.

The spread of efficiency is breathtaking. The potential dangers from concentrating managerial decision making are worrying. After looking more closely, these do not seem like two different perspectives. These are more like yin and yang: it is not possible to have one without the other—and they are an unavoidable feature of our times. MICRO

Shane Greenstein is a professor at the Harvard Business School. Contact him at [email protected].