IEEE Micro
May/June 2017, Volume 37, Number 3
www.computer.org/micro

The magazine for chip and silicon systems designers

Top Picks from the 2016 Computer Architecture Conferences
Reflections from Uri Weiser, p. 126
Features

6 Guest Editors' Introduction: Top Picks from the 2016 Computer Architecture Conferences
Aamer Jaleel and Moinuddin Qureshi

12 Using Dataflow to Optimize Energy Efficiency of Deep Neural Network Accelerators
Yu-Hsin Chen, Joel Emer, and Vivienne Sze

22 The Memristive Boltzmann Machines
Mahdi Nazm Bojnordi and Engin Ipek

30 Analog Computing in a Modern Context: A Linear Algebra Accelerator Case Study
Yipeng Huang, Ning Guo, Mingoo Seok, Yannis Tsividis, and Simha Sethumadhavan

40 Domain Specialization Is Generally Unnecessary for Accelerators
Tony Nowatzki, Vinay Gangadhar, Karthikeyan Sankaralingam, and Greg Wright

52 Configurable Clouds
Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Daniel Firestone, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger

62 Specializing a Planet's Computation: ASIC Clouds
Moein Khazraee, Luis Vega Gutierrez, Ikuo Magaki, and Michael Bedford Taylor

70 DRAF: A Low-Power DRAM-Based Reconfigurable Acceleration Fabric
Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna T. Malladi, Hongzhong Zheng, Bob Brennan, and Christos Kozyrakis

80 Agile Paging for Efficient Memory Virtualization
Jayneel Gandhi, Mark D. Hill, and Michael M. Swift

88 Transistency Models: Memory Ordering at the Hardware–OS Interface
Daniel Lustig, Geet Sethi, Abhishek Bhattacharjee, and Margaret Martonosi

98 Toward a DNA-Based Archival Storage System
James Bornholt, Randolph Lopez, Douglas M. Carmean, Luis Ceze, Georg Seelig, and Karin Strauss

106 Ti-states: Power Management in Active Timing Margin Processors
Yazhou Zu, Wei Huang, Indrani Paul, and Vijay Janapa Reddi

116 An Energy-Aware Debugger for Intermittently Powered Systems
Alexei Colin, Graham Harvey, Alanson P. Sample, and Brandon Lucia

Departments

4 From the Editor in Chief: Thoughts on the Top Picks Selections
Lieven Eeckhout

126 Awards: Insights from the 2016 Eckert-Mauchly Award Recipient
Uri Weiser

130 Micro Economics: Two Sides to Scale
Shane Greenstein
From the Editor in Chief

Thoughts on the Top Picks Selections
Lieven Eeckhout, Ghent University
The May/June issue of IEEE Micro traditionally features a selection of articles called Top Picks that have the potential to influence the work of computer architects for the near future. A selection committee of experts selects these articles from the previous year's computer architecture conferences; the selection criteria are novelty and potential for long-term impact. Any paper published in the top computer architecture conferences of 2016 was eligible, which makes the job of the selection committee both a challenge and a pleasure. Selections are based on the original conference paper and a three-page write-up that summarizes the paper's key contributions and potential impact. We received a record number of 113 submissions this year.

Aamer Jaleel and Moinuddin Qureshi chaired the selection committee, which comprised 33 experts. I wholeheartedly thank them and their committee for having done such a great job. As they note in the Guest Editors' Introduction, Aamer and Moin introduced a novel two-phase review procedure. Four committee members reviewed each paper during the first round. A subset of the papers was selected to move to the second round based on the reviewers' scores and online discussion of the first round. Six more committee members reviewed each paper during the second round; second-round papers thus received a total of 10 reviews! This formed the basic input for the in-person selection committee meeting.

The selection committee reached a consensus on 12 Top Picks and 12 Honorable Mentions. Top Pick selections were invited to prepare an article to be included in this special issue. Because these magazine articles are much shorter than the original conference papers, they tend to be more high-level and more qualitative than the original conference publications, providing an excellent introduction to these highly innovative contributions. The Honorable Mentions are top papers that the selection committee unfortunately could not recognize as Top Picks because of magazine space constraints; these are acknowledged in the Guest Editors' Introduction. I encourage you to read these important contributions to our field and share your thoughts with students and colleagues.

Having participated in the selection committee myself, I was deeply impressed by the effectiveness of the new review process. In particular, I found it interesting to observe that the committee reached a consensus that very closely aligned with the ranking obtained by the 10 reviews for each of the second-round papers. This makes me wonder whether we still need an in-person selection committee meeting. Of course, the meeting itself has great value in terms of generating interesting discussions and providing the opportunity to meet colleagues from our community, but it undeniably also imposes a big cost in terms of time, effort, money, and carbon footprint (with many committee members flying in and out from all over the world).

Glancing over the set of papers selected for Top Picks and Honorable Mentions, one important trend has emerged just recently: the focus on accelerators and hardware specialization. A good number of papers are related to hardware acceleration in the broad sense. This does not come as a surprise given current application trends, along with the end of Dennard scaling, which pushes architects to improve system performance within stringent power and cost envelopes through hardware acceleration. We observe this trend throughout the entire computing landscape, from mobile devices to large-scale datacenters. There is a lot of exciting research and advanced development going on in this area by many research groups in industry and academia, and I expect many more important advances in the near future. Next to this emerging trend, there is (still) a good fraction of outstanding papers in more traditional areas, including microarchitecture, memory hierarchy, memory consistency, multicore, power management, security, and simulation methodology.

I want to share a couple more thoughts with you regarding the Top Picks procedure that arose from conversations I've had with various people in our community. I'd love to get the broader community's feedback on this, so please don't hesitate to contact me and share your thoughts.

One thought relates to the number of selected Top Picks being too restrictive. There is a hard cap of only 12 Top Picks. On one hand, we want the process to be selective and Top Picks recognition to be prestigious. On the other hand, our community is growing. Our top-tier conferences, such as ISCA, MICRO, HPCA, and ASPLOS, receive an ever-increasing number of papers to review, and the number of accepted papers is increasing as well. One could argue that in response we need to recognize more papers as Top Picks. The hard constraint that we are hitting here is the page limit we have for the magazine, because the number of pages is related to the production cost. One solution may be to have more Top Picks selections but fewer pages allocated per selected article, although this may compromise the comprehensiveness of the articles. Another solution may be to recognize more Honorable Mentions, because they don't affect the page count. Or, we may want to electronically publish the three-page Top Picks submissions (paper summary and potential impact, as mentioned earlier) as they are, if the authors agree. This would not incur any production cost at all, yet the community would benefit from reading them. Yet another solution may be to select more than 12 Top Picks and publish them in different issues of the magazine. The counterargument here is that we have only six issues per year, which makes it difficult to argue for more than one issue devoted to Top Picks.

Another issue relates to the timing of the Top Picks selection. Our community has relatively few awards, and Top Picks is an important vehicle in our community to recognize top-quality research. However, one may argue whether selecting Top Picks one year after publication is too soon; it might make sense to wait a couple more years before recognizing the best research contributions of the year. We may not want to wait as long as ISCA's Influential Paper Award (15 years after publication) and MICRO's Test of Time Award (18 to 22 years after publication), but still, one could argue for waiting a few more years before understanding the true value of a novel research contribution and how it impacts our field. An important argument in this discussion is that awards are generally more important to young researchers than they are to senior researchers. Young researchers looking for a faculty or research position in a leading academic institute or industry lab need recognition fairly soon in their careers, as they compete with researchers from other fields that have more awards. Senior researchers, on the other hand, do not need the recognition as much, or at least their time scale is (much) longer.

Please let me know your thoughts on these ideas or any other concerns you may have. I'm open to any suggestions. My only concern is to make sure Top Picks continues to recognize the best research in our field while serving the best interests of both the community and IEEE Micro.

Before wrapping up, I want to highlight that this issue also includes an award testimonial. Uri Weiser received the 2016 Eckert-Mauchly Award for his seminal contributions to the field of computer architecture over the course of his 40-year career in industry and academia. Uri Weiser single-handedly convinced Intel executives to continue designing CISC-based x86 processors by showing that, through the addition of new features such as superscalar execution, branch prediction, and split instruction and data caches, the x86 processors could be made competitive with the RISC family of processors initiated by IBM and Berkeley. This laid the foundation for the Intel Pentium processor. Uri Weiser made several other seminal contributions, including the design of instruction-set extensions (that is, Intel's MMX) for supporting multimedia applications. The Eckert-Mauchly Award is considered the computer architecture community's most prestigious award. I wholeheartedly congratulate Uri Weiser on the award and thank him for his insightful testimonial.

With that, I wish you happy reading, as always!

Lieven Eeckhout
Editor in Chief
IEEE Micro

Lieven Eeckhout is a professor in the Department of Electronics and Information Systems at Ghent University. Contact him at [email protected].

Guest Editors' Introduction

TOP PICKS FROM THE 2016 COMPUTER ARCHITECTURE CONFERENCES
Aamer Jaleel, Nvidia
Moinuddin Qureshi, Georgia Tech

It is our pleasure to introduce this year's Top Picks in Computer Architecture. This issue is the culmination of the hard work of the selection committee, which chose from 113 submissions that were published in computer architecture conferences in 2016. We followed the precedent set by last year's co-chairs and encouraged the selection committee members to consider characteristics that make a paper worthy of being a "top pick." Specifically, we asked them to consider whether a paper challenges conventional wisdom, establishes a new area of research, is the definitive "last word" in an established research area, has a high potential for industry impact, and/or is one they would recommend to others to read.

Since the number of papers that could be selected for this Top Picks special issue was limited to 12, we continued the precedent set over the past two years of having the selection committee recognize 12 additional high-quality papers for Honorable Mention. We strongly encourage you to read these papers (see the "Honorable Mentions" sidebar). Before we present the list of articles appearing in this special issue, we will first describe the new review process that we implemented to improve the paper selection process.
Honorable Mentions
"Exploiting Semantic Commutativity in Hardware Speculation" by Guowei Zhang, Virginia Chiu, and Daniel Sanchez (MICRO 2016). This paper introduces architectural support to exploit a broad class of commutative updates, enabling update-heavy applications to scale to thousands of cores.

"The Computational Sprinting Game" by Songchun Fan, Seyed Majid Zahedi, and Benjamin C. Lee (ASPLOS 2016). Computational sprinting is a mechanism that supplies extra power for short durations to enhance performance. This paper introduces game theory for allocating shared power between multiple cores.

"PoisonIvy: Safe Speculation for Secure Memory" by Tamara Silbergleit Lehman, Andrew D. Hilton, and Benjamin C. Lee (MICRO 2016). Integrity verification is a main cause of slowdown in secure memories. PoisonIvy provides a way to enable safe speculation on unverified data by tracking the instructions that consume the unverified data using poisoned bits.

"Data-Centric Execution of Speculative Parallel Programs" by Mark C. Jeffrey, Suvinay Subramanian, Maleen Abeydeera, Joel Emer, and Daniel Sanchez (MICRO 2016). The authors' technique enables speculative parallelization (such as thread-level speculation and transactional memory) to scale to thousands of cores. It also makes speculative parallelization as easy to program as sequential programming.

"Efficiently Scaling Out-of-Order Cores for Simultaneous Multithreading" by Faissal M. Sleiman and Thomas F. Wenisch (ISCA 2016). This paper demonstrates that it is possible to unify in-order and out-of-order issue into a single, integrated, energy-efficient SMT microarchitecture.

"Racer: TSO Consistency via Race Detection" by Alberto Ros and Stefanos Kaxiras (MICRO 2016). The authors propose a scalable approach to enforce coherence and TSO consistency without directories, timestamps, or software intervention.

"The Anytime Automaton" by Joshua San Miguel and Natalie Enright Jerger (ISCA 2016). This paper provides a general, safe, and robust approximate computing paradigm that abstracts away the challenge of guaranteeing user acceptability from the system architect.

"Accelerating Markov Random Field Inference Using Molecular Optical Gibbs Sampling Units" by Siyang Wang, Xiangyu Zhang, Yuxuan Li, Ramin Bashizade, Song Yang, Chris Dwyer, and Alvin R. Lebeck (ISCA 2016). This paper proposes cross-layer support for probabilistic computing using novel technologies and specialized architectures.

"Stripes: Bit-Serial Deep Neural Network Computing" by Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor M. Aamodt, and Andreas Moshovos (MICRO 2016). The authors demonstrate that bit-serial computation can lead to high-performance and energy-efficient designs whose performance and accuracy adapt to precision at a fine granularity.

"Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL" by Donggyu Kim, Adam Izraelevitz, Christopher Celio, Hokeun Kim, Brian Zimmer, Yunsup Lee, Jonathan Bachrach, and Krste Asanović (ISCA 2016). This paper proposes a sample-based RTL energy modeling methodology for fast and accurate energy evaluation.

"Back to the Future: Leveraging Belady's Algorithm for Improved Cache Replacement" by Akanksha Jain and Calvin Lin (ISCA 2016). The authors' algorithm enhances cache replacement by learning from the replacement decisions made by Belady's algorithm. The paper also presents a novel mechanism to efficiently simulate the behavior of Belady's algorithm.

"ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars" by Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R. Stanley Williams, and Vivek Srikumar (ISCA 2016). The authors advance the state of the art in deep network accelerators by an order of magnitude and overcome the challenges of analog-digital conversion with innovative encodings and pipelines suitable for precise and energy-efficient analog acceleration.
Selection Committee

Tor Aamodt, University of British Columbia; Alaa Alameldeen, Intel; Murali Annavaram, University of Southern California; Todd Austin, University of Michigan; Chris Batten, Cornell University; Luis Ceze, University of Washington; Sandhya Dwarkadas, University of Rochester; Lieven Eeckhout, Ghent University; Joel Emer, Nvidia and MIT; Babak Falsafi, EPFL; Hyesoon Kim, Georgia Tech; Nam Sung Kim, University of Illinois at Urbana–Champaign; Benjamin Lee, Duke University; Hsien-Hsin Lee, Taiwan Semiconductor Manufacturing Company; Gabriel Loh, AMD; Debbie Marr, Intel; Andreas Moshovos, University of Toronto; Onur Mutlu, ETH Zurich; Ravi Nair, IBM; Milos Prvulovic, Georgia Tech; Scott Rixner, Rice University; Eric Rotenberg, North Carolina State University; Karu Sankaralingam, University of Wisconsin; Yanos Sazeides, University of Cyprus; Simha Sethumadhavan, Columbia University; Andre Seznec, INRIA; Dan Sorin, Duke University; Viji Srinivasan, IBM; Karin Strauss, Microsoft; Tom Wenisch, University of Michigan; Antonia Zhai, University of Minnesota
Review Process

A selection committee comprising 31 members reviewed all 113 papers (see the "Selection Committee" sidebar). This year, we tried a different selection process compared to previous years' Top Picks, keeping in mind the constraints and objectives that are unique to Top Picks. The conventional approach to Top Picks selection has largely remained similar to that used in our conferences (for example, four to five reviews per paper and a four-to-six-point grading scale). For Top Picks, the number of papers that can be accepted is fixed (11 to 12), and the selection committee's primary job is to identify the top 12 papers out of all the submitted papers, instead of providing a detailed critique of the technical work and how the paper can be improved. The papers submitted to Top Picks tend to be of much higher (average) quality than the typical paper submitted at our conferences, and in many cases the reviewers are already aware of the work (through prior reviewing, reading the papers, or attending the presentations). Therefore, the time and effort spent reviewing Top Picks papers tends to be less than that spent reviewing typical conference submissions.

We identified two key areas in which the Top Picks selection process could be improved. First, a small number of reviewers (approximately five) made the decisions for Top Picks. The confidence in selection could be improved significantly by having a larger number of reviews (approximately 10) per paper, especially for the papers that are likely to be discussed at the selection committee meeting. This also ensures that reviewers are more engaged at the meeting and make informed decisions. Second, the selection of Top Picks gets overly influenced by excessively harsh or generous reviewers, who either give scores at extreme ends or advocate for too few or too many papers from their stack. We wanted to ensure that all reviewers play an equal role in the selection, regardless of their harshness or generosity. For example, we could give all reviewers an equal voice by requiring them to advocate for a fixed number of papers from their stack. We used the data from the past three years' Top Picks meetings to analyze the process, and we used this data to drive the design of our process. For example, the typical acceptance rate of Top Picks is approximately 10 percent; therefore, if we assign 15 papers to each reviewer, then each reviewer can be expected to have only 1.5 Top Picks papers on average in their stack, and the likelihood of having 5 or more Top Picks papers in the stack would be extremely small.

Based on the data and constraints of Top Picks, we formulated a ranking-based two-phase process. The objective of the first phase was to filter about 35 to 40 papers that would be discussed at the selection committee meeting. The objective of the second phase was to increase the number of reviews per paper to about 10 and to ask each reviewer to provide a concrete decision for each assigned paper: whether it should be selected as a Top Pick, an Honorable Mention, or neither.

In the first phase, each reviewer was assigned exactly 14 papers and was asked to recommend exactly five papers (Top 5) to the second phase. Each paper received four ratings in this phase. If a paper got three or more ratings of Top 5, it automatically advanced to the second phase. If the paper had two ratings of Top 5, then both positive reviewers had to champion the paper for it to advance to the second phase. Papers with fewer than two ratings of Top 5 did not advance to the second phase. A total of 38 papers advanced to the second phase, and each such paper got a total of 9 to 10 reviews. In the second phase, each reviewer was assigned seven to eight papers in addition to the four to five papers that survived the first phase. Each reviewer thus had 12 papers and was asked to place exactly 4 of them into each category: Top Pick, Honorable Mention, and neither.

The selection committee meeting was held in person in Atlanta, Georgia, on 17 December 2016. At the meeting, the 38 papers were rank-ordered on the basis of the number of Top Pick votes and the average rating each paper received in the second phase. If, after the in-person discussion, 60 percent or more of the reviewers rated a paper as a Top Pick, the paper was selected as a Top Pick. Otherwise, the decision to select the paper as a Top Pick (or Honorable Mention or neither) was made by a committee-wide vote using a simple majority. We observed that the top eight ranked papers all got accepted as Top Picks, and four more papers were selected as Top Picks from the next nine papers. Overall, out of the top 25 papers, all but one was selected as either a Top Pick or an Honorable Mention. Thus, having a large number of reviews per paper reduced the dependency on the in-person discussion.

Coincidentally, the day before the selection committee meeting there was a hurricane, which caused many flights to be canceled, and 4 of the 31 selection committee members were unable to attend the meeting. However, having 9 to 10 reviewers per paper still ensured that there were at least eight reviewers present for each paper discussed at the meeting, resulting in a robust and high-confidence process, even with a relatively high rate of absentees. Given the unique constraints and objectives of Top Picks, we hope that such a process, with a larger number of reviews per paper and robustness to variation in the generosity levels of reviewers (for example, ranking papers into fixed-sized bins), will be useful for future Top Picks selection committees as well.
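As a concrete illustration of the first-phase advancement rule described above, the following is a minimal sketch in Python. The data structures and function names are hypothetical; the actual committee tooling is not described in this introduction.

```python
# Sketch of the phase-1 advancement rule (hypothetical data structures;
# the committee's actual tooling is not described here).

def advances_to_phase2(top5_ratings, champions):
    """top5_ratings: how many of the paper's 4 reviewers rated it Top 5.
    champions: how many of those positive reviewers champion the paper."""
    if top5_ratings >= 3:
        return True            # three or more Top-5 ratings: automatic advance
    if top5_ratings == 2:
        return champions == 2  # both positive reviewers must champion it
    return False               # fewer than two Top-5 ratings: no advance

# Example: a paper with two Top-5 ratings advances only if both champion it.
assert advances_to_phase2(3, 0) is True
assert advances_to_phase2(2, 2) is True
assert advances_to_phase2(2, 1) is False
assert advances_to_phase2(1, 1) is False
```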
Selected Papers

With the slowing down of conventional means for improving performance, the architecture community has been investigating accelerators to improve performance and energy efficiency. This was evident in the large number of papers on accelerators appearing throughout the architecture conferences in 2016. Given the emphasis on accelerators, it is no surprise that more than half of the articles in this issue focus on architecting accelerators. Memory systems and energy considerations are two other areas from which the Top Picks papers were selected.

Accelerators

Data movement is a primary factor that determines the energy efficiency and effectiveness of accelerators. "Using Dataflow to Optimize Energy Efficiency of Deep Neural Network Accelerators" by Yu-Hsin Chen and his colleagues describes a spatial architecture that optimizes the dataflow for energy efficiency. This article also has an insightful framework for classifying different accelerators based on access patterns.

"The Memristive Boltzmann Machines" by Mahdi Nazm Bojnordi and Engin Ipek proposes a memory-centric hardware accelerator for combinatorial optimization and deep learning that leverages in-situ bit-line computation in memristive arrays to eliminate the need for exchanging data between the memory arrays and the computational units.

The concept of using analog computing for efficient computation is also explored by Yipeng Huang and colleagues in "Analog Computing in a Modern Context: A Linear Algebra Accelerator Case Study." The authors try to address the typical challenges faced by analog computing, such as limited problem size, limited dynamic range, and precision.

In contrast to the first three articles, which use domain-specific acceleration, "Domain Specialization Is Generally Unnecessary for Accelerators" by Tony Nowatzki and his colleagues focuses on retaining the programmability of accelerators while maintaining their energy efficiency. The authors use an architecture that has a large number of tiny cores with the key building blocks typically required for accelerators, and they configure these cores intelligently based on the domain requirement.

Large-Scale Accelerators

The next three articles look at enhancing the scalability of accelerators so that they can handle larger problem sizes and cater to varying problem domains. The article "Configurable Clouds" by Adrian Caulfield and his colleagues describes a cloud-scale acceleration architecture that can connect different accelerator nodes within a datacenter using a high-speed FPGA fabric, which lets the system accelerate a wide variety of applications and has been deployed in Microsoft datacenters.

In "Specializing a Planet's Computation: ASIC Clouds," Moein Khazraee and his colleagues target scale-out workloads comprising many independent but similar jobs, often on behalf of many users.
This architecture shows a way to make ASIC usage more economical, because different users can potentially share the cost of fabricating a given ASIC, rather than each design team incurring the cost of fabricating the ASIC.

"DRAF: A Low-Power DRAM-Based Reconfigurable Acceleration Fabric" by Mingyu Gao and his colleagues describes a way to increase the size of FPGA fabrics at low cost by using DRAM instead of SRAM for the storage inside the FPGA, thereby enabling a high-density and low-power reconfigurable fabric.

Memory and Storage Systems

Memory systems continue to be important in determining the performance and efficiency of computer systems. This issue features three articles that focus on improving memory and storage systems. "Agile Paging for Efficient Memory Virtualization" by Jayneel Gandhi and his colleagues addresses the performance overhead of virtual memory in virtualized environments by getting the best of both worlds: nested paging and shadow paging.

Virtual address translation can sometimes affect the correctness of memory consistency models. Daniel Lustig and his colleagues address this problem in their article, "Transistency Models: Memory Ordering at the Hardware–OS Interface." The authors propose to rigorously integrate memory consistency models and address translation at the microarchitecture and operating system levels.

Moving on to the storage domain, in "Toward a DNA-Based Archival Storage System," James Bornholt and his colleagues demonstrate DNA-based storage architected as a key-value store. Their design enables random access and is equipped with error-correction capability to handle the imperfections of the read and write process. As the demand for cheap storage continues to increase, such alternative technologies have the potential to provide a major breakthrough in storage capability.

Energy Considerations

The final two articles are related to optimizing energy or operating under low energy budgets. Modern processors are provisioned with a timing margin to protect against temperature inversion. In the article "Ti-states: Power Management in Active Timing Margin Processors," Yazhou Zu and his colleagues show how actively monitoring the temperature on the chip and dynamically reducing this timing margin can result in significant power savings.

Energy-harvesting systems represent an extreme end of energy-constrained computing, in which the system performs computation only when harvested energy is present. One challenge in such systems is to provide debugging functionality for software, because system failure could happen due to either lack of energy or incorrect code. "An Energy-Aware Debugger for Intermittently Powered Systems" by Alexei Colin and his colleagues describes a hardware–software debugger for an intermittent energy-harvesting system that allows software verification to proceed without interference from the energy-harvesting circuit.

We hope you enjoy reading these articles and that you will explore both the original conference versions and the Honorable Mention papers. We welcome your feedback on this special issue and any suggestions for next year's Top Picks issue.

Acknowledgments

We thank Lieven Eeckhout for providing support and direction as we tried out the new paper selection process. Lieven also handled the papers that were conflicted with both co-chairs. We also thank the selection committee co-chairs for the past three Top Picks issues (Gabe Loh, Babak Falsafi, Luis Ceze, Karin Strauss, Milo Martin, and Dan Sorin) for providing the review statistics from their editions of Top Picks and for answering our questions.
We thank Vinson Young for handling the submission website, and Prashant Nair and Jian Huang for facilitating the process at the selection committee meeting. We owe a huge thanks to our fantastic selection committee, whose members not only diligently reviewed all the papers but also were supportive of the new review process. Furthermore, the selection committee members spent a day attending the in-person meeting in Atlanta, fairly close to the holiday season. Finally, we thank all the authors who submitted their work for consideration to this Top Picks issue, and the authors of the selected papers for producing the final versions of their papers for this issue.
Aamer Jaleel is a principal research scientist at Nvidia. Contact him at [email protected].

Moinuddin Qureshi is an associate professor in the School of Electrical and Computer Engineering at Georgia Tech. Contact him at [email protected].
USING DATAFLOW TO OPTIMIZE ENERGY EFFICIENCY OF DEEP NEURAL NETWORK ACCELERATORS
Yu-Hsin Chen, Massachusetts Institute of Technology
Joel Emer, Nvidia and Massachusetts Institute of Technology
Vivienne Sze, Massachusetts Institute of Technology

The authors demonstrate the key role dataflows play in optimizing energy efficiency for deep neural network (DNN) accelerators. They introduce both a systematic approach to analyze the problem and a new dataflow, called Row-Stationary, that is up to 2.5 times more energy efficient than existing dataflows in processing a state-of-the-art DNN. This article provides guidelines for future DNN accelerator designs.
Recent breakthroughs in deep neural networks (DNNs) are leading to an industrial revolution based on AI. The superior accuracy of DNNs, however, comes at the cost of high computational complexity. General-purpose processors no longer deliver sufficient processing throughput and energy efficiency for DNNs. As a result, demands for dedicated DNN accelerators are increasing in order to support the rapidly growing use of AI.

The processing of a DNN mainly comprises multiply-and-accumulate (MAC) operations (see Figure 1). Most of these MACs are performed in the DNN's convolutional layers, in which multichannel filters are convolved with multichannel input feature maps (ifmaps, such as images). This generates partial sums (psums) that are further accumulated into multichannel output feature maps (ofmaps). Because the MAC operations have few data dependencies, DNN accelerators can use high parallelism to achieve high processing throughput. However, this processing also requires a significant amount of data movement: each MAC performs three data reads and one data write. Because moving data can consume more energy than the computation itself,1 optimizing data movement becomes key to achieving high energy efficiency.

Figure 1. In the processing of a deep neural network (DNN), multichannel filters are convolved with the multichannel input feature maps, which then generate the output feature maps. The processing of a DNN comprises many multiply-and-accumulate (MAC) operations.
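As a concrete reference for the discussion that follows, here is a minimal loop-nest sketch of a convolutional layer in Python. The dimension names follow Figure 1; the nest ordering and sizes are our own illustration, not a mapping prescribed by the article.

```python
import numpy as np

# Minimal sketch of a convolutional layer as a MAC loop nest.
# Dimensions follow Figure 1: N ifmaps, M filters, C channels,
# R-by-R filters, H-by-H ifmaps, E-by-E ofmaps (E = H - R + 1, stride 1).
N, M, C, R, H = 2, 4, 3, 3, 8
E = H - R + 1

ifmaps  = np.random.rand(N, C, H, H)   # input feature maps
filters = np.random.rand(M, C, R, R)   # filter weights
ofmaps  = np.zeros((N, M, E, E))       # accumulated output feature maps

for n in range(N):                     # across ifmaps: weights reused (filter reuse)
    for m in range(M):                 # across filters: pixels reused (ifmap reuse)
        for y in range(E):
            for x in range(E):         # sliding window: convolutional reuse
                for c in range(C):
                    for i in range(R):
                        for j in range(R):
                            # one MAC: reads an ifmap pixel, a weight, and a
                            # psum; writes the updated psum back
                            ofmaps[n, m, y, x] += (ifmaps[n, c, y + i, x + j]
                                                   * filters[m, c, i, j])
```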
Data movement can be optimized by exploiting data reuse in a multilevel storage hierarchy. By maximizing the reuse of data in the lower-energy-cost storage levels (such as local scratchpads), and thus reducing data accesses to the higher-energy-cost levels (such as DRAM), the overall data movement energy consumption is minimized.

In fact, DNNs present many data reuse opportunities. First, there are three types of input data reuse: filter reuse, wherein each filter weight is reused across multiple ifmaps; ifmap reuse, wherein each ifmap pixel is reused across multiple filters; and convolutional reuse, wherein both ifmap pixels and filter weights are reused due to the sliding-window processing in convolutions. Second, the intermediate psums are reused through the accumulation of ofmaps. If not accumulated and reduced as soon as possible, the psums can pose additional storage pressure.

A design can exploit these data reuse opportunities by finding the optimal MAC operation mapping, which determines both the temporal and spatial scheduling of the MACs on a highly parallel architecture. Ideally, data in the lower-cost storage levels is reused by as many MACs as possible before replacement. However, due to the limited amount of local storage, input data reuse (ifmaps and filters) and psum reuse cannot be fully exploited simultaneously. For example, reusing the same input data for multiple MACs generates psums that cannot be accumulated together and, as a result, consume extra storage space. Therefore, the system energy efficiency is maximized only when the mapping balances all types of data reuse in a multilevel storage hierarchy.

The search for the mapping that maximizes system energy efficiency thus becomes an optimization process. This optimization must consider the following factors: the data reuse opportunities available for a given DNN shape and size (for example, the number of filters, number of channels, size of filters, and feature map size), the energy cost of data access at each level of the storage hierarchy, and the available processing parallelism and storage capacity. The first factor is a function of the workload, whereas the second and third factors are a function of the specific accelerator implementation.

Because of implementation tradeoffs, previous proposals for DNN accelerators have made choices on the subset of mappings that can be supported. Therefore, for a specific DNN accelerator design, the optimal mapping can be selected only from the subset of supported mappings instead of the entire mapping space. The subset of supported mappings is usually determined by a set of mapping rules, which also characterizes the hardware implementation. Such a set of mapping rules defines a dataflow.

Because state-of-the-art DNNs come in a wide range of shapes and sizes, the corresponding optimal mappings also vary. The question is, can we find a dataflow that accommodates the mappings that optimize data movement for various DNN shapes and sizes? In this article, we explore different DNN dataflows to answer this question in the context of a spatial architecture.2 In particular, we will present the following key contributions:3

- An analogy between DNN accelerators and general-purpose processors that clearly identifies the distinct aspects of operation of a DNN accelerator, which provides insights into opportunities for innovation.
- A framework that quantitatively evaluates the energy consumption of different mappings for different DNN shapes and sizes, which is an essential tool for finding the optimal mapping.
- A taxonomy that classifies existing dataflows from previous DNN accelerator projects, which helps to understand a large body of work despite differences in the lower-level details.
- A new dataflow, called Row-Stationary (RS), which is the first dataflow to optimize data movement for superior system energy efficiency. It has also been verified in a fabricated DNN accelerator chip, Eyeriss.4
We evaluate the energy efficiency of the RS dataflow and compare it to other dataflows from the taxonomy. The comparison uses a popular state-of-the-art DNN model, AlexNet,5 with a fixed amount of hardware resources. Simulation results show that the RS dataflow is 1.4 to 2.5 times more energy efficient than other dataflows in the convolutional layers. It is also at least 1.3 times more energy efficient in the fully connected layers for batch sizes of at least 16. These results will provide guidance for future DNN accelerator designs.

An Analogy to General-Purpose Processors

Figure 2 shows an analogy between the operation of DNN accelerators and general-purpose processors. In conventional computer systems, the compiler translates a program into machine-readable binary codes for execution; in the processing of DNNs, the mapper translates the DNN shape and size into a hardware-compatible mapping for execution. While the compiler usually optimizes for performance, the mapper especially optimizes for energy efficiency.

Figure 2. An analogy between the operation of DNN accelerators and that of general-purpose processors: the DNN shape and size correspond to the program, the dataflow to the architecture, the mapper to the compiler, the implementation details to the microarchitecture, and the mapping to the binary.

The dataflow is a key attribute of a DNN accelerator and is analogous to one of the parts of a general-purpose processor's architecture. Similar to the role of an ISA or memory consistency model, the dataflow defines the mapping rules that the mapper must follow in order to generate hardware-compatible mappings. Later in this article, we will introduce several previously proposed dataflows.

Other attributes of a DNN accelerator, such as the storage organization, are also analogous to parts of the general-purpose processor architecture, such as scratchpads or virtual memory. We consider these attributes part of the architecture, instead of the microarchitecture, because they may largely remain invariant across implementations, although, similar to GPUs, the distinction between architecture and microarchitecture is likely to blur for DNN accelerators.

Implementation details, such as those that determine the access energy cost at each level of the storage hierarchy and the latency between processing elements (PEs), are analogous to the microarchitecture of processors, because a mapping will remain valid despite changes in these characteristics. However, they play a vital part in determining a mapping's energy efficiency.

The mapper's goal is to search in the mapping space for the mapping that best optimizes data movement. The size of the entire mapping space is determined by the total number of MACs, which can be calculated from the DNN shape and size. However, only a subset of the space is valid given the mapping rules defined by a dataflow. For example, the dataflow can enforce the following mapping rule: all MACs that use the same filter weight must be mapped on the same PE in the accelerator. Then, it is the mapper's job to find out the exact ordering of these MACs on each PE by evaluating and comparing the energy efficiency of different valid ordering options.

As in conventional compilers, performing evaluation is an integral part of the mapper. The evaluation process takes a certain mapping as input and gives an energy consumption estimate based on the available hardware resources (microarchitecture) and the data reuse opportunities extracted from the DNN shape and size (program). In the next section, we will introduce a framework that can perform this evaluation.
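The following minimal sketch shows how such a mapper-style search could be organized. It is our own illustration under stated assumptions; the article does not prescribe a mapper implementation, and the candidate list is assumed to contain only mappings that already satisfy the dataflow's rules.

```python
# Minimal sketch of the mapper's search loop (our own illustration; the
# article does not specify an implementation). Candidates are assumed to be
# pre-filtered by the dataflow's mapping rules.

def find_best_mapping(candidate_mappings, estimate_energy):
    """Return the candidate mapping with the lowest estimated energy.

    candidate_mappings: iterable of hardware-compatible mappings.
    estimate_energy: function from a mapping to an energy estimate.
    """
    best = min(candidate_mappings, key=estimate_energy)
    return best, estimate_energy(best)

# Toy usage: mappings abstracted as names, energies as a lookup table.
energies = {"ordering_a": 248.0, "ordering_b": 310.0}
print(find_best_mapping(energies.keys(), energies.get))  # ('ordering_a', 248.0)
```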
Evaluating Energy Consumption

Finding the optimal mapping requires evaluation of the energy consumption for various mapping options. In this article, we evaluate energy consumption based on a spatial architecture,2 because many of the previous designs can be thought of as instances of such an architecture. The spatial architecture (see Figure 3) consists of an array of PEs and a multilevel storage hierarchy. The PE array provides high parallelism for high throughput, whereas the storage hierarchy can be used to exploit data reuse in a four-level setup (in decreasing energy-cost order): DRAM, global buffer, network-on-chip (NoC, for inter-PE communication), and the register file (RF) in each PE as a local scratchpad.

Figure 3. The spatial array architecture comprises an array of processing elements (PEs) and a multilevel storage hierarchy, including the off-chip DRAM, global buffer, network-on-chip (NoC), and register file (RF) in the PE. The off-chip DRAM, global buffer, and PEs in the array can communicate with each other directly through the input and output FIFOs (the iFIFO and oFIFO). Within each PE, the PE FIFO (pFIFO) controls the traffic going in and out of the arithmetic logic unit (ALU), including from the RF or other storage levels.

In this architecture, we assume all data types can be stored and accessed at any level of the storage hierarchy. Input data for the MAC operations, that is, filter weights and ifmap pixels, are moved from the most expensive level (DRAM) to the lower-cost levels. Ultimately, they are usually delivered from the least expensive level (RF) to the arithmetic logic unit (ALU) for computation. The results from the ALU, that is, psums, generally move in the opposite direction. The orchestration of this movement is determined by the mappings for a specific DNN shape and size under the mapping rule constraints of a specific dataflow architecture.

Given a specific mapping, the system energy consumption is estimated by accounting for the number of times each data value from all data types (ifmaps, filters, psums) is reused at each level of the four-level memory hierarchy, and weighing it with the energy cost of accessing that specific storage level. Figure 4 shows the normalized energy cost of accessing data from each storage level relative to the computation of one MAC at the ALU: 1× for the RF (0.5 to 1.0 Kbytes), 2× for the NoC (1 to 2 mm), 6× for the global buffer (more than 100 Kbytes), and 200× for DRAM. We extracted these numbers from a commercial 65-nm process and used them in our final experiments.

Figure 4. Normalized energy cost of data access at each storage level relative to the computation of one MAC operation at the ALU. Numbers are extracted from a commercial 65-nm process.

Figure 5 uses a toy example to show how a mapping determines the data reuse at each storage level, and thus the energy consumption, in a three-PE setup. In this example, we make the following assumptions: each ifmap pixel is used by 24 MACs, all ifmap pixels can fit into the global buffer, and the RF of each PE can hold only one ifmap pixel at a time. The mapping first reads an ifmap pixel from DRAM to the global buffer, then from the global buffer to the RF of each PE through the NoC, and reuses it from the RF for four consecutive MACs in each PE. The mapping then switches to MACs that use other ifmap pixels, so the original one in the RF is replaced by new ones, due to limited capacity. Therefore, the original ifmap pixel must be fetched from the global buffer again when the mapping switches back to the MACs that use it. In this case, the same ifmap pixel is reused at the DRAM, global buffer, NoC, and RF 1, 2, 6, and 24 times, respectively. The corresponding normalized energy consumption of moving this ifmap pixel is obtained by weighing these counts with the normalized energy numbers in Figure 4 and adding them together (that is, 1 × 200 + 2 × 6 + 6 × 2 + 24 × 1 = 248). The same approach can be applied to the other data types.

Figure 5. Example of how a mapping determines data reuse at each storage level. This example shows the data movement of one ifmap pixel going through the storage hierarchy. Each arrow means moving data between specific levels (or to an ALU for computation).

This analysis framework can be used not only to find the optimal mapping for a specific dataflow, but also to evaluate and compare the energy consumption of different dataflows. In the next section, we will describe various existing dataflows.
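A minimal sketch of this energy estimate, using the normalized costs from Figure 4 and the access counts from the Figure 5 toy example, appears below. This is our own illustration of the framework described above, not the authors' actual evaluation tool, which handles all data types and mapping parameters.

```python
# Sketch of the data-movement energy estimate described above, using the
# normalized access costs from Figure 4 (our own illustration; the authors'
# framework is more general).

ACCESS_COST = {"DRAM": 200.0, "buffer": 6.0, "NoC": 2.0, "RF": 1.0}

def movement_energy(access_counts):
    """access_counts: times a data value is accessed at each storage level."""
    return sum(ACCESS_COST[level] * count
               for level, count in access_counts.items())

# The Figure 5 toy example: one ifmap pixel accessed 1, 2, 6, and 24 times
# at the DRAM, global buffer, NoC, and RF, respectively.
pixel = {"DRAM": 1, "buffer": 2, "NoC": 6, "RF": 24}
print(movement_energy(pixel))  # 1*200 + 2*6 + 6*2 + 24*1 = 248.0
```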
A Taxonomy of Existing DNN Dataflows

Numerous previous efforts have proposed solutions for DNN acceleration. These designs reflect a variety of trade-offs between performance and implementation complexity. Despite their differences in low-level implementation details, we find that many of them can be described as embodying a set of rules, that is, a dataflow, that defines the valid mapping space based on how they handle data. As a result, we can classify them into a taxonomy.

- The Weight-Stationary (WS) dataflow keeps filter weights stationary in each PE's RF by enforcing the following mapping rule: all MACs that use the same filter weight must be mapped on the same PE for serial processing. This maximizes the convolutional and filter reuse of weights in the RF, thus minimizing the energy consumption of accessing weights (for example, work by Srimat Chakradhar and colleagues6 and Vinayak Gokhale and colleagues7). Figure 6a shows the data movement of a common WS dataflow implementation. While each weight stays in the RF of each PE, the ifmap pixels are broadcast to all PEs, and the generated psums are then accumulated spatially across PEs.

- The Output-Stationary (OS) dataflow keeps psums stationary by accumulating them locally in the RF. The mapping rule is that all MACs that generate psums for the same ofmap pixel must be mapped on the same PE serially. This maximizes psum reuse in the RF, thus minimizing the energy consumption of psum movement (for example, work by Zidong Du and colleagues,8 Suyog Gupta and colleagues,9 and Maurice Peemen and colleagues10). The data movement of a common OS dataflow implementation is to broadcast filter weights while passing ifmap pixels spatially across the PE array (see Figure 6b).

- Unlike the previous two dataflows, which keep a certain data type stationary, the No-Local-Reuse (NLR) dataflow keeps no data stationary locally, so it can trade the RF off for a larger global buffer. This is to minimize DRAM access energy consumption by storing more data on-chip (for example, work by Tianshi Chen and colleagues11 and Chen Zhang and colleagues12).
The data movement of the NLR dataflow is to single-cast weights, Global buffer multicast ifmap pixels, and spatially P8 I8 P0 accumulate psums across the PE array (see Figure 6c). W0 P7 W1 P6 W2 P5 W3 P4 W4 P3 W5 P2 W6 P1 W7 PE The three dataflows show distinct data (a) movement patterns, which imply different Output-Stationary (OS) dataflow tradeoffs. First, as Figures 6a and 6b show, Global buffer the cost for keeping a specific data type sta- I7 W7 tionary is to move the other types of data more. Second, the timing of data accesses also matters. For example, in the WS data- P0 I6 P1 I5 P2 I4 P3 I3 P4 I2 P5 I1 P6 I0 P7 PE flow, each ifmap pixel read from the global (b) buffer is broadcast to all PEs with properly No-Local-Reuse (NLR) dataflow mapped MACs on the PE array. This is more efficient than reading the same value multiple Global buffer times from the global buffer and single-cast- W0P8I0 W1P9 W2 I1 W3 W4 I2 W5 W6P0 I3 W7P1 ing it to the PEs, which is the case for filter P6 P4 P2 weights in the NLR dataflow (see Figure 6c). PE Other dataflows can make other tradeoffs. In P7 P5 P3 the next section, we present a new dataflow (c) that takes these factors into account for opti- mizing energy efficiency. Figure 6. Dataflow taxonomy. (a) Weight Stationary. (b) Output Stationary. (c) No Local Reuse. An Energy-Efficient Dataflow The ordering of these MACs enables Although the dataflows in the taxonomy the use of a sliding window for ifmaps, describe the design of many DNN accelera- as shown in Figure 7. tors, they optimize data movement only for a specific data type (for example, WS for Convolutional and psum reuse opportu- weights) or storage level (NLR for DRAM). nities within a row primitive are fully In this section, we introduce a new dataflow, exploited in the RF, given sufficient RF stor- called Row-Stationary (RS), which aims to age capacity. optimize data movement for all data types in Even with the RS dataflow, as defined by all levels of the storage hierarchy of a spatial the row primitives, there are still a large num- architecture. ber of valid mapping choices. These mapping The RS dataflow divides the MACs into choices arise both in the spatial and temporal mapping primitives, each of which comprises assignment of primitives to PEs: a subset of MACs that run on the same PE in a fixed order. Specifically, each mapping 1. One spatial mapping option is to primitive performs a 1D row convolution, so assign primitives with data rows we call it a row primitive, and intrinsically from the same 2D plane on the PE optimizes data reuse per MAC for all data array, to lay out a 2D convolution types combined. Each row primitive is (see Figure 8). This mapping fully formed with the following rules: exploits convolutional and psum reuse opportunities across primitives The MACs for applying a row of fil- in the NoC: the same rows of filter ter weights on a row of ifmap pixels, weights and ifmap pixels are reused which generate a row of psums, must across PEs horizontally and diago- be mapped on the same PE. nally, respectively; psum rows are ...... MAY/JUNE 2017 17 ...... TOP PICKS
2. Another spatial mapping option arises when the size of the PE array is large, and the pattern shown in Figure 8 can be spatially duplicated across the PE array for various 2D convolutions. This not only increases utilization of the PEs, but also further exploits filter, ifmap, and psum reuse opportunities in the NoC.

3. One temporal mapping option arises when row primitives from different 2D planes can be concatenated or interleaved on the same PE. As Figure 9 shows, primitives with different ifmaps, filters, and channels have filter reuse, ifmap reuse, and psum reuse opportunities, respectively. By concatenating or interleaving their computation together in a PE, it virtually becomes a larger 1D row convolution, which exploits these cross-primitive data reuse opportunities in the RF.

4. Another temporal mapping choice arises when the PE array size is too small, and the originally spatially mapped row primitives must be temporally folded into multiple processing passes (that is, the computation is serialized). In this case, the data reuse opportunities that are originally spatially exploited in the NoC can be temporally exploited by the global buffer to avoid DRAM accesses, given sufficient storage capacity.

Figure 7. Each row primitive in the Row-Stationary (RS) dataflow runs a 1D row convolution on the same PE in a sliding-window processing order. (In the figure, filter row A, B, C is applied to ifmap row a, b, c, d, e to produce psum row x, y, z over nine serial MACs.)

Figure 8. Patterns of how row primitives from the same 2D plane are mapped onto the PE array in the RS dataflow.

As evident from the preceding list, the RS dataflow provides a high degree of mapping flexibility, such as using concatenation, interleaving, duplicating, and folding of the row primitives. The mapper searches for the exact amount of each technique to apply in the optimal mapping—for example, how many filters are interleaved on the same PE to exploit ifmap reuse—to minimize overall system energy consumption.

Dataflow Comparison
In this section, we quantitatively compare the energy efficiency of different DNN dataflows in a spatial architecture, including those from the taxonomy and the proposed RS dataflow. We use AlexNet5 as the benchmarking DNN because it is one of the most popular DNNs available, and it comprises five convolutional (CONV) layers and three fully connected (FC) layers with a wide variety of shapes and sizes, which can more thoroughly evaluate the optimal mappings from each dataflow.

In order to have a fair comparison, we apply the following two constraints to all dataflows. First, the size of the PE array is fixed at 256 for constant processing throughput across dataflows. Second, the total hardware area is also fixed. For example, because the NLR dataflow does not use an RF, it can allocate more area to the global buffer. The corresponding hardware resource parameters are based on the RS dataflow implementation in Eyeriss, a DNN accelerator chip fabricated in 65-nm CMOS.4 By applying these constraints, we fix the total cost of implementing the microarchitecture of each dataflow.
Figure 9. Row primitives from different 2D planes can be combined by concatenating or interleaving their computation on the same PE to further exploit data reuse at the RF level. (a) Two row primitives reuse the same filter row for different ifmap rows. (b) Two row primitives reuse the same ifmap row for different filter rows. (c) Two row primitives from different channels further accumulate psum rows.
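As a concrete illustration of the row primitive in Figure 7, here is a minimal sketch (our illustration, not the Eyeriss implementation) of a 1D row convolution executed as serial MACs with a sliding window on one PE:

```python
def row_primitive(filter_row, ifmap_row):
    """1D row convolution on one PE: slide the filter row across the ifmap
    row; each window position accumulates one psum via serial MACs."""
    f, n = len(filter_row), len(ifmap_row)
    psum_row = []
    for out in range(n - f + 1):          # sliding-window positions
        acc = 0
        for k in range(f):                # serial MACs within the PE
            acc += filter_row[k] * ifmap_row[out + k]
        psum_row.append(acc)
    return psum_row

# A 3-tap filter row over a 5-pixel ifmap row yields a 3-pixel psum row,
# matching the A,B,C * a..e -> x,y,z shape of Figure 7.
print(row_primitive([1, 2, 3], [1, 0, 2, 1, 0]))  # [7, 7, 4]
```

Concatenating or interleaving two such calls on one PE, as in Figure 9, is what turns cross-primitive filter, ifmap, or psum reuse into RF-level reuse.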
Therefore, the differences in energy efficiency are solely due to the dataflows.

Figures 10a and 10b show the comparison of energy efficiency between dataflows in the CONV layers of AlexNet with an ifmap batch size of 16. Figure 10a gives the breakdown in terms of storage levels and ALU, and Figure 10b gives the breakdown in terms of data types. First, the ALU energy consumption is only a small fraction of the total, which demonstrates the importance of optimizing data movement. Second, even though NLR has the lowest energy consumption in DRAM, its total energy consumption is still high, because most of its data accesses come from the global buffer, which are more expensive than those from the NoC or RF. Third, although the WS and OS dataflows clearly optimize the energy consumption of accessing weights and psums, respectively, they sacrifice the energy consumption of moving the other data types, and therefore do not achieve the lowest overall energy consumption. This shows that DRAM alone does not dictate energy efficiency, and optimizing the energy consumption for only a certain data type does not lead to the best system energy efficiency. Overall, the RS dataflow is 1.4 to 2.5 times more energy efficient than the other dataflows in the CONV layers of AlexNet.

Figure 11 shows the same experiment results as Figure 10b, except for the FC layers of AlexNet. Compared to the CONV layers, the FC layers have no convolutional reuse and use many more filter weights. Still, the RS dataflow is at least 1.3 times more energy efficient than the other dataflows, which shows that the capability to optimize data movement for all data types is the key to achieving the highest overall energy efficiency. Note that the FC layers account for less than 20 percent of the total energy consumption in AlexNet. In recent DNNs, the number of FC layers has also been greatly reduced, making their energy consumption even less significant.
Figure 10. Comparison of energy efficiency between different dataflows in the convolutional (CONV) layers of AlexNet:5 (a) breakdown in terms of storage levels and ALU, and (b) breakdown in terms of data types (psums, weights, and pixels). Both plots report normalized energy per MAC. OSA, OSB, and OSC are three variants of the OS dataflow that are commonly seen in different implementations.3
Figure 11. Comparison of energy efficiency between different dataflows in the fully connected (FC) layers of AlexNet, reported as normalized energy per MAC, broken down by data type (psums, weights, and pixels).

Research on architectures for DNN accelerators has become very popular for its promising performance and wide applicability. This article has demonstrated the key role of dataflows in DNN accelerator design, and it has shown how to systematically exploit all types of data reuse in a multilevel storage hierarchy to optimize energy efficiency with a new dataflow. It challenges conventional design approaches, which focus on optimizing parts of the problem, and shifts them toward a global optimization that considers all relevant metrics.

The taxonomy of dataflows lets us compare high-level design choices irrespective of low-level implementation details, and thus can be used to guide future designs. Although these dataflows are currently implemented on distinct architectures, it is also possible to come up with a union architecture that can support multiple dataflows simultaneously. The questions are how to choose a combination of dataflows that maximally benefits the search for optimal mappings, and how to support these dataflows with the minimum amount of hardware implementation overhead.
This article has also pointed out how the concept of DNN dataflows and the mapping of a DNN computation onto a dataflow can be viewed as analogous to a general-purpose processor's architecture and compilation onto that architecture. We hope this will open up space for computer architects to approach the design of DNN accelerators by applying the knowledge and techniques of a well-established research field in a more systematic manner, such as methodologies for design abstraction, modularization, and performance evaluation.

For instance, a recent research trend for DNNs is to exploit data statistics. Specifically, different proposals on quantization, pruning, and data representation have all shown promising results in improving the performance of DNNs. Therefore, it is important that new architectures also take advantage of these findings. Just as compilers for general-purpose processors can take the profile of targeted workloads to further improve the performance of the generated binary, the analogy between general-purpose processors and DNN accelerators suggests that the mapper for DNN accelerators might also take the profile of targeted DNN statistics to further optimize the generated mappings. This is an endeavor we will leave for future work.

References
1. M. Horowitz, "Computing's Energy Problem (And What We Can Do About It)," Proc. IEEE Int'l Solid-State Circuits Conf. (ISSCC 14), 2014, pp. 10–14.
2. A. Parashar et al., "Triggered Instructions: A Control Paradigm for Spatially-Programmed Architectures," Proc. 40th Ann. Int'l Symp. Computer Architecture (ISCA 13), 2013, pp. 142–153.
3. Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," Proc. ACM/IEEE 43rd Ann. Int'l Symp. Computer Architecture (ISCA 16), 2016, pp. 367–379.
4. Y.-H. Chen et al., "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," Proc. IEEE Int'l Solid-State Circuits Conf. (ISSCC 16), 2016, pp. 262–263.
5. A. Krizhevsky, I. Sutskever, and G.E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Proc. 25th Int'l Conf. Neural Information Processing Systems (NIPS 12), 2012, pp. 1097–1105.
6. S. Chakradhar et al., "A Dynamically Configurable Coprocessor for Convolutional Neural Networks," Proc. 37th Ann. Int'l Symp. Computer Architecture (ISCA 10), 2010, pp. 247–257.
7. V. Gokhale et al., "A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks," Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshops (CVPRW 14), 2014, pp. 696–701.
8. Z. Du et al., "ShiDianNao: Shifting Vision Processing Closer to the Sensor," Proc. ACM/IEEE 42nd Ann. Int'l Symp. Computer Architecture (ISCA 15), 2015, pp. 92–104.
9. S. Gupta et al., "Deep Learning with Limited Numerical Precision," Proc. 32nd Int'l Conf. Machine Learning, vol. 37, 2015, pp. 1737–1746.
10. M. Peemen et al., "Memory-Centric Accelerator Design for Convolutional Neural Networks," Proc. IEEE 31st Int'l Conf. Computer Design (ICCD 13), 2013, pp. 13–19.
11. T. Chen et al., "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning," Proc. 19th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 14), 2014, pp. 269–284.
12. C. Zhang et al., "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," Proc. ACM/SIGDA Int'l Symp. Field-Programmable Gate Arrays (FPGA 15), 2015, pp. 161–170.

Yu-Hsin Chen is a PhD student in the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology. His research interests include energy-efficient multimedia systems, deep learning architectures, and computer vision. Chen received an MS in electrical engineering and computer science from the Massachusetts Institute of Technology. He is a student member of IEEE. Contact him at [email protected].

Joel Emer is a senior distinguished research scientist at Nvidia and a professor of electrical engineering and computer science at the Massachusetts Institute of Technology. His research interests include spatial and parallel architectures, performance modeling, reliability analysis, and memory hierarchies. Emer received a PhD in electrical engineering from the University of Illinois. He is a Fellow of IEEE. Contact him at [email protected].

Vivienne Sze is an assistant professor in the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology. Her research interests include energy-aware signal processing algorithms and low-power architecture and system design for multimedia applications, such as machine learning, computer vision, and video coding. Sze received a PhD in electrical engineering from the Massachusetts Institute of Technology. She is a senior member of IEEE. Contact her at [email protected].

THE MEMRISTIVE BOLTZMANN MACHINES
The proposed memristive Boltzmann machine is a massively parallel, memory-centric hardware accelerator based on recently developed resistive RAM (RRAM) technology. The proposed accelerator exploits the electrical properties of RRAM to realize in situ, fine-grained parallel computation within memory arrays, thereby eliminating the need for exchanging data between the memory cells and computational units.
Mahdi Nazm Bojnordi, University of Utah
Engin Ipek, University of Rochester

Combinatorial optimization is a branch of discrete mathematics that is concerned with finding the optimum element of a finite or countably infinite set. An enormous number of critical problems in science and engineering can be cast within the combinatorial optimization framework, including classical problems such as traveling salesman, integer linear programming, knapsack, bin packing, and scheduling problems, as well as numerous optimization problems in machine learning and data mining. Because many of these problems are NP-hard, heuristic algorithms are commonly used to find approximate solutions for even moderately sized problem instances.

Simulated annealing is one of the most commonly used optimization algorithms. On many types of NP-hard problems, simulated annealing achieves better results than other heuristics; however, its convergence may be slow. This problem was first addressed by reformulating simulated annealing within the context of a massively parallel computational model called the Boltzmann machine.1 The Boltzmann machine is amenable to a massively parallel implementation in either software or hardware.2 With the growing interest in deep learning models that rely on Boltzmann machines for training (such as deep belief networks), the importance of high-performance Boltzmann machine implementations is increasing. Regrettably, the required all-to-all communication among the processing units limits these recent efforts' performance.

The memristive Boltzmann machine is a massively parallel, memory-centric hardware accelerator for the Boltzmann machine based on recently developed resistive RAM (RRAM) technology. RRAM is a memristive, nonvolatile memory technology that provides Flash-like density and DRAM-like read speed. The accelerator exploits the electrical properties of the bitlines and wordlines in a conventional single-level cell (SLC) RRAM array to realize in situ, fine-grained parallel computation, which eliminates the need for exchanging data among the memory arrays and computational units. The proposed hardware platform connects to a general-purpose system via the DDRx interface and can be selectively integrated with systems that run optimization workloads.
Computation within Memristive Arrays
The key idea behind the proposed memory-centric accelerator is to exploit the electrical properties of the storage cells and the interconnections among those cells to compute the dot product—the fundamental building block of the Boltzmann machine—in situ within the memory arrays. This novel capability of the proposed memristive arrays eliminates unnecessary latency, bandwidth, and energy overheads associated with streaming the data out of the memory arrays during computation.

Figure 1. Mapping a Max-Cut problem to the Boltzmann machine model. An example five-vertex undirected graph is mapped and partitioned using a five-node Boltzmann machine. In the figure, the resulting partition has cut cost 19 and network energy −19.
The Boltzmann Machine
The Boltzmann machine, proposed by Geoffrey Hinton and colleagues in 1983,2 is a well-known example of a stochastic neural network that can learn internal representations and solve combinatorial optimization problems. The Boltzmann machine is a fully connected network comprising two-state units. It employs simulated annealing for transitioning between the possible network states. The units flip their states on the basis of the current state of their neighbors and the corresponding edge weights to maximize a global consensus function, which is equivalent to minimizing the network energy.

Many combinatorial optimization problems, as well as machine learning tasks, can be mapped directly onto a Boltzmann machine by choosing the appropriate edge weights and the initial state of the units within the network. As a result of this mapping, each possible state of the network represents a candidate solution to the optimization problem, and minimizing the network energy becomes equivalent to solving the optimization problem. The energy minimization process is typically performed either by adjusting the edge weights (learning) or recomputing the unit states (searching and classifying). This process is repeated until convergence is reached.

The solution to an optimization problem can be found by reading—and appropriately interpreting—the network's final state. For example, Figure 1 depicts the mapping from an example graph with five vertices to a Boltzmann machine with five nodes. The Boltzmann machine is used to solve a Max-Cut problem. Given an undirected graph G with N nodes whose connection weights (dij) are represented by a symmetric weight matrix, the maximum cut problem is to find a subset S ⊆ {1, …, N} of the nodes that maximizes Σi,j dij, in which i ∈ S and j ∉ S. To solve the problem on a Boltzmann machine, a one-to-one mapping is established between the graph G and a Boltzmann machine with N processing units. The Boltzmann machine is configured as wjj = Σi dji and wji = −2dji. When the machine reaches its lowest energy (E(x) = −19), the state variables represent the optimum solution, in which a value of 1 at unit i indicates that the corresponding graphical node belongs to S.
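To make the mapping rule concrete, the following sketch (ours; the graph below is hypothetical, not Figure 1's) builds the Boltzmann weights from an edge-weight matrix and checks that, under the usual energy convention E(x) = −(Σj wjj xj + Σi<j wij xi xj), minimizing the network energy is equivalent to maximizing the cut, consistent with Figure 1's cut cost of 19 and energy of −19:

```python
import itertools

def maxcut_to_boltzmann(d):
    """Map a symmetric edge-weight matrix d (zero diagonal) to Boltzmann
    weights per the article's rule: w_jj = sum_i d_ji and w_ji = -2*d_ji."""
    n = len(d)
    w = [[-2 * d[j][i] for i in range(n)] for j in range(n)]
    for j in range(n):
        w[j][j] = sum(d[j])
    return w

def energy(w, x):
    # Assumed convention: E(x) = -(sum_j w_jj x_j + sum_{i<j} w_ij x_i x_j).
    n = len(x)
    e = -sum(w[j][j] * x[j] for j in range(n))
    e -= sum(w[j][i] * x[j] * x[i] for j in range(n) for i in range(j))
    return e

def cut_value(d, x):
    n = len(x)
    return sum(d[i][j] for i in range(n) for j in range(i) if x[i] != x[j])

# Hypothetical 4-node graph: exhaustively verify E(x) == -cut(x), so the
# lowest-energy state encodes the maximum cut.
d = [[0, 3, 1, 0], [3, 0, 2, 4], [1, 2, 0, 5], [0, 4, 5, 0]]
w = maxcut_to_boltzmann(d)
for x in itertools.product([0, 1], repeat=4):
    assert energy(w, x) == -cut_value(d, x)
```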
In Situ Computation
The critical computation that the Boltzmann machine performs consists of multiplying a weight matrix W by a state vector x. Every entry of the symmetric matrix W (wji) records the weight between two units (j and i); every entry of the vector x (xi) stores the state of a single unit (i). Figure 2 depicts the fundamental concept behind the design of the memristive Boltzmann machine.

Figure 2. The key concept of in situ computation within memristive arrays. Current summation within every bitline is used to compute the result of a dot product: Ij = Σi Iji = Σi xj xi wji.
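Behaviorally, the bitline computation of Figure 2 reduces to a few lines. The sketch below is our model, not the authors' code:

```python
def bitline_currents(w, x):
    """I_j = x_j * sum_i(x_i * w_ji): bitline j's aggregate current is the
    sum of the currents of its cells whose wordlines (x_i) are ON, and it is
    gated off entirely when unit j itself is OFF."""
    n = len(x)
    return [x[j] * sum(w[j][i] * x[i] for i in range(n)) for j in range(n)]

# Each entry of W.x is thus read out as one analog current measurement; a
# reduction network then merges partial results from columns split across arrays.
```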
System Overview
As Figure 3 shows, the proposed accelerator resides on the memory bus and connects to a general-purpose host via the DDRx interface. The memristive data arrays store the weights (wji) and the state variables (xi); it is possible to compute the product of weights and state variables in situ within the data arrays. The interconnection network permits the accelerator to retrieve and sum these partial products to compute the final result.

Figure 3. System overview. The proposed accelerator can be selectively integrated in general-purpose computer systems.

Fundamental Building Blocks
The fundamental building blocks of the proposed memristive Boltzmann machine are storage elements, a current summation circuit, a reduction unit, and a consensus unit. The design of these hardware primitives must strike a careful balance among multiple goals: high memory density, low energy consumption, and in situ, fine-grained parallel computation.
Storage Elements
As Figure 4 shows, the proposed accelerator employs the conventional one-transistor, one-memristor (1T-1R) array to store the connection weights (the matrix W). The relevant state variables (the vector x) are kept close to the data arrays holding the weights. The memristive 1T-1R array is used for both storing the weights and computing the dot product between these weights and the state variables.

Figure 4. The proposed array structure. The conventional one-transistor, one-memristor (1T-1R) array structure is employed to build the proposed accelerator.

Current Summation Circuit
The result of a dot product computation is obtained by measuring the aggregate current pulled by the memory cells connected to a common bitline. Computing the sum of the bit products requires measuring the total amount of current per column and merging the partial results into a single sum of products. This is accomplished by local column sense amplifiers and a bit summation tree at the periphery of the data arrays.

Reduction Unit
To enable the processing of large matrices using multiple data arrays, an efficient data reduction unit is employed. The reduction units are used to build a reduction network, which sums the partial results as they are transferred from the data arrays to the controller. Large matrix columns are partitioned and stored in multiple data arrays, in which the partial sums are individually computed. The reduction network merges the partial results into a single sum. Multiple such networks are used to process the weight columns in parallel. The reduction tree comprises a hierarchy of bit-serial adders to strike a balance between throughput and area efficiency.

Figure 5 shows the proposed reduction mechanism. The column is partitioned into four segments, each of which is processed separately to produce a total of four partial results. The partial results are collected by a reduction network comprising three bimodal reduction elements. Each element is configured using a local latch that operates in one of two modes: forwarding or reduction.

Figure 5. The proposed reduction element. The reduction element can operate in forwarding or reduction mode.
Each reduction unit employs a full adder to compute the sum of the two inputs when operating in the reduction mode. In the forwarding mode, the unit is used for transferring the content of one input upstream to the root.
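A behavioral sketch of the bimodal reduction network (ours; the real hardware merges results bit-serially) might look as follows:

```python
def reduce_level(partials, modes):
    """One level of bimodal reduction elements: each element either adds its
    two inputs (reduction mode) or forwards one input upstream (forwarding)."""
    out = []
    for k in range(0, len(partials), 2):
        a, b = partials[k], partials[k + 1]
        out.append(a + b if modes[k // 2] == "reduce" else a)
    return out

# Four per-segment partial sums merged pairwise by three elements into the
# final column sum, as in Figure 5's four-segment example.
level1 = reduce_level([3, 5, 2, 7], ["reduce", "reduce"])  # [8, 9]
total  = reduce_level(level1, ["reduce"])                  # [17]
```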
Consensus Unit
The Boltzmann machine relies on a sigmoidal activation function, which plays a key role in both the model's optimization and machine learning applications. A precise implementation of the sigmoid function, however, would introduce unnecessary energy and performance overheads. The proposed memristive accelerator employs an approximation unit using logic gates and lookup tables to implement the consensus function. As Figure 6 shows, the table contains 64 precomputed sample points of the sigmoid function f(x) = 1/(1 + e^−x), in which x varies between −4 and 4. The samples are evenly distributed on the x-axis. Six bits of a given fixed-point value are used to index the lookup table and retrieve a sample value. The most significant bits of the input data are ANDed and NORed to decide whether the input value is outside the domain [−4, 4]; if so, the sign bit is extended to implement f(x) = 0 or f(x) = 1; otherwise, the retrieved sample is chosen as the outcome.

Figure 6. The proposed unit for the activation function. A 64-entry lookup table is used for approximating the sigmoid function; the input is an energy difference, and the output is an accept/reject probability compared against a pseudorandom value.
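The consensus unit's behavior can be sketched as follows (our approximation; the chip's exact sample placement and indexing may differ):

```python
import math

# 64 sigmoid samples evenly spaced on [-4, 4], as the text describes.
TABLE = [1.0 / (1.0 + math.exp(-(-4.0 + 8.0 * k / 63))) for k in range(64)]

def consensus(x):
    """Approximate f(x) = 1/(1 + e^-x) with a 6-bit table lookup, saturating
    outside [-4, 4] as the hardware's AND/NOR range check does."""
    if x <= -4.0:
        return 0.0                      # sign extension: f(x) = 0
    if x >= 4.0:
        return 1.0                      # sign extension: f(x) = 1
    index = int((x + 4.0) / 8.0 * 63)   # 6-bit index into the table
    return TABLE[index]
```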
System Architecture
The proposed architecture for the memristive Boltzmann machine comprises multiple banks and a controller (see Figure 7). The banks operate independently and serve memory and computation requests in parallel. For example, column 0 can be multiplied by the vector x at bank 0 while any location of bank 1 is being read. Within each bank, a set of sub-banks is connected to a shared interconnection tree. The bank interconnect is equipped with reduction units that contribute to the dot product computation. In the reduction mode, all sub-banks actively produce the partial results, while the reduction tree selectively merges the results from a subset of the sub-banks. This capability is useful for computing the large matrix columns partitioned across multiple sub-banks. Each sub-bank comprises multiple mats, each of which is composed of a controller and multiple data arrays. The sub-bank tree transfers the data bits between the mats and the bank tree in a bit-parallel fashion, thereby increasing the parallelism.

Figure 7. Hierarchical organization of a chip. A chip controller is employed to manage the multiple independent banks.

Data Organization
To amortize the peripheral circuitry's cost, the data array's columns and rows are time-shared. Each sense amplifier is shared by four bitlines. The array is vertically partitioned along the bitlines into 16 stripes, multiples of which can be enabled per array computation. This allows the software to keep a balance between the accuracy of the computation and the performance for a given application by quantizing more bit products into a fixed number of bits.

On-Chip Control
The proposed hardware can accelerate optimization and deep learning tasks by appropriately configuring the on-chip controller. The controller configures the reduction trees, maps the data to the internal resources, orchestrates the data movement among the banks, performs annealing or training tasks, and interfaces to the external bus.

DIMM Organization
To solve large-scale optimization and machine learning problems whose state spaces do not fit within a single chip, we can interconnect multiple accelerators on a DIMM. Each DIMM is equipped with control registers, data buffers, and a controller. This controller receives DDRx commands, data, and address bits from the external interface and orchestrates computation among all of the chips on the DIMM.

Software Support
To make the proposed accelerator visible to software, we memory-map its address range to a portion of the physical address space. A small fraction of the address space within every chip is mapped to an internal RAM array and is used to implement the data buffers and configuration parameters.
Software configures the on-chip data layout and initiates the optimization by writing to a memory-mapped control register.

Evaluation Highlights
We modify the SESC simulator3 to model a baseline eight-core out-of-order processor. The memristive Boltzmann machine is interfaced to a single-core system via a single DDR3-1600 channel. We also develop an RRAM-based processing-in-memory (PIM) baseline, in which the weights are stored within data arrays that are equipped with integer and binary multipliers to perform the dot products. The proposed consensus units, optimization and training controllers, and mapping algorithms are employed to accelerate the annealing and training processes. When compared to existing computer systems and GPU-based accelerators, the PIM baseline can achieve significantly higher performance and energy efficiency because it eliminates the unnecessary data movement on the memory bus, exploits data parallelism throughout the chip, and transfers the data across the chip using energy-efficient reduction trees. The PIM baseline is sized to occupy the same area as the memristive accelerator.

Area, Delay, and Power Breakdown
We model the data array, sensing circuits, drivers, local array controller, and interconnect elements using Spice predictive technology models4 of n-channel and p-channel metal-oxide semiconductor transistors at 22 nm. The full adders, latches, and control logic are synthesized using FreePDK5 at 45 nm. We first scale the results to 22 nm using scaling parameters reported in prior work,6 and then scale them using the fan-out-of-4 (FO4) parameters for International Technology Roadmap for Semiconductors low-standby-power (LSTP) devices to model the impact of using a memory process on peripheral and global circuitry.7,8 We use McPAT9 to estimate the processor power.

Figure 8 shows a breakdown of the computational energy, leakage power, computational latency, and die area among different hardware components. The sense amplifiers and interconnects are the major contributors to the dynamic energy (41 and 36 percent, respectively). The leakage is caused mainly by the current summation circuits (40 percent) and other logic (59 percent), which includes the charge pumps, write drivers, and controllers. The computation latency, however, is due mainly to the interconnects (49 percent) and the wordlines and bitlines (32 percent).

Figure 8. Area, delay, and power breakdown. Peak energy (8.6 nJ), leakage power (405 mW), computational latency (6.59 ns), and die area (25.67 mm²) are estimated at the 22-nm technology node, broken down among data arrays, sense amplifiers, interconnects, and other components.

Notably, only a fraction of the memory arrays must be active during a computational operation. A subset of the mats within each bank performs current sensing of the bitlines; the partial results are then serially streamed to the controller on the interconnect wires. The experiments indicate that a fully utilized accelerator integrated circuit (IC) consumes 1.3 W, which is below the peak power rating of a standard DDR3 chip (1.4 W).

Performance
Figure 9 shows the performance of the proposed accelerator, the PIM architecture, the multicore system running the multithreaded kernel, and the single-core system running the semidefinite programming (SDP) and MaxWalkSAT kernels. The results are normalized to the single-threaded kernel running on a single core. The results indicate that the single-threaded kernel (Boltzmann machine) is faster than the baselines (SDP and MaxWalkSAT heuristics) by an average of 38 percent. The average performance gain for the multithreaded kernel is limited to 6 percent, owing to significant state update overheads. PIM outperforms the single-threaded kernel by 9.31 times. The memristive accelerator outperforms all of the baselines (57.75 times speedup over the single-threaded kernel and 6.19 times over PIM). Moreover, the proposed accelerator performs the deep learning tasks 68.79 times faster than the single-threaded kernel and 6.89 times faster than PIM.

Figure 9. Performance on optimization. Speedup of various system configurations over the single-threaded kernel on the MS-1 through MS-10 and MC-1 through MC-10 benchmarks, with geometric mean.

Energy
Figure 10 shows the energy savings as compared to PIM, the multithreaded kernel, SDP, and MaxWalkSAT. On average, energy is reduced by 25 times as compared to the single-threaded kernel implementation, which is 5.2 times better than PIM.
For the deep learning tasks, the system energy is improved by 63 times, which is 5.3 times better than the energy consumption of PIM.

Figure 10. Energy savings on optimization. Energy savings of various system configurations over the single-threaded kernel.

Sensitivity to Process Variations
Memristor parameters can deviate from their nominal values, owing to process variations caused by line edge roughness, oxide thickness fluctuation, and random discrete doping. These parameter deviations result in cycle-to-cycle and device-to-device variabilities. We evaluate the impact of cycle-to-cycle variation on the computation's outcome by considering a bit error rate of 10⁻⁵ in all of the simulations, along the lines of the analysis provided in prior work.10 The proposed accelerator successfully tolerates such errors, with less than a 1-percent change in the outcome as compared to a perfect software implementation.

The resistance of RRAM cells can also fluctuate because of device-to-device variation, which can impact the outcome of a column summation—that is, a partial dot product.
We use the geometric model of memristance variation proposed by Miao Hu and colleagues11 to conduct Monte Carlo simulations for 1 million columns, each comprising 32 cells. The experiment yields two distributions for low-resistance (RLO) and high-resistance (RHI) samples that are then approximated by normal distributions with respective standard deviations of 2.16 and 2.94 percent (similar to the prior work by Hu and colleagues). We then find a bit pattern that results in the largest summation error for each column. We observe up to 2.6 × 10⁻⁶ deviation in the column conductance, which can result in up to 1 bit error per summation. Subsequent simulation results indicate that the accelerator can tolerate this error, with less than a 2-percent change in the outcome quality.

Finite Switching Endurance
RRAM cells exhibit finite switching endurance, ranging from 10⁶ to 10¹² writes. We evaluate the impact of finite endurance on an accelerator module's lifetime. Because wear is induced only by updating the weights stored in memristors, we track the number of times each weight is written. The edge weights are written once in optimization problems and multiple times in deep learning workloads. (Updating the state variables, stored in static CMOS latches, does not induce wear on RRAM.) We track the total number of updates per second to estimate the lifetime of an eight-chip DIMM. Assuming endurance parameters of 10⁶ and 10⁸ writes, the respective module lifetimes are 3.7 and 376 years for optimization and 1.5 and 151 years for deep learning.
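Returning to the device-to-device analysis above, a rough Monte Carlo sketch of that experiment might look as follows (our reconstruction, not the authors' simulator; the nominal conductances are hypothetical, while the 2.16 and 2.94 percent sigmas and the 32-cell columns are from the text):

```python
import random

CELLS_PER_COLUMN, TRIALS = 32, 10_000   # the article samples 1 million columns
G_LO, G_HI = 1.0, 0.01                  # hypothetical nominal conductances

def column_error(bits):
    """Deviation of one noisy bitline sum from its nominal value, with each
    cell's conductance drawn from a normal distribution around nominal."""
    nominal = sum(G_LO if b else G_HI for b in bits)
    noisy = sum(random.gauss(G_LO, 0.0216 * G_LO) if b
                else random.gauss(G_HI, 0.0294 * G_HI) for b in bits)
    return abs(noisy - nominal)

worst = max(column_error([random.getrandbits(1)
                          for _ in range(CELLS_PER_COLUMN)])
            for _ in range(TRIALS))
```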
Data movement between memory cells and processor cores is the primary contributor to power dissipation in computer systems. A recent report by the US Department of Energy identifies the power consumed in moving data between the memory and the processor as one of the 10 most significant challenges in the exascale computing era.12 The same report indicates that by 2020, the energy cost of moving data across the memory hierarchy will be orders of magnitude higher than the cost of performing a double-precision floating-point operation.

Emerging large-scale applications such as combinatorial optimization and deep learning tasks are even more influenced by memory bandwidth and power problems. In these applications, massive datasets have to be iteratively accessed by the processor cores to achieve a desirable output quality, which consumes excessive memory bandwidth and system energy. To address this problem, numerous software and hardware optimizations using GPUs, clusters based on the message passing interface (MPI), field-programmable gate arrays, and application-specific integrated circuits have been proposed in the literature. These proposals focus on energy-efficient computing with reduced data movement among the processor cores and memory arrays, but their performance and energy efficiency are limited by the read accesses needed to move the operands from the memory arrays to the processing units. A memory subsystem that allows for in situ computation within its data arrays could address these limitations by eliminating the need to move raw data between the memory arrays and the processor cores.

Designing a platform capable of performing in situ computation is a significant challenge. In addition to storage cells, extra circuits are required to perform analog computation within the memory cells, which decreases memory density and area efficiency. Moreover, the power dissipation and area consumption of the components required for signal conversion between the analog and digital domains could become serious limiting factors. Hence, it is critical to strike a careful balance between the accelerator's performance and complexity.

The memristive Boltzmann machine is the first memory-centric accelerator that addresses these challenges. It provides a new framework for designing memory-centric accelerators. Large-scale combinatorial optimization problems and deep learning tasks are mapped onto a memory-centric, non-Von Neumann computing substrate and solved in situ within the memory cells, with orders of magnitude greater performance and energy efficiency than contemporary supercomputers. Unlike PIM-based accelerators, the proposed accelerator enables computation within conventional data arrays to achieve the energy-efficient and massively parallel processing required for the Boltzmann machine model.

We expect the proposed memory-centric accelerator to set off a new line of research on in situ approaches to accelerate large-scale problems such as combinatorial optimization and deep learning tasks and to significantly increase the performance and energy efficiency of future computer systems.

Acknowledgments
This work was supported in part by NSF grant CCF-1533762.

References
1. E. Aarts and J. Korst, Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Neural Computing, John Wiley & Sons, 1989.
2. S.E. Fahlman, G.E. Hinton, and T.J. Sejnowski, "Massively Parallel Architectures for AI: NETL, Thistle, and Boltzmann Machines," Proc. Assoc. Advancement of AI (AAAI), 1983, pp. 109–113.
3. J. Renau et al., "SESC Simulator," Jan. 2005; http://sesc.sourceforge.net.
4. W. Zhao and Y. Cao, "New Generation of Predictive Technology Model for Sub-45nm Design Exploration," Proc. Int'l Symp. Quality Electronic Design, 2006, pp. 585–590.
5. "FreePDK 45nm Open-Access Based PDK for the 45nm Technology Node," 29 May 2014; www.eda.ncsu.edu/wiki/FreePDK.
6. M.N. Bojnordi and E. Ipek, "PARDIS: A Programmable Memory Controller for the DDRx Interfacing Standards," Proc. 39th Ann. Int'l Symp. Computer Architecture (ISCA), 2012, pp. 13–24.
7. N.K. Choudhary et al., "FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores Within a Canonical Superscalar Template," Proc. 38th Ann. Int'l Symp. Computer Architecture, 2011, pp. 11–22.
8. S. Thoziyoor et al., "A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies," Proc. 35th Int'l Symp. Computer Architecture (ISCA), 2008, pp. 51–62.
9. S. Li et al., "McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures," Proc. 36th Int'l Symp. Computer Architecture (ISCA), 2009, pp. 468–480.
10. D. Niu et al., "Impact of Process Variations on Emerging Memristor," Proc. 47th ACM/IEEE Design Automation Conf. (DAC), 2010, pp. 877–882.
11. M. Hu et al., "Geometry Variations Analysis of TiO2 Thin-Film and Spintronic Memristors," Proc. 16th Asia and South Pacific Design Automation Conf., 2011, pp. 25–30.
12. The Top Ten Exascale Research Challenges, tech. report, Advanced Scientific Computing Advisory Committee Subcommittee, Dept. of Energy, 2014.
Mahdi Nazm Bojnordi is an assistant professor in the School of Computing at the University of Utah. His research focuses on computer architecture, with an emphasis on energy-efficient computing. Nazm Bojnordi received a PhD in electrical engineering from the University of Rochester. Contact him at [email protected].

Engin Ipek is an associate professor in the Department of Electrical and Computer Engineering and the Department of Computer Science at the University of Rochester. His research interests include energy-efficient architectures, high-performance memory systems, and the application of emerging technologies to computer systems. Ipek received a PhD in electrical and computer engineering from Cornell University. He has received the 2014 IEEE Computer Society TCCA Young Computer Architect Award, two IEEE Micro Top Picks awards, and an NSF CAREER award. Contact him at [email protected].

ANALOG COMPUTING IN A MODERN CONTEXT: A LINEAR ALGEBRA ACCELERATOR CASE STUDY
This article presents a programmable analog accelerator for solving systems of linear equations. The authors compensate for commonly perceived downsides of analog computing. They compare the analog solver's performance and energy consumption against an efficient digital algorithm running on a general-purpose processor. Finally, they conclude that problem classes outside of systems of linear equations could hold more promise for analog acceleration.

Yipeng Huang, Ning Guo, Mingoo Seok, Yannis Tsividis, and Simha Sethumadhavan, Columbia University

As we approach the limits of silicon scaling, it behooves us to reexamine fundamental assumptions of modern computing, even well-served ones, to see if they are hindering performance and efficiency. The analog accelerator discussed in this article breaks two fundamental assumptions of modern computing: in contrast to using digital binary numbers, an analog accelerator encodes numbers using the full range of circuit voltage and current. Furthermore, in contrast to operating step by step on clocked hardware, an analog accelerator updates its values continuously. These different hardware assumptions can provide substantial gains but would need different abstractions and cross-layer optimizations to support various modern workloads. We draw inspiration from an immense amount of prior work in analog electronic computing (see the sidebar, "Related Work in Analog Computing").

To support modern workloads in the digital era, we observed that modern scientific computing and big data problems are converted to linear algebra problems. To maximize analog acceleration's usefulness, we explored whether analog accelerators are effective at solving systems of linear equations, the single most important numerical primitive in continuous mathematics.

For readers not familiar with linear algebra, systems of linear equations are often solved using iterative numerical linear algebra methods, which start with an initial guess for the entire solution vector and update the solution vector over iterations of the algorithm, each step further minimizing the difference between the guess and the correct solution.1
Related Work in Analog Computing

Analog computers of the mid-20th century were widely used to solve scientific computing problems, described as ordinary differential equations (ODEs). Analog computers would solve those ODEs by setting up analog electronic circuits whose time-dependent voltage and current were described by corresponding ODEs. The analog computers therefore were computational analogies of physical models.

Our group revisited this model of analog computing for solving nonlinear ODEs, which frequently appear in cyber-physical systems workloads, with higher performance and efficiency compared to digital systems.1,2 The analog, continuous-time output of analog computing is especially suited for embedded systems applications in which sensor inputs are analog and actuators can use such results directly. The question for this article is whether analog acceleration can help conventional workloads in which inputs and outputs are digital.

Modern scientific computation and big data workloads are phrased as linear algebra problems. In this article, our analog accelerator solves an ODE that does steepest descent, in turn solving a linear algebra problem. Such a solving method belongs to a broad class of ODEs that can solve other numerical problems, including nonlinear systems of equations.3,4 These ODEs point to other ways analog accelerators can support modern workloads.

We draw a distinction between our approach to analog acceleration and that of using analog circuits to build neural networks.5,6 Most importantly, we do not use training to get a network topology and weights that solve a given problem. No prior knowledge of the solution or training set of solutions is required. The analog acceleration technique presented in this article is a procedural approach to solving problems: there is a predefined way to convert a system of linear equations under study into an analog accelerator configuration.

References
1. G. Cowan, R. Melville, and Y. Tsividis, "A VLSI Analog Computer/Digital Computer Accelerator," IEEE J. Solid-State Circuits, vol. 41, no. 1, 2006, pp. 42–53.
2. N. Guo et al., "Energy-Efficient Hybrid Analog/Digital Approximate Computation in Continuous Time," IEEE J. Solid-State Circuits, vol. 51, no. 7, 2016, pp. 1514–1524.
3. M.T. Chu, "On the Continuous Realization of Iterative Processes," SIAM Rev., vol. 30, no. 3, 1988, pp. 375–387.
4. O. Bournez and M.L. Campagnolo, A Survey on Continuous Time Computations, Springer, 2008, pp. 383–423.
5. R. LiKamWa et al., "RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision," SIGARCH Computer Architecture News, vol. 44, no. 3, 2016, pp. 255–266.
6. A. Shafiee et al., "ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars," SIGARCH Computer Architecture News, vol. 44, no. 3, 2016, pp. 14–26.
Efficient iterative methods such as the conjugate gradient method are increasingly important because intermediate guess vectors are a good approximation of the correct solution.

In discrete-time iterative linear algebra algorithms, the solution vector changes in steps, and each step is characterized by a step size. The step size affects the algorithm's efficiency and requires many processor cycles to calculate. In the conjugate gradient method, for example, the step size is calculated from previous step sizes and the gradient magnitude, and this calculation takes up half of the multiplication operations in each conjugate gradient step.

In an analog accelerator, systems of linear equations can also be thought of as solved via an iterative algorithm, with the important distinction that the guess vector is updated using infinitesimally small steps, over infinitely many iterations. This continuous trajectory from the original guess vector to the correct solution is an ordinary differential equation (ODE), which states that the change in a set of variables is a function of the variables' present value. We can naturally solve ODEs using an analog accelerator.

We give an example of an analog accelerator solving an ODE that in turn solves a system of linear equations. At the analog accelerator's heart are integrators, which contain the present guess of the solution vector represented as an analog signal evolving as a function of time (see Figure 1). We perform operations on this solution vector by feeding the vector through multiplier and summation units. Digital-to-analog converters (DACs) provide constant coefficients and biases. Using these function units, we create a linear function of the solution vector, which is fed back to the inputs of the integrators. In this fully formed circuit, the solution vector's time derivative is a linear function of the solution vector itself.

The integrators are charged to an initial condition representing the iterative method's initial guess. The accelerator starts computation by releasing the integrators, allowing their outputs to deviate from their initial values. The variables contained in the integrators converge on the correct solution vector that satisfies the system of linear equations. When the analog variables are steady, we sample them using analog-to-digital converters (ADCs). These techniques were used in early analog computers and have recently been explored in small-scale experiments with analog computation.5,6
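Numerically, the circuit of Figure 1 behaves like the following sketch (ours, not the authors' code): forward-Euler integration of dx/dt = b − Ax stands in for the continuous-time integrators, and x(t) settles at the solution of Ax = b when A is positive definite:

```python
import numpy as np

def analog_solve(A, b, dt=1e-3, steps=20_000):
    """Integrate dx/dt = b - A x; the steady state satisfies A x = b."""
    x = np.zeros_like(b)           # integrator initial condition (the guess)
    for _ in range(steps):
        x += dt * (b - A @ x)      # multipliers/summers feed the integrators
    return x                       # sampled by the ADCs once steady

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite coefficients
b = np.array([5.0, 5.0])
print(analog_solve(A, b))  # ~[1., 2.]
```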
Figure 1. Schematic of an analog accelerator for solving Ax = b, a linear system of two equations with two unknown variables. Matrix A is a known matrix of coefficients realized using multipliers; x is an unknown vector contained in integrators; b is a known vector of biases generated by digital-to-analog converters (DACs). Signals are encoded as analog current and are copied using current-mirror fan-out blocks. The solver converges if matrix A is positive definite, which is usually true for the problems we discuss.

Analog Linear Algebra Advantages
Solving linear algebra problems using ODEs on an analog accelerator has several potential advantages compared to using a discrete-time algorithm on a digital general-purpose or special-purpose system.

Explicit Data-Graph Execution Architecture
The analog accelerator uses an explicit dataflow graph in which the sequence of operations on data is realized by connecting functional units end to end. During computation, analog signals representing intermediate results flow from one unit to the next, so there are no overheads in fetching and decoding instructions, and there are no accesses to digital memory. The former is a benefit of digital accelerators, too, but the latter is a unique benefit of the analog computational model.
Continuous Time Speed
The analog accelerator hardware and algorithm both operate in continuous time. The values contained in the integrators are continuously being updated, and the update rate is not limited by a finite clock frequency, which is the limiting factor in discrete-time hardware. Furthermore, a continuous-time ODE solution has no concern about the correct step size to take to update the solution vector, in contrast to discrete-time iterative algorithms, in which computing the correct step size represents most operations needed per algorithm iteration. In these regards, the analog accelerator is potentially faster than discrete-time architectures. Finally, no power-hungry clock signal is needed to synchronize operations.

Continuous Value Efficiency
The analog accelerator solves the system of linear equations using real numbers encoded in voltage and current, so each wire can represent the full range of values in the analog accelerator. In contrast, changing the value of a digital binary number affects many bits: sweeping an 8-bit unsigned integer from 0 to 255 needs 502 binary inversions, whereas a more economical Gray encoding still needs 255 inversions. Furthermore, multiplication, addition, and integration are all comparatively straightforward on analog variables compared to digital ones. This contrasts with floating-point arithmetic, in which the logarithmically encoded exponent portion of digital floating-point variables makes it complicated to add and subtract variables. In these regards, analog encoding is potentially more efficient than digital, binary encodings.
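The inversion counts are easy to verify (a quick check we added, not the authors' code):

```python
# Count bit flips when sweeping an 8-bit value 0..255 under two encodings.
def flips(encode):
    return sum(bin(encode(n) ^ encode(n + 1)).count("1") for n in range(255))

assert flips(lambda n: n) == 502             # plain binary
assert flips(lambda n: n ^ (n >> 1)) == 255  # Gray code: one flip per step
```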
Table 1. Analog accelerator instruction set architecture.

Instruction type | Instruction | Parameters | Instruction effect
Control | Initialize | — | Find input and output offset and gain calibration settings for all function units.
Configuration | Set connection | Source, destination | Set a crossbar switch to create an analog current connection between two analog interfaces.
Configuration | Set initial condition | Pointer to an integrator, initial condition value | Charge integrator capacitors to have the ODE initial condition value.
Configuration | Set multiplier gain | Pointer to a multiplier, gain value | Amplify values by a constant coefficient gain.
Configuration | Set constant offset | Pointer to a DAC, offset value | Add a constant bias to values.
Configuration | Set timeout time | Number of digital controller clock cycles | Stop analog computation after the specified time once started.
Configuration | Configuration commit | — | Finish configuration and write any new changes to chip registers.
Control | Execution start | — | Start analog computation by letting integrators deviate from initial conditions.
Control | Execution stop | — | Stop analog computation by holding integrators at their present value.
Data input | Enable analog input | Pointer to chip analog input channel | Open a chip analog input channel, allowing multiple chips to participate in computation.
Data output | Read analog value | Pointer to an ADC, memory location to store result | Read analog computation results from ADCs and store values.
Exception | Read exceptions | Memory location to store result | Read the exception vector indicating whether analog units exceeded their valid range.

ADC: analog-to-digital converter; DAC: digital-to-analog converter; ODE: ordinary differential equation.
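As an illustration of how a host might drive this ISA, the following hypothetical configuration stream sets up the two-variable solver of Figure 1 using only Table 1's instructions; the unit labels ("mul_ji", "int_j", "dac_j", "adc_j") and call names are our invention, not the chip's identifiers:

```python
A = [[3.0, 1.0], [1.0, 2.0]]
b = [5.0, 5.0]

program = [("initialize", ())]            # calibrate all function units
for j in range(2):
    for i in range(2):
        mul = f"mul_{j}{i}"
        program += [
            ("set_multiplier_gain", (mul, -A[j][i])),         # -a_ji feedback
            ("set_connection", (f"int_{i}.out", f"{mul}.in")),
            ("set_connection", (f"{mul}.out", f"int_{j}.in")),
        ]
    program += [
        ("set_constant_offset", (f"dac_{j}", b[j])),          # bias b_j
        ("set_connection", (f"dac_{j}.out", f"int_{j}.in")),
        ("set_initial_condition", (f"int_{j}", 0.0)),         # initial guess
    ]
program += [
    ("set_timeout_time", (20_000,)),
    ("configuration_commit", ()),
    ("execution_start", ()),               # release the integrators
    ("read_analog_value", ("adc_0",)),
    ("read_analog_value", ("adc_1",)),
    ("read_exceptions", ()),               # check for clipping or low range use
]
```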
Analog Accelerator Architecture
The analog accelerator acts as a peripheral to a digital host processor. The analog accelerator interface accepts an accelerator configuration, which entails the connectivity between function units, multiplier gains, DAC constants, and integrator initial conditions. Additionally, the interface allows calibration, computation control, reading of output values, and reporting of exceptions. Table 1 summarizes the analog accelerator's essential system calls and corresponding instructions.

Analog Accelerator Physical Prototype
We tested analog acceleration for linear algebra using a prototype reconfigurable analog accelerator silicon chip implemented in 65-nm CMOS technology (see Figures 2 and 3). The accelerator comprises four integrators, plus accompanying DACs, multipliers, and ADCs connected using crossbar switches. In our analog accelerator, electrical currents represent variables. Fan-out current mirrors allow the analog circuit to copy variables by replicating values onto different branches. To sum variables, currents are added together by joining branches. Eight multipliers allow variable-variable and constant-variable multiplication.

The physical prototype validates the analog circuits' functionality and allows physical measurement of component area and energy. Additionally, the chip allows rapid prototyping of accelerator algorithms.

Using physical timing, power, and area measurements recorded by Ning Guo and colleagues7 and summarized in Table 2, we built a model that predicts the properties of larger-scale analog accelerators. In Table 2, "analog core power" and "analog core area" show the power and area of each block that forms the analog signal path. The noncore transistors and nets not involved in analog computation include calibration and testing circuits and registers. The core area and power scale up and down for different analog bandwidth designs. We explore how different bandwidth choices influence analog accelerator performance and efficiency.
The core area and power scale up and down for different analog bandwidth designs. We explore how different bandwidth choices influence analog accelerator performance and efficiency.
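To show how such a model composes, the sketch below scales the per-unit Table 2 measurements by a bandwidth factor a. The linear core scaling follows the assumption stated later in the "Analog Bandwidth Model" section; holding the noncore overhead fixed is our additional simplifying assumption, and the function names are ours.

```python
# Sketch of the accelerator scaling model. Core power and area are
# assumed to grow linearly with the bandwidth factor `a` (see the
# "Analog Bandwidth Model" section); treating the noncore overhead
# (calibration, testing, registers) as fixed is our own assumption.
# Per-unit numbers are from Table 2 (power in uW, area in mm^2).
UNITS = {
    #             core_uW  total_uW  core_mm2  total_mm2
    "integrator": (22.0,    28.0,    0.016,    0.040),
    "fan-out":    (30.0,    37.0,    0.005,    0.015),
    "multiplier": (39.0,    49.0,    0.024,    0.050),
    "adc":        (27.0,    54.0,    0.049,    0.054),
    "dac":        ( 4.6,     4.6,    0.013,    0.022),
}

def unit_power_uw(name: str, a: float) -> float:
    core, total, _, _ = UNITS[name]
    return core * a + (total - core)   # core scales; noncore does not

def unit_area_mm2(name: str, a: float) -> float:
    _, _, core, total = UNITS[name]
    return core * a + (total - core)

# Example: a 320-KHz design is a = 16 times the 20-KHz prototype.
print(unit_power_uw("integrator", a=16.0))   # -> 358.0 uW
```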
Mitigation of Analog Linear Algebra Disadvantages
We encountered several drawbacks of analog computing, including limited accuracy, precision, and scalability. We tackled each of these problems in the context of solving linear algebra, although the techniques we discuss apply to other styles of analog computer architecture.
Improve Accuracy Using Calibration and Analog Exceptions
Analog circuits provide limited accuracy compared to binary ones, in which values are unambiguously interpreted as 0 or 1. Analog hardware uses the full range of values. Subtle variations in analog hardware due to process and temperature variation lead to undesirable variations in the computation result.

Figure 2. Analog accelerator architecture diagram, showing rows of analog, mixed-signal, and digital components, along with crossbar interconnect.7 "CT" refers to continuous time. Static RAMs (SRAMs) are used as lookup tables for nonlinear functions (not used for the purposes of this article).

Figure 3. Die photo of an analog accelerator chip fabricated in 65-nm CMOS technology,7 showing major components. "VGAs" are variable-gain amplifiers. The die area is 3.8 mm².

We identify three main sources of inaccuracy in analog hardware: gain error, offset error, and nonlinearity. Gain and offset errors refer to inaccurate results in multiplication and summation, which can be calibrated away using additional DACs that adjust circuit parameters to shift signals and adjust gains. These DACs are controlled by registers, whose contents are set using binary search during calibration by the digital host (sketched below). The settings vary across different copies of functional units and accelerator chips, but remain constant during accelerator operation.

Nonlinearity errors occur when changes in inputs result in disproportionate changes in outputs, and when analog values exceed the range in which the circuit's behavior is mostly linear, resulting in clipping of the output, akin to overflow of digital number representations. At the same time, the host observes if the dynamic range is not fully used, which could result in low precision. When either exception type occurs, the original problem is rescaled to fit in the dynamic range of the analog accelerator, and computation is reattempted.

The combination of widespread calibration and exception checking ensures that the analog solution's accuracy is within the sampling resolution of ADCs.
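The binary-search calibration can be sketched as follows. The helper names are hypothetical: set_offset_dac(code) would write a candidate code to a unit's calibration register, and measure_error() would return the unit's signed output error for a known test input; we also assume the error decreases monotonically as the code increases.

```python
def calibrate_offset(set_offset_dac, measure_error, bits=8):
    """Binary-search the offset-DAC code that zeroes a unit's output error.

    set_offset_dac and measure_error are hypothetical host-side helpers;
    we assume the measured error falls monotonically as the code rises.
    """
    lo, hi = 0, (1 << bits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        set_offset_dac(mid)
        if measure_error() > 0.0:
            lo = mid + 1    # still reading high; raise the code
        else:
            hi = mid        # at or below zero; try a lower code
    set_offset_dac(lo)      # the final setting stays in registers,
    return lo               # constant during accelerator operation
```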
Table 2. Summary of analog accelerator components.

Unit type | Analog core power | Total unit power | Analog core area | Total unit area
Integrator | 22 µW | 28 µW | 0.016 mm² | 0.040 mm²
Fan-out | 30 µW | 37 µW | 0.005 mm² | 0.015 mm²
Multiplier | 39 µW | 49 µW | 0.024 mm² | 0.050 mm²
ADC | 27 µW | 54 µW | 0.049 mm² | 0.054 mm²
DAC | 4.6 µW | 4.6 µW | 0.013 mm² | 0.022 mm²

Improve Sampling Precision by Focusing on Analog Steady State
High-frequency and high-precision analog-to-digital conversion is costly. So, instead of trying to capture the time-dependent analog waveform, we use the analog accelerator as a linear algebra solver by solving a convergent ODE. When the analog accelerator outputs are steady, we can sample the solutions once with higher-precision ADCs.

Even then, high-precision ADCs still fall short of the precision of floating-point numbers. Even though the analog variables are themselves highly precise, sampling the variables using ADCs can result in only 8 to 12 bits of precision. We get higher-precision results by running the analog accelerator multiple times (see the sketch after the next subsection). We use the digital host computer to find the residual error in the solution, and we set up the analog accelerator to solve a new problem, focusing on the residual. Each problem has smaller-magnitude variables than previous runs, which lets us scale up the variables to fit the dynamic range of the analog hardware. We can iterate between analog and digital hardware a few times to get a more precise result than using the analog hardware alone.

Tackle Larger Problems by Accelerating Sparse Linear Algebra Subproblems
Modern workloads routinely need thousands of variables, corresponding to as many analog integrators in the accelerator, exceeding the area constraints of realistic analog accelerators. Furthermore, the analog datapath is fixed during continuous-time operation, so there is no way to dynamically load variables from and store variables to main memory.

Analog accelerators can solve large-scale sparse linear algebra problems by accelerating the solving of smaller subproblems. This lets analog accelerators solve problems containing more variables than the number of integrators in the analog accelerator.

In such a scheme, the analog accelerator finds the correct solution for a subproblem. To get overall convergence across the entire problem, the set of subproblems would be solved several times, using an outer loop iterating across the subproblems. Typically, the larger iteration is an iterative method operating on vectors, which do not have as strong convergence properties as iterative methods do on individual numbers. Therefore, it is still desirable to ensure the block matrices captured in the analog accelerator are large, so that more of the problem is solved using the efficient lower-level solver.
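The mixed analog-digital flow of the preceding two subsections boils down to an iterative refinement loop. Here is a minimal sketch (ours), assuming a hypothetical analog_solve(A, r) wrapper that rescales the right-hand side into the hardware's dynamic range, runs one analog solve to the ADC's 8- to 12-bit precision, and undoes the scaling:

```python
import numpy as np

def refine(A, b, analog_solve, iters=3):
    """Mixed analog-digital iterative refinement (sketch).

    analog_solve(A, r) is a hypothetical wrapper around one analog run:
    it returns an 8- to 12-bit-accurate solution of A d = r, handling
    dynamic-range rescaling internally.
    """
    x = np.zeros_like(b)
    for _ in range(iters):
        r = b - A @ x           # digital host computes the residual
        d = analog_solve(A, r)  # analog accelerator solves the correction
        x = x + d               # each pass contributes more correct bits
    return x
```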
Evaluation
We compare the analog accelerator and digital approaches in terms of performance, hardware area, and energy consumption, while varying the number of problem variables and the choice of analog accelerator component bandwidth, a measure of how quickly the analog circuit responds to changes.

Analog Bandwidth Model
The prototype chip has a relatively low analog bandwidth of 20 KHz, a design that ensures that the prototype chip accurately solves for time-dependent solutions in ODEs. However, the prototype's small bandwidth makes it unrepresentative of an analog accelerator designed to solve time-independent algebraic equations, in which accuracy degradation in time-dependent behavior has no impact on the final steady-state output. We scale up the model's bandwidth, within reason, up to 1.3 MHz.

Increasing the bandwidth of the analog circuit design proportionally decreases the solution time, but also increases area and energy consumption. As Figures 4 and 5 show, we assume an analog accelerator with bandwidth multiplied by a factor of a has higher power and area consumption in the core analog circuits, by a factor of a.

Figure 4. Power versus analog accelerator size for various bandwidth choices. We observe that analog circuits operate faster when the internal node voltages representing variables change more quickly. We hold the capacitance fixed to the capacitance of the prototype's design, and use larger currents that draw more power to charge and discharge the node capacitances in the signal paths carrying variables.

Figure 5. Area versus analog accelerator size for various bandwidth choices. We observe that the transistor aspect ratio W/L must increase to increase the current, and therefore bandwidth, of the design. L is kept at a minimum dictated by the technology node, leaving bandwidth to be linearly dependent on W. Thus, we estimate area increasing linearly with bandwidth.

The projected analog power figures are significantly below the thermal design power of clocked digital designs of equal area. Even in the designs that fill a 600 mm² die size, the analog accelerator uses about 0.7 W in the base prototype design and about 1.0 W in the design with 320 KHz of bandwidth.

Sparse Linear Algebra Case Study
We use as our test case a sparse system of linear equations derived from a multigrid elliptic partial differential equation (PDE) solver. In multigrid PDE solvers, the overall PDE is converted to several linear algebra problems with varying spatial resolution. Lower-resolution subproblems are quickly solved and fed to high-resolution subproblems, aiding the high-resolution problem to converge faster. The linear algebra subproblems can be solved approximately. Overall accuracy of the solution is guaranteed by iterating the multigrid algorithm. Because perfect convergence is not required, less stable, inaccurate, and low-precision techniques, such as analog acceleration, can support multigrid.

In our case, we compare the analog accelerator designs to a conjugate gradient algorithm running on a CPU, solving to equal (relatively low) solution precision, equivalent to the precision obtained from one run of the analog accelerator equipped with high-resolution ADCs. On the digital side, the numerical iteration stops short of the machine precision provided by high-precision digital floating-point numbers.

The conjugate gradient algorithm uses a sustained 20 clock cycles per numerical iteration per row element. The comparison assumes identical transfer cost of data from main memory to the accelerator versus the CPU: the energy needed to transfer data to and from memory is not modeled, due to the relatively small problem sizes, allowing the program data to be entirely cache resident.
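For reference, the digital baseline is the textbook conjugate gradient iteration; the following NumPy version is our minimal sketch, not the evaluated implementation, with the 20-cycles-per-row-element cost noted only in a comment.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-6, max_iters=1000):
    """Textbook conjugate gradient for symmetric positive definite A.

    Our minimal sketch of the digital baseline; the evaluation models
    its cost at 20 clock cycles per numerical iteration per row element.
    """
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(max_iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)         # step size from the current gradient...
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:     # stop at the target (low) precision
            break
        p = r + (rs_new / rs) * p     # ...and from the history of steps
        rs = rs_new
    return x
```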
As Figure 6 shows, we found that an optimal analog accelerator design that balances performance and the number of integrators should have components with an analog bandwidth of approximately 320 KHz. With our bandwidth model, high-bandwidth analog computers come with high area cost, quickly reaching the area of the largest CPU or GPU dies. On performance and energy metrics, we find that, with 400 integrators operating at 320 KHz of analog bandwidth, analog acceleration can potentially have a 10 times faster solution time; using our analog bandwidth model for power, this design corresponds to 33 percent lower energy consumption compared to a digital general-purpose processor.

Figure 6. Comparison of time taken to converge to equivalent precision, for high-bandwidth analog accelerators and a digital CPU. The time needed to converge is plotted against the linear algebra problem vector size. We give the projected solution time for 80-KHz, 320-KHz, and 1.3-MHz analog accelerator designs. The high-bandwidth designs have increasing area cost. In this plot, the 320-KHz and 1.3-MHz designs hit the size of 600 mm², the size of the largest GPUs, so the projections are cut short. The convergence time for digital is the software runtime on a single CPU core.

We recognize that the performance increases and energy savings are not as drastic as one expects when using a domain-specific accelerator built on a fundamentally different computing model than digital, synchronous computing. The reason for this shortfall is twofold.

First, the high area cost of high-bandwidth analog components limits the problem sizes that can fit in the accelerator, and therefore limits the analog performance advantage.

Second, the extreme importance of linear algebra problems has also led to intense research in optimal algorithms and hardware support. Although discrete-time operation has drawbacks, it permits algorithms to intelligently select a step size, which has advantages in solving systems of linear equations. Both the analog and digital solvers perform iterative numerical algorithms, but the digital program runs the conjugate gradient method, the most efficient and sophisticated of the classical iterative algorithms. In the conjugate gradient method, each step size is chosen considering the gradient magnitude of the present point, along with the history of step sizes. With these additional calculations, the conjugate gradient method avoids taking redundant steps, accelerating toward the answer when the error is large and slowing when close to convergence.

In contrast, the analog accelerator has fewer iterative algorithms it can carry out. In using the analog accelerator for linear algebra, the design's bandwidth limits the convergence rate, so the convergence rate within a time interval cannot be arbitrarily large. Therefore, the numerical iteration in the analog accelerator is akin to fixed-step-size relaxation or steepest descent. Although we can consider the analog accelerator as doing continuous-time steepest descent, taking many infinitesimal steps in continuous time, doing many iterations of a poor algorithm is in this case no match for a better algorithm.

Efficient discrete-time algorithms such as conjugate gradient and multigrid have been known to researchers since the 1950s. Analog computers remained in use in the 1960s to solve steepest descent due to their better immediate performance relative to early digital computers.
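The steepest-descent view can be made precise. In our formulation (assuming the system matrix A is symmetric positive definite), the integrators realize the gradient-flow ODE

\[
\frac{dx}{dt} = b - Ax = -\nabla f(x),
\qquad
f(x) = \tfrac{1}{2}\, x^{\mathsf{T}} A x - b^{\mathsf{T}} x,
\]

whose steady state \(x^{*} = A^{-1} b\) solves \(Ax = b\). Each error mode decays exponentially at a rate set by an eigenvalue of \(A\) times the circuit's time constant, so the convergence achievable within a time interval is capped by the hardware bandwidth: continuous-time steepest descent, with no way to choose cleverer steps.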
Changing the basic abstractions in computer architecture could change what types of problems are solvable. Interesting physical phenomena are usually continuous-time, analog, nonlinear, and often stochastic, so the computer architectures and mathematical abstractions for simulating these processes should also be continuous-time and analog. Although analog acceleration has limited benefits for solving linear algebra, analog acceleration holds promise in problem classes such as nonlinear systems, in which digital algorithms and hardware architectures have been less successful. In this regard, this article could be the first in a line of work redefining what problems are tractable and should be pursued for analog computing. MICRO

Acknowledgments
This work is supported by NSF award CNS-1239134 and a fellowship from the Alfred P. Sloan Foundation. This article is based on our ISCA 2016 paper.8

References
1. W.H. Press et al., Numerical Recipes: The Art of Scientific Computing, 3rd ed., Cambridge Univ. Press, 2007.
2. W. Chen and L.P. McNamee, "Iterative Solution of Large-Scale Systems by Hybrid Techniques," IEEE Trans. Computers, vol. C-19, no. 10, 1970, pp. 879–889.
3. W.J. Karplus and R. Russell, "Increasing Digital Computer Efficiency with the Aid of Error-Correcting Analog Subroutines," IEEE Trans. Computers, vol. C-20, no. 8, 1971, pp. 831–837.
4. G. Korn and T. Korn, Electronic Analog and Hybrid Computers, McGraw-Hill, 1972.
5. C.C. Douglas, J. Mandel, and W.L. Miranker, "Fast Hybrid Solution of Algebraic Systems," SIAM J. Scientific and Statistical Computing, vol. 11, no. 6, 1990, pp. 1073–1086.
6. Y. Zhang and S.S. Ge, "Design and Analysis of a General Recurrent Neural Network Model for Time-Varying Matrix Inversion," IEEE Trans. Neural Networks, vol. 16, no. 6, 2005, pp. 1477–1490.
7. N. Guo et al., "Energy-Efficient Hybrid Analog/Digital Approximate Computation in Continuous Time," IEEE J. Solid-State Circuits, vol. 51, no. 7, 2016, pp. 1514–1524.
8. Y. Huang et al., "Evaluation of an Analog Accelerator for Linear Algebra," Proc. ACM/IEEE 43rd Ann. Int'l Symp. Computer Architecture (ISCA), 2016, pp. 570–582.

Yipeng Huang is a PhD candidate in the Computer Architecture and Security Technologies Lab at Columbia University. His research interests include applications of analog computing and benchmarking of robotic systems. Huang has an MPhil in computer science from Columbia University. He is a member of the IEEE Computer Society and ACM SIGARCH. Contact him at [email protected].

Ning Guo is a hardware engineer at Cognescent. His research interests include continuous-time analog/hybrid computing and energy-efficient approximate computing. Guo received a PhD in electrical engineering from Columbia University, where he performed the work for this article. Contact him at [email protected].

Mingoo Seok is an assistant professor in the Department of Electrical Engineering at Columbia University. His research interests include low-power, adaptive, and cognitive VLSI systems design. Seok received a PhD in electrical engineering from the University of Michigan, Ann Arbor. He has received an NSF CAREER award and is a member of IEEE. Contact him at [email protected].

Yannis Tsividis is the Edwin Howard Armstrong Professor of Electrical Engineering at Columbia University. His research interests include analog and hybrid analog/digital integrated circuit design for computation and signal processing. Tsividis received a PhD in electrical engineering from the University of California, Berkeley. He is a Life Fellow of IEEE. Contact him at [email protected].

Simha Sethumadhavan is an associate professor in the Department of Computer Science at Columbia University. His research interests include computer architecture and computer security. Sethumadhavan received a PhD in computer science from the University of Texas at Austin. He has received an Alfred P. Sloan fellowship and an NSF CAREER award. Contact him at [email protected].
DOMAIN SPECIALIZATION IS GENERALLY UNNECESSARY FOR ACCELERATORS
DOMAIN-SPECIFIC ACCELERATORS (DSAS), WHICH SACRIFICE PROGRAMMABILITY FOR EFFICIENCY, ARE A REACTION TO THE WANING BENEFITS OF DEVICE SCALING. THIS ARTICLE DEMONSTRATES THAT THERE ARE COMMONALITIES BETWEEN DSAS THAT CAN BE EXPLOITED WITH PROGRAMMABLE MECHANISMS. THE GOALS ARE TO CREATE A PROGRAMMABLE ARCHITECTURE THAT CAN MATCH THE BENEFITS OF A DSA AND TO CREATE A PLATFORM FOR FUTURE ACCELERATOR INVESTIGATIONS.

Tony Nowatzki
University of California, Los Angeles

Vinay Gangadhar
Karthikeyan Sankaralingam
University of Wisconsin–Madison

Greg Wright
Qualcomm Research

Performance improvements from general-purpose processors have proved elusive in recent years, leading to a surge of interest in more narrowly applicable architectures in the hope of continuing system improvements in at least some significant application domains. A popular approach so far has been building domain-specific accelerators (DSAs): hardware engines capable of performing computations in a particular domain with high performance and energy efficiency. DSAs have been developed for many domains, including machine learning, cryptography, regular expression matching and parsing, video decoding, and databases. DSAs have been shown to achieve 10 to 1,000 times performance and energy benefits over high-performance, power-hungry general-purpose processors.

For all of their efficiency benefits, DSAs sacrifice programmability, which makes them prone to obsoletion—the domains we need to specialize, as well as the best algorithms to use, are constantly evolving with scientific progress and changing user needs. Moreover, the relevant domains change between device types (server, mobile, wearable), and creating fundamentally new designs for each costs both design and validation time. More subtly, most devices run several different important workloads (such as mobile systems on chip), and therefore multiple DSAs will be required; this could mean that although each DSA is area-efficient, a combination of DSAs might not be.

Critically, the alternative to domain specialization is not necessarily standard general-purpose processors, but rather programmable and configurable architectures that employ similar microarchitectural mechanisms for specialization. The promises of such an architecture are high efficiency and the ability to be flexible across workloads. Figure 1a depicts the two specialization paradigms at a high level, leading to the central question of this article: How far can the efficiency of programmable architectures be pushed, and can they be competitive with domain-specific designs?

To this end, this article first observes that although DSAs differ greatly in their design
choices, they all employ a similar set of specialization principles:

- Matching of the hardware concurrency to the enormous parallelism typically present in accelerated algorithms.
- Problem-specific functional units (FUs) for computation.
- Explicit communication of data as opposed to implicit transfer through shared (register and memory) address spaces in a general-purpose instruction set architecture (ISA).

Figure 1. (a) The specialization alternatives: a traditional multicore; domain-specific acceleration, with separate engines for domains such as deep neural networks, neural approximation, regular expressions, AI, graph traversal, linear algebra, stencil, scan, and sort; and programmable specialization, in which one configurable fabric covers those domains.