
MANIC: A Vector-Dataflow Architecture for Ultra-Low-Power Embedded Systems

Graham Gobieski, Amolak Nagi, Nathan Serafin
[email protected], [email protected], [email protected]
Carnegie Mellon University

Mehmet Meric Isgenc, Nathan Beckmann, Brandon Lucia
[email protected], [email protected], [email protected]
Carnegie Mellon University

ABSTRACT
Ultra-low-power sensor nodes enable many new applications and are becoming increasingly pervasive and important. Energy efficiency is the key determinant of the value of these devices: battery-powered nodes want their battery to last, and nodes that harvest energy should minimize their time spent recharging. Unfortunately, current devices are energy-inefficient.

In this work, we present MANIC, a new, highly energy-efficient architecture targeting the ultra-low-power sensor domain. MANIC achieves high energy-efficiency while maintaining programmability and generality. MANIC introduces vector-dataflow execution, allowing it to exploit the dataflows in a sequence of vector instructions and amortize instruction fetch and decode over a whole vector of operations. By forwarding values from producers to consumers, MANIC avoids costly vector register file reads. By carefully scheduling code and avoiding dead register writes, MANIC avoids costly vector register file writes. Across seven benchmarks, MANIC is on average 2.8× more energy-efficient than a scalar baseline, 38.1% more energy-efficient than a vector baseline, and gets to within 26.4% of an idealized design.

CCS CONCEPTS
• Computer systems organization → Single instruction, multiple data; Data flow architectures.

KEYWORDS
Vector, dataflow, low-power, sensor, energy-harvesting

ACM Reference Format:
Graham Gobieski, Amolak Nagi, Nathan Serafin, Mehmet Meric Isgenc, Nathan Beckmann, and Brandon Lucia. 2019. MANIC: A Vector-Dataflow Architecture for Ultra-Low-Power Embedded Systems. In The 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-52), October 12–16, 2019, Columbus, OH, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3352460.3358277

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
MICRO-52, October 12–16, 2019, Columbus, OH, USA
© 2019 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-6938-1/19/10.
https://doi.org/10.1145/3352460.3358277

1 INTRODUCTION
The emergence of tiny, pervasively deployed, ultra-low-power sensor systems enables important new applications in environmental sensing, in- and on-body medical implants, civil infrastructure monitors, and even tiny chip-scale satellites. Existing systems for these applications suffer fundamental inefficiencies that demand new, extremely energy-efficient computer architectures.

Sensing workloads are increasingly sophisticated: sensor devices collect data from a deployed environment and must process raw data to support applications. Processing varies and may entail digital signal processing (DSP), statistics, sorting, or sophisticated computations such as machine learning (ML) inference using a deep neural network (DNN) or a support vector machine (SVM). As processing sophistication has increased, sensor device capability has also matured to include high-definition image sensors [61] and multi-sensor arrays [50], increasing sensed data volume.

This shift poses a challenge: how can we perform sophisticated computations on simple, ultra-low-power systems? One design is to offload work by wirelessly transmitting data to a more powerful nearby computer (e.g., at the "edge" or cloud) for processing. In offloading, the more data a sensor produces, the more data the device must communicate. Unfortunately, transmitting data takes much more energy per byte than sensing, storing, or computing on those data [32, 52]. While a high-powered device like a smartphone, with a high-bandwidth, long-range radio, can afford to offload data to the edge or cloud, this is not practical for power-, energy-, and bandwidth-limited sensor devices [26, 32].

Since offloading is infeasible, the alternative is to process data locally on the sensor node itself. For example, recent work [32] has shown how systems can use commodity off-the-shelf microcontrollers (COTS MCUs) to filter sensed data so that only meaningful data (as defined by the application) are transmitted. Processing data locally at a sensor node eliminates most of the high energy cost of communication, but makes the device highly sensitive to the energy-efficiency of computation.

There are two key criteria that make a computation-heavy sensor system effective. First, the device must process data locally at a low operating power and with extremely high energy-efficiency. Second, the device must be programmable and general to support a wide variety of applications. These goals are in tension, since programmability often carries a significant energy penalty. Our goal is to design a highly programmable architecture that hides microarchitectural complexity while eliminating the energy costs of programmability.

Existing low-power architectures fall short: Ultra-low-power COTS MCUs used in many deeply embedded sensor nodes (e.g., TI MSP430, ARM M0+ & M4+) fail to meet the criteria for an effective sensor node. These MCUs are general-purpose, programmable devices that support a variety of applications. However, COTS MCUs pay a high power, energy, and performance cost for their generality and programmability (see the COTS MCU dot in Fig. 1).

Figure 1: MANIC seeks to improve energy efficiency without compromising programmability. (The figure plots COTS MCU, Classic Vector, ELM, ASIC, MANIC, and an Ideal design on axes of energy-efficiency vs. ease of programming.)

Programmability is expensive in two main ways [7, 34, 39]. First, instruction supply consumes significant energy: in the best case, the energy of an instruction cache hit, and in the worst case, the energy of a main memory read and instruction cache fill. Lacking sophisticated microarchitectural features such as superscalar and out-of-order execution pipelines [41, 78], the energy overhead of instruction supply constitutes a significant fraction of total operating energy. Second, data supply through register file (RF) access also consumes significant energy. Together, we find that instruction and data supply consume 54.4% of the average execution energy in our workloads.

Programming pitfalls of architectural specialization: To combat the energy costs of generality, some recent work has turned to microarchitectural specialization, making a system energy-efficient at the expense of generality and programmability [13–15, 27, 51, 80]. Specialization customizes a system's control and datapath to accommodate a particular workload (e.g., deep neural networks [13, 15]), eliminating inessential inefficiencies like instruction supply and RF access. The downside of specialization is its inability to support a wide range of applications (see the ASIC dot in Fig. 1).

In contrast to specialization, another approach to programmable energy-efficiency is to target a conventional vector architecture (such as NVidia's Jetson TX2 [64], ARM NEON [2], or TI LEA [42]), amortizing the cost of instruction supply across a large number of compute operations. Unfortunately, vector architectures exacerbate the energy costs of RF access, especially in high-throughput designs with multi-ported vector register files (VRFs) [3, 49, 65], and so remain far from the energy-efficiency of fully specialized designs [34] (see the Classic Vector dot in Fig. 1).

The ELM architecture stands out among prior efforts as an architecture that targets ultra-low-power operation, operates with extremely high energy-efficiency, and retains general-purpose programmability [5, 7]. The key to ELM's efficiency is an operand forwarding network that avoids latching intermediate results and a distributed RF that provides sufficient register storage, while avoiding unfavorable RF energy scaling. Unfortunately, despite these successes, ELM faces fundamental limitations that prevent its widespread adoption. ELM makes significant changes to the architecture and microarchitecture of the system, requiring a full re-write of software to target its exotic, software-managed RF hierarchy and instruction-register design. This programming task requires expert-level assembly hand-coding, as compilers for ELM are unlikely to be simple or efficient; e.g., ELM itself cites a nearly 2× drop in performance when moving from hand-coded assembly to compiler-generated assembly [5]. While ELM supports general-purpose programs, it does so with a high programmability cost and substantial changes to software development tools (as shown in Fig. 1).

Our design and contributions: In this work we present MANIC: an efficient vector-dataflow architecture for ultra-low-power embedded systems. As depicted in Fig. 1, MANIC is closest to the Ideal design, achieving high energy-efficiency while remaining general-purpose and simple to program. MANIC is simple to program because it exposes a standard vector ISA interface based on the RISC-V vector extension [69].

MANIC achieves high energy-efficiency by eliminating the two main costs of programmability through its vector-dataflow design. First, vector execution amortizes instruction supply energy over a large number of operations. Second, MANIC addresses the high cost of VRF accesses through its dataflow component by forwarding operands directly between vector operations. MANIC transparently buffers vector outputs in a small forwarding buffer and, at instruction issue, renames vector operands to directly access the forwarding buffer, eliminating read accesses to the VRF. Additionally, MANIC extends the vector ISA with kill annotations that denote the last use of a vector register, eliminating write accesses to the VRF. The vector-dataflow architecture is efficient because MANIC amortizes the energy of tracking dataflow across many vector operations. MANIC thus eliminates a large fraction of VRF accesses (90.1% on average in our experiments) with simple microarchitectural changes that leave the basic vector architecture intact.

Finally, we have designed and implemented a code scheduling algorithm that exploits MANIC's vector-dataflow execution to minimize VRF energy, while being microarchitecturally agnostic. In other words, it is not necessary to expose the details of the microarchitecture or the size of the forwarding buffers to minimize VRF energy—a single code schedule is near-optimal across a range of microarchitectural design points.

We implement MANIC fully in RTL and use industry-grade CAD tools to evaluate its energy efficiency across a collection of programs appropriate to the deeply embedded domain. Using post-synthesis energy estimates, we show that MANIC is within 26.4% of the energy of an idealized design while remaining fully general and making few, unobtrusive changes to the ISA and software development stack.

2 RELATED WORK
High-data-rate embedded sensors demand energy-efficient computation. Battery-powered and energy-harvesting systems are primarily constrained by energy-efficiency, not performance. Despite increases in capability and efficiency, prior systems compromise either programmability, energy-efficiency, or both. Prior work on energy-efficient architecture uses datapath specialization, vector execution, or compiler support, but does not apply to deeply-embedded, ultra-low-power systems. This section discusses these prior efforts to motivate MANIC and give context for our contributions.

2.1 Ultra-low-power embedded systems
An ultra-low-power embedded device combines sensors, compute components, and radios, and many are designed to capture data and send them over a radio link to a base station for offloaded processing. The offload model requires the sensor to communicate all sensed data. Unfortunately, communication over long distances has a high energy cost [26], creating a strong incentive to process sensed data locally.

Battery-powered sensing devices: Some devices are battery-powered [22, 44, 70], which limits lifetime and device duty cycle. With a single-charge battery, energy-efficiency determines lifetime until depletion. With a rechargeable battery, the number of recharge cycles limits lifetime. Recent work [44] suggests that batteries show little promise for compute-intensive workloads on COTS MCUs. Even a simple data-logger lasts only a few years deployed, despite very sparse duty cycling and performing almost no computation. As duty cycle and computational intensity increase (e.g., for machine learning on high-rate sensor data), lifetime will drop to weeks or days, limiting device applicability.

Energy-harvesting sensing devices: Energy-harvesting systems collect operating energy from their environment and use capacitive energy storage instead of a battery, eliminating the complexity, duty-cycle, and lifetime limitations of batteries [19, 35, 36, 72, 87]. Energy-harvesting devices operate intermittently, after buffering sufficient energy, and otherwise sleep, waiting for energy. An energy-harvesting system's value is determined by the amount of work it can perform in a tiny energy budget, as once the energy is spent, the device turns off and must recharge.

Recent software and platform results [9, 10, 18, 19, 30, 32, 36, 37, 45, 53, 54, 56–58, 71, 79, 86] show that intermittently powered systems can execute sophisticated applications despite these limitations, and recent work has developed microarchitectural support for correct intermittent execution [38, 55, 59]. These prior efforts focused primarily on ensuring correct execution in intermittent systems, with few [30, 32, 54] optimizing performance and energy.

Relevance to MANIC: We observe that for deeply embedded systems and energy-harvesting systems, value is largely determined by energy efficiency. A more energy-efficient system operates more frequently and does more per operating period. Studying the frequency of power failures and the typically very low cost of JIT checkpointing [45, 58, 60], we conclude that microarchitectural support for correctness is not the most urgent research question in an intermittent architecture; architects should focus on efficiency first.

2.2 Vector architecture
MANIC's vector-dataflow design is informed by a long history of vector and dataflow architectures. Early vector machines exploited vector operations for supercomputing [21] and most commercially available architectures support vectors (e.g., AVX [28] and GPUs [20]). These vector designs target performance and operate at a power budget orders of magnitude higher than an ultra-low-power device. MANIC focuses on low-power operation, which leads to different choices than performance-optimized designs.

Vector execution is data-parallel and provides the energy-efficiency benefit of amortizing instruction supply energy (fetch, decode, and issue) across many operations. Unfortunately, a vector design requires a large vector register file (VRF), exacerbating register file access cost, especially in designs that require a VRF with many ports. Reducing VRF cost and complexity has been a primary focus of prior vector designs [3, 49].

T0 [3, 83] is a vector architecture with reconfigurable pipelines. Software controls these pipelines to chain operations, eliminating VRF access within a chain. However, microarchitectural details of the datapath are exposed to software, requiring major software changes and recompilation.

CODE [49] reduces VRF cost by distributing the VRF among heterogeneous functional units. This design is transparent to software because CODE renames operands at instruction issue to use registers near an appropriate functional unit. Distribution lets CODE reduce VRF ports, but requires a routing network to send values between functional units.

Relevance to MANIC: Like this prior work, MANIC uses vector execution to reduce instruction supply overhead, but also includes additional techniques to lower VRF cost. MANIC reduces VRF ports by focusing on efficiency, not performance: MANIC uses a single functional unit per lane, requiring a minimum of VRF ports (2 read, 1 write). VRF access remains expensive, however, requiring MANIC to avoid VRF access when possible. MANIC's vector-dataflow execution model achieves this by relaxing the ordering of a typical vector execution, similar to SIMT in GPUs. Like T0, MANIC forwards operands to eliminate VRF access, but, unlike T0, does so transparently to software. Like CODE, MANIC renames operands to hide microarchitectural complexity, but, unlike CODE, does so to eliminate VRF access. It is this combination of techniques that lets MANIC achieve ultra-low-power operation without complicating software.

2.3 Dataflow architecture
MANIC eliminates VRF accesses by forwarding operands between instructions according to dataflow. Dataflow machines have a long history [23–25, 62] that includes changes to the programming and execution model to eliminate control and data movement overheads. More recent efforts identify dataflow locality [73, 77] as a key determinant in sequential code. Ample prior work in out-of-order (OoO) execution engines (i.e., restricted dataflow) uses operator fusion to improve performance and reduce RF pressure [4, 12, 48, 74, 75].

RSVP [17] uses SIMD execution specialized for dataflow operation, focusing on performance. However, RSVP requires writing programs in a custom dataflow language and only targets streaming workloads. Dyser [33], Plasticine [68], and Stream-dataflow [63] have revived spatial dataflow, enabling programmable designs at ASIC-like efficiencies. Their key drawback is the need to compile programs directly to a spatial fabric, which precludes microarchitectural change.

ELM [5] is perhaps the most related work to MANIC. ELM is a custom microarchitecture designed for low-power, embedded operation. ELM uses restricted SIMD execution and operand forwarding to provide dataflow-like execution. ELM's complex register file hierarchy and forwarding mechanism are software-controlled, exposing microarchitectural details to the programmer and requiring expert-level, hand-coded assembly for maximum efficiency. Even with significant changes to the compiler toolchain, ELM poses a risk of unpredictable performance and high programming cost.

Relevance to MANIC: Like the work above, MANIC seeks to exploit dataflow to improve efficiency and, like RSVP and ELM, uses SIMD in order to amortize instruction supply cost. MANIC also relies upon operand forwarding, like ELM, to avoid RF reads and writes. However, unlike this prior work, MANIC seeks to hide microarchitectural complexity from the programmer. MANIC uses dataflow to inform control of a single-lane functional unit instead of opting for a spatial fabric as in Dyser, Plasticine, and Stream-dataflow. And unlike ELM, which has a significant programming cost, MANIC relies on the standard vector extension to the RISC-V ISA with only a few optional changes.

Figure 2: Different execution models: (a) Scalar, (b) Vector, (c) Vector-dataflow. Orange arrows represent control flow, blue arrows represent dataflow. MANIC relies on vector-dataflow execution, avoiding register accesses by forwarding and renaming.

3 VECTOR-DATAFLOW EXECUTION
MANIC implements the vector-dataflow execution model. There are two main goals of vector-dataflow execution (Fig. 1). The first goal is to provide general-purpose programmability. The second goal is to do this while operating efficiently by minimizing instruction and data supply overheads. Vector-dataflow achieves this through three features: (i) vector execution, (ii) dataflow instruction fusion, and (iii) register kill points.

2.4 Register file optimization
MANIC reduces vector register file energy by leveraging dataflow to avoid register file reads and writes. There are a number of prior works that also identify the register file as a performance and/or energy bottleneck.

In the GPU domain, Gebhart et al. and Jeon et al. observe that producers often have only a single consumer and consumers arrive shortly after a producer [31, 46]. Gebhart et al. introduce a RF cache and operand buffer to forward values directly between instructions, reducing RF pressure [31, 84]. RF virtualization for GPUs has also been explored as a way to improve RF utilization [46, 81, 82]. These systems remap registers by dynamically allocating and deallocating registers among running threads in order to expose additional parallelism [81, 82] or to reduce the number of physical registers while maintaining performance [46].

For CPU designs, dead-value prediction and operator fusion can be used to reduce RF reads and writes. Dead-value prediction reduces RF pressure by identifying values that do not need to be persisted to the register file because they are not read again [8, 66]. Finally, compilers can mark dead values [47, 82] so that hardware can deallocate and free the physical register earlier.

Relevance to MANIC: MANIC exploits the same pattern of register liveness (i.e., values are not alive for very long) identified in prior work. However, instead of exploiting this pattern to improve performance or RF utilization, MANIC focuses on eliminating RF accesses to save energy. MANIC remaps registers similar to RF virtualization, but operates at a much finer granularity (remapping instruction-by-instruction vs. at program phases/barriers) and remaps registers explicitly following dataflow (vs. caching them). Lastly, MANIC uses compiler-generated dead-value hints to avoid VRF writes; to the best of our knowledge, it is the first to use compiler-generated hints in a vector design. Prior work on dead-value prediction is inapplicable to ultra-low-power designs because it requires expensive speculative recovery mechanisms to handle mispredictions.

3.1 Vector execution
The first main feature of MANIC's execution model is vector execution. Vector instructions specify an operation that applies to an entire vector of input operands (as in ample prior work discussed in Sec. 2). The key advantage of vector operation for an ultra-low-power design is that control overheads imposed by each instruction — instruction cache access, fetch, decode, and issue — amortize over the many operands in the vector of inputs. Vector operation dramatically reduces the cost of instruction supply and control, which is a primary energy cost of general-purpose programmability. Vector operation is thus a key ingredient in MANIC's energy-efficiency.

Fig. 2 illustrates the difference between scalar execution and vector execution. Fig. 2a executes a sequence of instructions in a scalar fashion. Blue arrows show dataflow and orange arrows show control flow. Instructions proceed in sequence and write to and read from the register file to produce and consume outputs and operands. Fig. 2b executes the same sequence of instructions in a vector execution. The execution performs the vector instruction's operation on each element of the vector in sequence, consuming operands from and producing outputs to the register file for each operation over the entire vector. Control proceeds horizontally across each of the vector's elements for a single vector instruction before control transfers vertically to the next vector instruction. Vector execution amortizes the control overhead of a scalar execution because a single instruction corresponds to an entire vector's worth of operations.

3.2 Dataflow instruction fusion
The second main feature of MANIC's execution model is dataflow instruction fusion. Dataflow instruction fusion identifies windows of contiguous, dependent vector instructions. Dataflow instruction fusion eliminates register file reads by directly forwarding values between instructions within the window. Comparing to a typical vector machine illustrates the benefit of dataflow instruction fusion. In a typical vector machine, instructions execute independently and each operation performs two vector register file reads and one vector register file write. Accessing the vector register file has an extremely high energy cost that scales poorly with the number of access ports [7, 49]. With dataflow instruction fusion, each instruction that receives a forwarded input avoids accessing the expensive vector register file to fetch its input operands. Avoiding these reads reduces the total energy cost of executing a window of vector instructions.

Fig. 2c illustrates the difference between vector execution and vector-dataflow execution in MANIC. Vector-dataflow first identifies data dependences among a sequence of vector instructions in a fixed-size instruction window. After identifying dependences between instructions in the window, MANIC creates an efficient dataflow forwarding path between dependent instructions (using the forwarding mechanism described in Sec. 4). Fig. 2c shows a window of dependent operations made up of instructions I0, I1, and I2. Execution begins with the first vector instruction in the window (I0) and the first element of the vector (v[0]). However, unlike a typical vector execution, control transfers vertically first, next applying the second vector instruction to the first vector element. The orange arcs illustrate vertical execution of I0, then I1, then I2 on the vector inputs represented by v[0]. After vertically executing an operation for each instruction in the window for v[0], the orange arcs show that control steps horizontally, executing the same window of operations on the next element of the vector, v[1]. The blue arrows illustrate the dataflow forwarding captured by vertical execution in a window of vector-dataflow execution. The blue arrow from I0 to I2 shows that the value produced by I0 is forwarded directly to I2 without storing the intermediate result in the vector register file.

3.3 Vector register kill points
The third main feature of MANIC's execution model is its use of vector register kill points. A vector register is dead at a particular instruction if no subsequent instruction uses the value in that register. Hence, a dead value need not be written to the vector register file. The instruction at which a vector register becomes dead is the kill point for that register. Though MANIC forwards values between dependent instructions without going through the vector register file, MANIC normally must write each operand back to the vector register file because the operand may be used in a later window. However, if a program explicitly informs MANIC of each register's kill points, then MANIC can eliminate register file writes associated with those registers. We propose to tag each of an instruction's operands with an optional kill bit that indicates that the register is dead at that instruction, and its value need not be written back to the vector register file. Kill bits do not affect programmability because they are optional, a compiler analysis to identify dead registers is simple, and kill bits do not expose microarchitectural details, such as the size of MANIC's instruction window.

3.4 Applications benefit from vector-dataflow
We studied the core compute kernels in a wide variety of sensor node applications and found abundant opportunities for vector-dataflow execution. Regardless of MANIC's window size, an application has more exploitable vector dataflows if its sequences of dependent instructions tend to be shorter. The length of a dependent instruction sequence is characterized by the distance (or number of instructions) between when a register's value is produced and when that register is killed (the kill point). We deem this the kill distance. Shorter kill distances require fewer resources for forwarding in a window and make a window of any size more effective.

We statically measured the distribution of kill distances for all registers in the inner loops of three kernels. The histograms shown in Fig. 3 suggest that kill distances tend to be short and that a reasonably small (and thus implementable) window size would capture dependences for these kernels.

Figure 3: Histograms of kill distances for three different applications: (a) SpMV, (b) Sparse conv, (c) FFT. Distances skew left, suggesting values are consumed for the last time shortly after being produced.

3.5 Synchronization and memory consistency
In MANIC, the vector unit runs as a loosely coupled co-processor with the scalar core. As a result, MANIC must synchronize vector and scalar execution to ensure a consistent memory state. A typical sequentially consistent model would require frequent stalls in the scalar core to disambiguate memory and, worse, would limit the opportunity for forwarding in the vector unit. These issues could be avoided with microarchitectural speculation, including load-store disambiguation and mis-speculation recovery mechanisms, but we judge such mechanisms too expensive for ultra-low-power applications. Moreover, in practice, the scalar core and the vector unit rarely touch the same memory during compute-intensive program phases, so the mechanisms would be largely unused.

Instead, we add a new vfence instruction that handles both synchronization and memory consistency. vfence stalls the scalar core until the vector unit completes execution with its current window of vector-dataflow operations. MANIC's use of vfence operations is very similar to memory fences for concurrency in x86, ARM, and other widely commercially available processors [29]. Properly used, vfence operations cause the scalar and vector cores' executions to be sequentially consistent. In practice, this often means inserting a vfence at the end of the kernel.

As with any system relying on fences, the programmer is responsible for their correct use (i.e., avoiding data races). Relying on the programmer to avoid data races is practical since compilers struggle with alias analysis, reasonable because vfences are rare, and consistent with common practice in architectures and high-level programming languages [11, 43].

4 MANIC: ULTRA-LOW-POWER VECTOR-DATAFLOW PROCESSING
MANIC is a processor microarchitecture that implements the vector-dataflow execution model to improve energy efficiency while maintaining programmability and generality. MANIC's hardware/software interface is the most recent revision of the RISC-V ISA vector extension [69]. MANIC adds a vector unit with a single lane to a simple, in-order core. The vector unit has a few simple additions to support vector-dataflow execution: instruction windowing hardware and a renaming mechanism together implement forwarding between dependent instructions. With no modifications to the ISA, MANIC runs programs efficiently. With a minor ISA change, MANIC further improves efficiency by conveying register kill annotations; the microarchitecture uses these annotations to kill registers instead of incurring the cost of writing them to the vector register file.

Figure 4: A block diagram of MANIC's microarchitectural components (non-gray region). Control flow is denoted with orange, while blue denotes dataflow. Stateful components (e.g., register files) have dotted outlines.

4.1 Vector ISA
The software interface to MANIC's vector execution engine is the RISC-V ISA vector extension [69], and RISC-V code will run efficiently on a MANIC system with only minor modifications to add vfence instructions for synchronization and memory consistency. A programmer may further optionally recompile their code using our custom MANIC compiler to use minor ISA changes that support code scheduling and vector register kill annotations. We emphasize that these compiler-based features do not require programming changes, do not expose microarchitectural details, and are optional to the effective use of MANIC.

MANIC implements the RISC-V V vector extension. RISC-V V does not specify a fixed number of vector registers, but its register name encoding includes five bits for vector register names. We implement 16 vector registers, requiring four bits to name, and leaving a single bit in the register name unused.
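The spare bit admits a straightforward encoding for kill annotations. The sketch below is our illustration of one such layout — a 5-bit operand field with the low four bits naming one of 16 registers and the high-order bit acting as the kill flag — and is an assumption for exposition, not the paper's encoding tables.

```python
# Hypothetical kill-annotation encoding for a 5-bit vector-register
# operand field: 4 bits of register index, 1 spare bit as a kill flag.

KILL_BIT = 1 << 4  # the otherwise-unused fifth bit of the register name

def encode_vreg(idx, kill=False):
    """Encode a vector-register operand, optionally marking its last use."""
    assert 0 <= idx < 16, "16 vector registers need only a 4-bit index"
    return idx | (KILL_BIT if kill else 0)

def decode_vreg(field):
    """Return (register index, dies-at-this-instruction)."""
    return field & 0xF, bool(field & KILL_BIT)
```

Under this layout, encode_vreg(3, kill=True) yields 0b10011: the hardware decodes register v3 plus a hint that its write-back to the VRF can be elided, and an annotation-unaware implementation can simply mask off the high bit.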
We use the extra bit in a ments forwarding by renaming the instructions’ register operands to register’s name to convey kill annotations from the compiler to the refer to a free location in MANIC’s forwarding buffer, instead of to microarchitecture. If either of an instruction’s input registers has the register file. The issue logic records the renaming in MANIC’s its high-order bit set, the encoded instruction indicates to MANIC renaming table, which is a fixed-size, directly-indexed table, with that the register dies at the instruction. To support code schedul- one entry for each register that can be renamed in a window of ing, MANIC’s optional compiler support runs the dataflow code instructions. After the issue logic identifies dependent operations scheduling algorithm (described in Sec. 4.5). After scheduling, the and performs renaming for a window of operations, it dispatches the compiler analyzes definitions and uses of each register and addsa window of operations for execution. kill annotation to a killed register’s name in the instruction at which it dies. Instruction window: MANIC uses its instruction window to store an issued window of instructions that have had their register operands 4.2 Microarchitecture renamed by the issue logic. The instruction window and its associ- ated control logic (not pictured) determine what operation MANIC’s The foundation of MANIC’s microarchitecture is an extremely sim- single functional unit should execute next. The instruction window’s ple, in-order, single-issue vector core with a single execution lane control logic executes the operation represented by each instruction that is equipped with a single functional unit. MANIC adds four stored in the buffer. As Sec. 3 describes, execution proceeds first components to this base vector core to support vector-dataflow execu- vertically, through the entire window, and then horizontally through tion: (1) issue logic and a table; (2) an instruction the vector. 
A key feature of the instruction window’s control logic window buffer; (3) an xdata buffer and (4) a forwarding buffer. Fig. 4 is its ability to select an operand’s source or destination. For in- shows an overview of MANIC’s design and how these components put operands, the instruction window controls whether to fetch an relate to one another. operand from the vector register file or from MANIC’s forwarding Issue logic and register renaming: MANIC’s issue logic is pri- buffer (in the case of an operand being forwarded between instruc- marily responsible for creating a window of instructions to execute tions in the window). Likewise, for output operands, the instruction according to vector-dataflow. The issue logic activates once per win- window controls whether to write an output operand to the vector dow of instructions, identifying, preparing, and issuing for execution register file, to the forwarding buffer, or to both. a window of dependent instructions over an entire vector of inputs. A Limits to window size: There are several classes of instructions key parameter of the issue logic is the length of the instruction buffer that limit window size. These include stores, permutations, and re- window, which we explain next. The issue logic analyzes a short ductions. Permutations and reductions require interactions between sequence of instructions that has the same number of instructions as elements in a vector, which creates a horizontal dependence be- the instruction buffer can hold. The issue logic identifies dataflow tween operations on different vector elements. MANIC does not 1Specifically, RISC-V E extension, which uses 16 architectural registers. support forwarding for such operations because of the complexity MANIC: A Vector-Dataflow Architecture for Ultra-Low-Power Embedded Systems MICRO-52, October 12–16, 2019, Columbus, OH, USA of the dependence tracking that they introduce. Instead, these opera- be buffered. 
The second hazard occurs when all slots in the forward- tions execute one element at a time, ultimately writing to the vector ing buffer are allocated and an incoming instruction requires a slot. register file. Finally, the third hazard occurs when the xdata buffer is full and a A store also ends issuing for a window. A store may write to a decoded vector instruction requires a slot. The prevalence of each memory location that a later operation loads from. Such a through- hazard depends on the size of the buffers associated with each. The memory dependence is unknown until execution time. Consequently, first hazard is most common, while the other two tend to berare. MANIC conservatively assumes that the address of any store may alias with the address of any load or store in the window (i.e., in a 4.3 Memory system later vector element). A store ends the construction of a window to MANIC includes an instruction cache (icache) and a data cache avoid the need for dynamic memory disambiguation to detect and (dcache). This departs from the designs of many commercial micro- avoid the effect of such aliasing. We evaluated adding a non-aliasing controllers in the ultra-low-power computing domain, which do not store instruction that would allow MANIC to forward past stores, have dcaches and have extremely small icaches on the order of 64 but this instruction improved energy-efficiency by less than 0.5% in bytes [41]. However, we find that even small or moderately sized our applications. This is because store instructions often naturally dcaches (512B) are effective in minimizing the number of accesses close windows (e.g. a vfence follows the store to ensure correct- to main memory. We measured miss curves for the seven different ness). Thus, given the added programming complexity for minimum application we consider; for each application there is an extreme benefit, we conclude that such an instruction is unnecessary. 
drop-off in the number of misses for even small cache sizes, and Xdata buffer: Some instructions like vector loads and stores require with a 512B cache the curves are basically flat. Since the energy extra information (e.g. base address and stride) available from the of an access to main memory dwarfs an access to the dcache, the scalar register file when the instruction is decoded. Due to the loosely dcache offers a significant reduction in energy. coupled nature of MANIC, this extra information must be buffered Caching and intermittence: In the intermittent computing domain, alongside the vector instruction. Since not all vector instructions improperly managed caches may lead to memory corruption because require values from the scalar register file, MANIC includes a sep- dirty data may be lost when power fails. As such, MANIC assumes arate buffer, called the xdata buffer, to hold this extra information. a hardware-software JIT-checkpointing mechanism (like [9, 45, 58]) Entries in the instruction buffer contain indices into the xdata buffer for protecting the caches and any dirty data. Checkpointing energy as needed. During execution, MANIC uses these indices to read for cached data is virtually negligible because caches are very small information from the xdata buffer and execute accordingly. relative to the operating period. Forwarding buffer: The forwarding buffer is a small, directly- indexed buffer that stores intermediate values as MANIC’s execution 4.4 Putting it together with an example unit forwards them to dependent instructions in the instruction win- We illustrate the operation of the issue logic, renaming table, instruc- dow. The issue logic lazily allocates space in the forwarding buffer tion window, and forwarding buffer with an example of MANIC’s and renames instruction’s forwarded operands to refer to these allo- operation, shown in Fig. 5. The figure starts with vector-aware as- cated entries. 
The benefit of the forwarding buffer is that it isvery sembly code that MANIC transforms into vector-dataflow operations small and simple, which corresponds to a very low static power and by populating the renaming table and instruction buffer with infor- access energy compared to the very high static power and access mation about the dataflow. Vector assembly instructions pass into energy of the vector register file. By accessing the forwarding buffer MANIC’s microarchitectural mechanisms as they decode to the issue instead of accessing the vector register file, an instruction with one logic and later execute. or more forwarded operands consumes less energy than one that Issuing instructions and renaming operands: The figure shows a executes without MANIC. three-instruction program and illustrates how the issue logic popu- Efficient reductions: RISC-V V contains reduction instructions like lates the instruction buffer and remaps registers for each instruction. vredsum v1 v2, which adds up all elements of v2 and writes the sum • vload: The issue logic records the load in the instruction window into the first element of v1. MANIC relies on the forwarding buffer and, since the instruction is a vector load and requires a base to avoid VRF accesses for reductions. Instead of writing partial address, also inserts the base address (&a forwarded from the results to the VRF, MANIC allocates space in the forwarding buffer scalar register file) into the xdata buffer. In addition, the issue for partial accumulation. The decode logic recognizes a reduction, logic writes an empty renaming entry to v0 in the renaming table allocates space, and remaps the second source operand and the along with the index of the instruction in the instruction buffer. An destination to point to the entry in the forwarding buffer. During empty renaming entry at execution time signifies a vector register execution, MANIC will then use the partial result in the forwarding write. 
However, during issue, an empty entry may be filled by an buffer as one source for the reduction (e.g., sum) and overwrite it instruction added to the instruction window later during the same with the new value as it is produced. This optimization re-purposes issue phase. MANIC’s existing dataflow mechanisms to save an entire vector- • vmul: The multiply instruction consumes two register operands length of VRF reads and writes for reductions. that are not in the renaming table and, at execution time, will Structural hazards: There are three structural hazards that cause issue two vector register file reads. As with the load, the issue MANIC to stop buffering additional instructions, stall the scalar logic records the multiply’s output register with an empty entry core, and start vector execution. The first hazard occurs when the in the renaming table as well as the index of the multiply in the instruction buffer is full and another vector instruction is waiting to instruction buffer. MICRO-52, October 12–16, 2019, Columbus, OH, USA Gobieski, et al.

Figure 5: MANIC's issue logic constructs windows of instructions with dataflow. The rename table keeps track of registers and names, updating the instruction buffer when new opportunities for forwarding are identified. The example program is: I0: v0 = vload &a; I1: v1 = v2 × v3.k; I2: v2 = v1 + v0.k.

Figure 6: MANIC's microarchitectural components execute a window of instructions using forwarding according to dataflow across an entire vector of input. Forwarding eliminates RF reads; dead values eliminate RF writes.

• vadd: The add's inputs are v0 and v1, with the kill annotation indicating that the instruction kills register v0. The issue logic looks up each input operand in the renaming table and, finding both have valid entries, identifies this instruction as the target for forwarding. The issue logic remaps v0 to refer to the first entry of the forwarding buffer and v1 to the second position. The load instruction in the instruction buffer (found by the saved index in the renaming table) is updated and will store its result in F0 instead of v0. Similarly, the multiply instruction is also updated and will store its result in F1, but since v1 is not killed, it will still be written back to the register file. The add instruction then will fetch its input operands from F0 and F1 instead of the vector register file. The kill annotations associated with v3 and v0 follow the re-written instructions into the instruction window, enabling their use during execution to avoid register file writes.

Executing a window of instructions: After issue, the window of instructions is ready to execute. Fig. 6 shows (via the orange control-flow arcs) how MANIC executes the entire window vertically for a single vector element before moving on to execute the entire window for the second vector element. The blue dataflow arcs show how MANIC forwards values between dependent instructions using its forwarding buffer. The green squares marked with "F" names are forwarded values. The figure also shows how MANIC uses a kill annotation at runtime. The registers with kill annotations (v0 and v3) need not be written to the vector register file when the window completes executing, sparing the execution two vector register writes required by a typical vector execution.

4.5 Microarchitecture-agnostic dataflow scheduling

MANIC's final feature is microarchitecture-agnostic dataflow scheduling. This feature is optional compiler support that re-orders vector instructions to make dependent operations as close as possible to one another. If dependent operations are closer together in an instruction sequence, then it is more likely that they will appear together in one of MANIC's vector-dataflow windows. By re-ordering operations to appear close together in a window, MANIC creates more opportunities to forward values from a producer instruction to its consumer, eliminating more vector register file accesses.

MANIC's dataflow scheduler does not compromise programmability or generality. The programmer need not understand the microarchitecture to reap the benefits of the dataflow scheduler. The dataflow scheduler minimizes the forwarding distance between dependent instructions, rather than targeting a particular window size. While not always optimal for a given window size, this microarchitecture-agnostic optimization prevents the compiler from being brittle or dependent on the microarchitectural parameters of a particular system.

To minimize forwarding distance between dependent instructions, MANIC's dataflow code scheduler uses sum kill distance. A vector register's kill distance is the number of instructions between when an instruction defines the register and when the value in the register is used for the last time (i.e., the register dies). The sum kill distance is the sum of all registers' kill distances across the entire program. To remain agnostic to the window size of a particular MANIC implementation, the code scheduler minimizes the sum kill distance (which is equivalent to minimizing average kill distance). Sum kill distance is a proxy for the number of register writes in a program because if a register does not die during a window's execution, the system must write its value back to the register file. When sequences of dependent instructions are closer together, their intermediate values die more quickly, because registers need not remain live waiting for unrelated instructions to execute. A larger window accommodates dependence chains that include longer kill distances.

We implement dataflow code scheduling using brute-force (exhaustive) search for small kernels containing fewer than 12 vector operations. For larger kernels (e.g., FFT), we implement dataflow code scheduling via simulated annealing that randomly mutates instruction schedules, while preserving dependences, to produce a new valid schedule, accepting this new schedule with some probability.
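The annealing-based scheduler described above can be sketched in a few lines. This is a toy Python model, not MANIC's actual compiler: the instruction encoding (a `(dest, sources)` pair per vector instruction), the cost function, and the annealing parameters are illustrative assumptions.

```python
import math
import random

def sum_kill_distance(sched):
    """Sum over defined registers of (index of last use - index of definition)."""
    total = 0
    for i, (dest, _) in enumerate(sched):
        uses = [j for j, (_, srcs) in enumerate(sched) if dest in srcs]
        total += max(uses) - i if uses else 0
    return total

def legal(sched):
    """Every use of a register defined in the sequence must follow its definition."""
    def_pos = {dest: i for i, (dest, _) in enumerate(sched)}
    return all(def_pos.get(s, -1) <= i
               for i, (_, srcs) in enumerate(sched) for s in srcs)

def anneal_schedule(insns, iters=2000, temp=1.0, cooling=0.995, seed=0):
    """Randomly swap instructions, keep only dependence-preserving schedules,
    and accept worsening moves with a probability that shrinks as temp cools."""
    rng = random.Random(seed)
    cur, best = list(insns), list(insns)
    for _ in range(iters):
        cand = list(cur)
        i, j = rng.sample(range(len(cand)), 2)
        cand[i], cand[j] = cand[j], cand[i]
        if legal(cand):
            delta = sum_kill_distance(cand) - sum_kill_distance(cur)
            if delta <= 0 or rng.random() < math.exp(-delta / max(temp, 1e-9)):
                cur = cand
            if sum_kill_distance(cur) < sum_kill_distance(best):
                best = list(cur)
        temp *= cooling
    return best

# Two independent producer/consumer chains, interleaved: re-ordering so each
# consumer directly follows its producer halves the sum kill distance (4 -> 2).
prog = [("v1", ["a"]), ("v2", ["b"]), ("v3", ["v1", "v1"]), ("v4", ["v2", "v2"])]
```

On `prog`, any legal schedule needs a kill distance of at least 1 per chain, so 2 is optimal; the swap-based mutation finds it quickly because improving moves are always accepted.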

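To make the vertical-then-horizontal execution order of Sec. 4.4 concrete, the sketch below is a behavioral Python model (not the RTL) of executing one renamed window: names like `F0` are forwarding-buffer slots, names like `v2` are VRF registers, and the vector length, operand encoding, and access counting are simplifying assumptions.

```python
import copy
import operator

VLEN = 4  # assumed vector length for this sketch

def run_window(window, vrf):
    """window: (op, dest, srcs) triples. Executes the whole window for element
    0, then element 1, ... and returns the number of VRF accesses performed."""
    fwd, accesses = {}, 0
    for i in range(VLEN):                  # horizontal: next vector element
        for op, dest, srcs in window:      # vertical: whole window per element
            vals = []
            for s in srcs:
                if s.startswith("F"):
                    vals.append(fwd[s])    # cheap forwarding-buffer read
                else:
                    vals.append(vrf[s][i]); accesses += 1
            if dest.startswith("F"):
                fwd[dest] = op(*vals)      # killed value: never written to VRF
            else:
                vrf[dest][i] = op(*vals); accesses += 1
    return accesses

# v2 = v2*v3 + v0, once as a renamed MANIC window (intermediate lives in F1)
# and once as a plain vector baseline that round-trips v1 through the VRF.
manic    = [(operator.mul, "F1", ["v2", "v3"]), (operator.add, "v2", ["F1", "v0"])]
baseline = [(operator.mul, "v1", ["v2", "v3"]), (operator.add, "v2", ["v1", "v0"])]
state = {"v0": [1, 2, 3, 4], "v1": [0] * 4, "v2": [5, 6, 7, 8], "v3": [2] * 4}
s1, s2 = copy.deepcopy(state), copy.deepcopy(state)
manic_accesses, base_accesses = run_window(manic, s1), run_window(baseline, s2)
```

Both variants compute the same result, but the renamed window performs 4 VRF accesses per element instead of 6, mirroring how forwarding plus a kill annotation removes one VRF write and one VRF read per forwarded value.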
Parameter            | Possible Values | Optimal
---------------------|-----------------|--------
Core Frequency       | 100 MHz         | 100 MHz
Scalar Register #    | 16              | 16
Vector Register #    | 16              | 16
Vector Length        | 16/32/64        | 64
Window Size          | 8/16/32         | 16
Main Memory Size     | 64 KB           | 64 KB
Cache Line Size      | 4/8 B           | 4 B
DCache Size          | 128/256/512 B   | 256 B
DCache Associativity | 1/2/4/8         | 8
ICache Size          | 64/128/256 B    | 128 B
ICache Associativity | 1/2/4/8         | 2

Table 1: Microarchitectural parameters.

Figure 7: Code scheduling is microarchitecturally agnostic – minimizing the sum of kill distances is a good proxy for minimizing register writes for a specific window size. Panels plot register writes vs. window size when optimizing for window size (µarch-specific) and for kill distance (µarch-agnostic): (a) sparse matrix multiply, (b) sparse convolution, (c) fast-fourier transform.
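The fixed-point search that produced Table 1's "Optimal" column (described under "Design space exploration" in Sec. 5.2) amounts to coordinate descent. The sketch below uses candidate values that mirror Table 1, but `toy_energy` is a made-up, separable stand-in for the real post-synthesis energy measurements.

```python
SPACE = {
    "vector_length": [16, 32, 64],
    "window_size": [8, 16, 32],
    "dcache_size": [128, 256, 512],
    "icache_size": [64, 128, 256],
}

def toy_energy(cfg):
    """Hypothetical cost model: long vectors amortize instruction supply, and
    the remaining terms are minimized at the Table 1 optima by construction."""
    return (1000 / cfg["vector_length"] + abs(cfg["window_size"] - 16)
            + abs(cfg["dcache_size"] - 256) / 64
            + abs(cfg["icache_size"] - 128) / 32)

def explore(space, energy):
    """Hold all but one parameter constant, pick that parameter's best value,
    rotate through every parameter, and repeat until a fixed point."""
    cfg = {k: vals[0] for k, vals in space.items()}
    while True:
        prev = dict(cfg)
        for param, values in space.items():
            cfg[param] = min(values, key=lambda v: energy({**cfg, param: v}))
        if cfg == prev:  # no parameter changed: fixed point reached
            return cfg

best = explore(SPACE, toy_energy)
```

Because the toy cost is separable, one sweep per parameter reaches the fixed point; the real search behaves the same way procedurally, just with measured energies.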

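A toy version of the metric plotted in Fig. 7: given a straight-line schedule and a window size, count the VRF writes one pass incurs, treating a destination as forwardable (no write-back) only when all of its uses land in the same window and it is killed there. The fixed-stride windowing and the `(dest, sources, kills)` encoding are simplifying assumptions, not MANIC's exact windowing rules.

```python
def register_writes(sched, window_size):
    """sched: (dest, sources, kills) triples. A window covers window_size
    consecutive instructions; a killed value consumed entirely inside its
    producer's window is forwarded and never written to the VRF."""
    writes = 0
    for i, (dest, _, _) in enumerate(sched):
        start = (i // window_size) * window_size
        window = range(start, start + window_size)
        uses = [j for j, (_, srcs, _) in enumerate(sched) if j > i and dest in srcs]
        killed = any(dest in sched[j][2] for j in uses)
        if uses and all(j in window for j in uses) and killed:
            continue  # forwarded within the window and killed: no VRF write
        writes += 1
    return writes

# Two producer/consumer pairs; each consumer kills its producer's value.
sched = [("v1", ["a"], []), ("v2", ["v1"], ["v1"]),
         ("v3", ["b"], []), ("v4", ["v3"], ["v3"])]
```

With a window of 1 every value round-trips through the VRF (4 writes); once the window covers each producer/consumer pair, only the two final, unkilled results are written back, which is the kind of drop the Fig. 7 curves show as window size grows.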
Fig. 7 shows that the microarchitecture-agnostic minimization of the sum kill distance closely approximates a microarchitecture-specific approach that optimizes for a particular window size. The plot shows the number of register writes made by one iteration of the kernel's inner loop for a given window size using code optimized by the two different optimization criteria. The blue line shows the number of register writes of a microarchitecture-specific schedule, where window size is exposed to the compiler. The red line shows the number of writes for our microarchitecture-agnostic schedule based on sum kill distance. The two curves generally agree, suggesting that minimizing sum kill distance eliminates register writes with similar efficacy as when window size is exposed explicitly to the compiler. For the FFT kernel, the instruction window is broken by stores and permutations (Sec. 4.2), causing additional vector register file writes. This is a limitation of optimizing only for sum kill distance that we plan to address in future work.

5 METHODOLOGY

We implement MANIC entirely in RTL and evaluate the system using post-synthesis timing, power, and energy modeling.

5.1 Full-stack approach

We take a full-stack approach to implementing MANIC. We build a simulation infrastructure based on Verilator [76] that allows us to run full applications on top of MANIC. This infrastructure includes a custom version of Spike, modifications to the assembler, a custom LibC, a custom bootloader, a cache and memory simulator, and RTL for both MANIC and its five-stage pipelined scalar core. We develop a custom version of Spike [1] (the gold standard for functional RISC-V simulation) with support for vector instructions to verify RTL simulation. We extend the GNU assembler to generate the correct bit encodings for the RISC-V vector extension instruction set. We build a custom LibC and bootloader to minimize the overhead of startup and to support our architecture. We use a cache and memory simulator to model timing for loads and stores. Finally, we use a test harness built with Verilator to run full-application, cycle-accurate RTL simulations.

Post-synthesis modeling: We synthesize MANIC using the Cadence Genus Synthesis solution and a 45nm standard cell library. Genus generates post-synthesis timing, area, and generic power estimates. To get a better power estimate, we annotate switching for each application. Towards this end, we use our Verilator test harness to generate a SystemVerilog testbench for each application. Then we generate a value-change dump (VCD) file, using the Cadence Xcelium RTL simulator, our application testbench, and the post-synthesis RTL netlist. The VCD file represents switching of all internal signals in the netlist. Finally, we feed the VCD file back into Genus and leverage the Joules power estimation tool to estimate the leakage and dynamic power for the entire application. For some applications, the VCD file that covers the entire execution of the application is too large to store. In these instances, we generate a VCD file that is representative of a region of interest.

Modeling memory energy: We model the power, energy, and latency of main memory, caches, and the vector register file using Destiny [67]. We use Destiny's SRAM RAM model for the vector register file and its SRAM cache model for the caches. A large, non-volatile main memory is a prerequisite for an intermittently-powered system, which is one potential target of MANIC. We model main memory using Destiny's STT-MRAM model, because STT-MRAM has favorable energy characteristics and high write endurance compared to alternatives like Flash and RRAM [16, 40, 85]. Each memory model was configured to optimize for low leakage using the LSTP device roadmap and 45nm technology node.

5.2 Baselines and applications

We compare MANIC to two baselines: an in-order, single-issue scalar core architecture and a simple vector architecture. We implement both baselines in RTL and follow the same process detailed previously for MANIC for estimating energy and power. The scalar and vector baselines share a scalar core model with MANIC (a simple, in-order, five-stage-pipelined core, representative of ultra-low-power systems). Each implementation also has separate instruction and data caches. The vector baseline is similar to MANIC, possessing a small instruction buffer, but, unlike MANIC, iterates through each vector instruction element by element and writes intermediate results to the vector register file. We also consider an idealized design that operates with a similar datapath as MANIC, but eliminates the main costs of programmability, namely instruction fetch/decode and scalar/vector register file access.

Design space exploration: For the baselines and MANIC as well as different configurations of each, we conduct a design space exploration to optimize microarchitectural parameters to reduce total energy. Holding all but one parameter constant, we identify the


Figure 8: Full system energy of MANIC against various baselines across seven different applications (FFT, SConv, DConv, SMM, SMV, DMM, DMV). Bars (from left-to-right): scalar baseline, vector baseline, MANIC, and an idealized vector design with no instruction or data supply energy; each bar is broken down into Memory, DCache, ICache, Compute, and VRF energy. MANIC is within 26.4% of the ideal design and is overall 2.8× more energy efficient than the scalar baseline and 38.1% more energy efficient than the vector baseline.

Figure 9: Impact of MANIC's optimizations on full system energy, comparing (from left-to-right): vector baseline, MANIC with forwarding disabled, MANIC without dataflow code scheduling, and full MANIC. Without forwarding, MANIC's added components slightly increase energy by <5%. Forwarding saves 15.5% system energy vs. the baseline, and kill annotations and dataflow code scheduling save a further 26.0%.

optimal setting of the variable parameter, iterating through each parameter and selecting the best value for each in the next iteration until we reach a fixed point. Our design-space parameters are data and instruction cache sizes, cache line size (4 or 8 bytes), cache associativity (1, 2, 4, 8), vector length, and instruction window size. Unless specified otherwise, our architectural simulations use the parameters shown in Table 1.

Applications: We evaluate MANIC using seven different benchmarks representative of the ultra-low-power computing domain and similar in scope to prior work [6]. We implemented 2D fast-fourier transform (FFT), dense and sparse matrix-matrix multiply (DMM and SMM, respectively), dense and sparse matrix-vector multiply (DMV and SMV, respectively), and dense and sparse 2D convolution (DConv and SConv, respectively). These kernels are important because they are at the heart of many important applications including signal processing, graph analysis, machine learning (including deep neural networks), and communications. Often they account for the majority of execution time and application energy – convolution alone in inference applications can account for more than 80% of energy [15]. For each benchmark, we wrote four variants: a version using SONIC [32], which enables safe execution on an intermittent system; a plain C version; a vectorized plain C version; and a vectorized plain C version that we scheduled using MANIC's dataflow code scheduler.

The plain-C, vectorized plain-C, and scheduled versions use the same algorithm for each application, except for FFT. For FFT, the plain-C implementation uses the Cooley-Tukey algorithm, while the vectorized versions do a series of smaller FFTs leveraging the vector length and then combine these small FFTs for the output.

6 EVALUATION

We evaluated MANIC to show that it achieves near-ideal energy efficiency without sacrificing generality or programmability. MANIC is more energy efficient, uses less power, and is as fast or faster than a vector or scalar baseline.

6.1 MANIC is energy-efficient

Fig. 8 shows that MANIC is energy efficient, comparing full-system energy for several different architectures across seven workloads. Comparing the first three bars in the figure (scalar baseline, vector baseline, and MANIC) illustrates that MANIC uses less total energy than either baseline. On average, MANIC decreases total energy consumption by 2.8× and by as much as 3.7× over the scalar baseline. These results are likely to significantly under-estimate MANIC's improvement over a COTS system deployed today because we model a scalar baseline implemented in 45nm that uses less energy than a common commercial offering implemented in, e.g., 130nm [41]. MANIC reduces energy by 38.1% compared to the vector baseline and consistently decreases energy for all applications, decreasing energy by a minimum of 30.9% (for sparse convolution).

The key to this efficiency is MANIC's vector-dataflow model and kill annotations, which eliminate most vector register reads and writes. Consequently, vector register file access energy (blue segment) is significantly lower for MANIC than for the vector baseline.

MANIC has near-ideal energy-efficiency: Comparing MANIC to ideal in Fig. 8 shows how MANIC performs compared to an idealized design that removes instruction and data supply energy (i.e., fetch, decode, and RF accesses). In the idealized design, memory's energy cost dominates. MANIC is within 26.4% of the ideal energy consumption. These data show that the cost of programmability and generality in MANIC is low and near the ideal design.

MANIC (without forwarding) has small overhead: Fig. 9 shows the sensitivity of MANIC to various optimizations. The second bars in the figure represent MANIC with forwarding disabled. Compared to the vector baseline (first bars in the figure), MANIC without
forwarding incurs only a small energy overhead of less than 5% for its microarchitectural additions (e.g., the instruction and forwarding buffers).

Forwarding saves significant energy: The third bars in Fig. 9 show MANIC with dataflow code scheduling disabled, meaning that there are no kill annotations and fewer opportunities to forward values. Even without code scheduling, MANIC decreases energy as compared to the vector baseline because forwarding effectively eliminates vector register reads. On average, MANIC's forwarding without any special code scheduling reduces total energy by 15.5% over the vector baseline.

Dataflow code scheduling improves forwarding: Finally, comparing these bars to MANIC (last bars in the figure) shows that code scheduling improves MANIC's forwarding to eliminate vector register writes and further reduce reads (i.e., the blue segment is smaller with scheduling). On average, scheduling improves energy by 26.0% vs. without, and, on dense and sparse convolution, MANIC is able to entirely eliminate all vector register file accesses.

Figure 10: In the intermittent computing domain, MANIC with hardware JIT-checkpointing is 9.6× more energy efficient than SONIC [32], which maintains correctness in software alone. Bars (SONIC, scalar, MANIC per application) are broken down into Memory, DCache, ICache, Compute, and VRF energy.

MANIC with JIT-checkpointing outperforms SONIC: To evaluate how MANIC would perform under intermittent execution on an energy-harvesting system, we compare MANIC to SONIC [32], the state-of-the-art software runtime system for machine inference on energy-harvesting devices. Fig. 10 shows results for SONIC, the scalar baseline running plain C, and MANIC. MANIC significantly outperforms SONIC, by 9.6× on average. This is because MANIC has hardware support for JIT-checkpointing of architectural state before power failure, whereas SONIC suffers from increased accesses to the icache, dcache, and main memory to checkpoint progress in software. Scalar with JIT-checkpointing is 3.5× more efficient than SONIC. Note that SONIC's dcache is write-through because otherwise the dcache and main memory could be inconsistent, and flushes after every store would be required for correctness.

6.2 MANIC is performant

Fig. 11 shows both the cycle and instruction counts for each application running on the scalar and vector baselines as well as MANIC. The scalar baseline runs on average 10.6× more instructions and 2.5× more cycles than both the vector baseline and MANIC. MANIC and the vector baseline do not differ in instruction counts and effectively do not differ in cycle counts either (vector-dataflow execution requires no additional cycles and both implementations run the same program). Vector execution explains the difference in performance between the scalar baseline and the vector implementations. Not only does vector execution amortize instruction fetch and decode, but it also completely eliminates scalar instructions in inner-loop nests. For example, a loop index needs to be incremented significantly less often (up to 64× less if the vector length is 64) in vector implementations compared to the scalar-only implementations.

Figure 11: Instruction and cycle counts ((a) cycle count, (b) instruction count) for seven benchmarks running on the scalar baseline, vector baseline, and MANIC. The vector baseline and MANIC effectively do not differ. Vector execution means that both run 10.6× fewer instructions and 2.5× fewer cycles than the scalar baseline.

Figure 12: Power of the scalar baseline, vector baseline, and MANIC across seven benchmarks. MANIC uses 10.0% less power than the scalar baseline and, despite using less energy than the scalar baseline, the vector baseline actually uses 29.5% more power.

Fewer instructions, less energy: The scalar baseline spends 54.4% of its energy on instruction fetch and register file access (the costs of programmability). Hence, theoretically, a 2.2× reduction in energy

[Figure 13, panels (a) Window/Instruction buffer size and (b) Vector length: energy (µJ) for FFT, SConv, DConv, SMM, SMV, DMM, and DMV, broken down into Memory, DCache, ICache, Compute, and VRF.]

[Figure 13, panels (c) Instruction cache size and (d) Data cache size: energy (µJ) per benchmark across cache sizes.]
Figure 13: MANIC's sensitivity to its microarchitectural parameters: 16 is the best window size, larger vector lengths are generally better, and moderately sized caches are generally more energy efficient.

6.3 MANIC is power-efficient

Fig. 12 shows the average power in mW for the scalar and vector baselines as well as MANIC. Each bar is broken down in the same way as in Fig. 9. MANIC uses on average 10.0% less power than the scalar baseline, while the vector baseline actually uses 29.5% more power than the scalar baseline. (Recall that our goal is to save energy, not power.) Vector register file accesses (blue segment in Fig. 9) account for the differences in power. The vector baseline must access the VRF for every element of every vector instruction, while MANIC avoids many VRF accesses and thus uses less power.
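The power and energy results are tied together by P = E/t. A short sketch (our arithmetic, using only figures quoted in the text, assuming equal clock frequency and using the 2.5× cycle-count gap from Fig. 11 as a proxy for runtime) shows why the vector baseline can draw 29.5% more power yet still use less energy than the scalar baseline, while MANIC improves on both:

```python
# Energy = power x time, all expressed relative to the scalar baseline.

def relative_energy(rel_power: float, rel_time: float) -> float:
    """Energy relative to the scalar baseline (scalar = 1.0)."""
    return rel_power * rel_time

rel_time = 1 / 2.5  # both vector designs run ~2.5x fewer cycles (Fig. 11)

vector = relative_energy(1.295, rel_time)  # vector baseline: +29.5% power
manic  = relative_energy(0.900, rel_time)  # MANIC: -10.0% power

print(f"vector baseline energy vs scalar: {vector:.2f}")   # -> 0.52
print(f"MANIC energy vs scalar:           {manic:.2f}")    # -> 0.36
print(f"MANIC energy-efficiency gain:     {1/manic:.1f}x") # -> 2.8x
```

Reassuringly, this simple model reproduces the 2.8× energy-efficiency improvement over the scalar baseline reported elsewhere in the paper.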
6.4 Sensitivity

We characterize MANIC by studying the sensitivity of its energy consumption to window size (Fig. 13a), vector length (Fig. 13b), instruction cache size (Fig. 13c), and data cache size (Fig. 13d).

Window size of 16 is best: Fig. 13a suggests that larger window sizes have only a small energy cost and can provide a significant improvement to kernels that can take advantage of the extra length. Excluding FFT, a window size of 8 is best, which is consistent with the measured kill distances in Fig. 3. For FFT, a window size of 16 or 32 is beneficial. This is because FFT can forward values across loop iterations, allowing a window to span multiple iterations and expose additional dataflow opportunities. This phenomenon is not captured in our static code analysis in Fig. 3, which effectively considers only a single loop iteration.

Large vector lengths reduce icache energy: Fig. 13b shows that energy decreases as vector length increases. Forwarding eliminates 90.1% of total VRF accesses and thus, despite a larger VRF having a higher access energy, longer vector lengths are still beneficial. The benefit, however, levels off as the proportion of energy spent on instruction supply (yellow segments) decreases as a fraction of the total energy. In our experiments, we stop the vector length at 64 because certain applications (like FFT) are designed to take advantage of vector lengths of 64 or less. This is an artificial stopping point, and we plan further work to investigate even larger vector lengths.

A moderately-sized icache is sufficient: Fig. 13c illustrates the need for a relatively small instruction cache. The data show that a very small instruction cache is inefficient, forcing hardware to access main memory often. However, a large cache is unnecessary to capture instruction access locality and realize a large energy-efficiency benefit. An instruction cache of 128B is sufficient for most benchmarks, and increasing the size further only seems to be a detriment due to higher access energy.

A relatively small data cache is adequate: Fig. 13d shows that, generally speaking, a cache size of 256B is sufficient.
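The cache-size sweet spot reflects a simple tradeoff: a tiny cache pays main-memory energy on frequent misses, while an oversized cache pays more energy on every access. A toy model (entirely ours; the per-access energies and miss ratios below are invented for illustration, and only the 64B–512B sizes mirror the sweep) captures the U-shape:

```python
# Hypothetical per-access energies (pJ) and miss ratios per cache size (B).
E_MEM = 50.0  # assumed cost of one main-memory access
E_CACHE = {64: 1.0, 128: 1.3, 256: 1.7, 512: 2.3}       # grows with size
MISS_RATIO = {64: 0.30, 128: 0.10, 256: 0.02, 512: 0.015}

def energy_per_access(size: int) -> float:
    """Expected energy of one access: cache hit energy + miss penalty."""
    return E_CACHE[size] + MISS_RATIO[size] * E_MEM

best = min(E_CACHE, key=energy_per_access)
print(best)  # -> 256: with these made-up numbers, 256B minimizes energy
```

With these invented numbers the minimum lands at 256B; MANIC's real sweep (Fig. 13c/d) measures this tradeoff directly rather than modeling it.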
A too-small dcache leads to additional misses and accesses to main memory. For most benchmarks, except dense matrix multiplication, there is no reduction in energy for the largest cache size of 512B (if anything, there is a slight increase due to increased access energy and leakage). Dense matrix multiplication still shows an improvement in energy at the largest cache size, but the improvement is small compared to the improvement between 128B and 256B, suggesting that most of the benefit has been captured at 256B.

7 CONCLUSION

This paper described MANIC, an ultra-low-power embedded processor architecture that achieves high energy efficiency without sacrificing programmability or generality. The key to MANIC's efficient operation is its vector-dataflow execution model, in which dependent instructions in a short window forward operands to one another according to dataflow. Vector operation amortizes control overhead. Dataflow execution avoids costly reads from the vector register file. Simple compiler and software support helps avoid further vector register file writes in a microarchitecture-agnostic way. MANIC's microarchitecture directly implements vector-dataflow with simple hardware additions, while still exposing a standard RISC-V ISA interface. MANIC's highly efficient implementation is on average 2.8× more energy efficient than a scalar core and is within 26.4% on average of an ideal design that eliminates all costs of programmability. Our results show that MANIC's vector-dataflow model is realizable and approaches the limit of energy efficiency for an ultra-low-power embedded processor.

REFERENCES

[1] Yunsup Lee, Andrew Waterman, and Tim Newsome. 2019. RISCV ISA Sim. https://github.com/riscv/riscv-isa-sim
[2] ARM. 2019. ARM NEON. https://developer.arm.com/architectures/instruction-sets/-isas/neon
[3] Krste Asanovic, James Beck, Bertrand Irissou, Brian ED Kingsbury, and John Wawrzynek. 1996. T0: A single-chip vector microprocessor with reconfigurable pipelines. In ESSCIRC'96: Proceedings of the 22nd European Solid-State Circuits Conference. IEEE, 344–347.
[4] Görkem Aşılıoğlu, Zhaoxiang Jin, Murat Köksal, Omkar Javeri, and Soner Önder. 2015. LaZy superscalar. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA). IEEE, 260–271.
[5] James Balfour, William Dally, David Black-Schaffer, Vishal Parikh, and JongSoo Park. 2008. An energy-efficient processor architecture for embedded systems. IEEE Computer Architecture Letters 7, 1 (2008), 29–32.
[6] James Balfour, Richard Harting, and William Dally. 2009. Operand registers and explicit operand forwarding. IEEE Computer Architecture Letters 8, 2 (2009), 60–63.
[7] James David Balfour, William J. Dally, Mark Horowitz, and Christoforos Kozyrakis. 2010. Efficient embedded computing. Ph.D. Dissertation.
[8] Deniz Balkan, Joseph Sharkey, Dmitry Ponomarev, and Kanad Ghose. 2006. SPARTAN: speculative avoidance of register allocations to transient values for performance and energy efficiency. In Proceedings of the 15th international conference on Parallel architectures and compilation techniques. ACM, 265–274.
[9] D. Balsamo, A. Weddell, A. Das, A. Arreola, D. Brunelli, B. Al-Hashimi, G. Merrett, and L. Benini. 2016. Hibernus++: A Self-Calibrating and Adaptive System for Transiently-Powered Embedded Devices. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems PP, 99 (2016), 1–1. https://doi.org/10.1109/TCAD.2016.2547919
[10] D. Balsamo, A.S. Weddell, G.V. Merrett, B.M. Al-Hashimi, D. Brunelli, and L. Benini. 2014. Hibernus: Sustaining Computation during Intermittent Supply for Energy-Harvesting Systems. Embedded Systems Letters, IEEE PP, 99 (2014), 1–1. https://doi.org/10.1109/LES.2014.2371494
[11] Lorenzo Bettini, Pilu Crescenzi, Gaia Innocenti, and Michele Loreti. 2004. An environment for self-assessing Java programming skills in first programming courses. In IEEE International Conference on Advanced Learning Technologies, 2004. Proceedings. IEEE, 161–165.
[12] Anne Bracy, Prashant Prahlad, and Amir Roth. 2004. Dataflow mini-graphs: Amplifying superscalar capacity and bandwidth. In 37th International Symposium on Microarchitecture (MICRO-37'04). IEEE, 18–29.
[13] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proc. of the 19th intl. conf. on Architectural Support for Programming Languages and Operating Systems.
[14] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. 2014. DaDianNao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 609–622.
[15] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In Proc. of the 43rd annual Intl. Symp. on Computer Architecture (Proc. ISCA-43).
[16] Tsai-Kan Chien, Lih-Yih Chiou, Shyh-Shyuan Sheu, Jing-Cian Lin, Chang-Chia Lee, Tzu-Kun Ku, Ming-Jinn Tsai, and Chih-I Wu. 2016. Low-power MCU with embedded ReRAM buffers as sensor hub for IoT applications. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 6, 2 (2016), 247–257.
[17] Silviu Ciricescu, Ray Essick, Brian Lucas, Phil May, Kent Moat, Jim Norris, Michael Schuette, and Ali Saidi. 2003. The reconfigurable streaming vector processor (RSVP™). In Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 141.
[18] Alexei Colin and Brandon Lucia. 2016. Chain: Tasks and Channels for Reliable Intermittent Programs. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA).
[19] Alexei Colin, Emily Ruppel, and Brandon Lucia. 2018. A Reconfigurable Energy Storage Architecture for Energy-harvesting Devices. In ASPLOS.
[20] NVIDIA Corporation. 2009. NVIDIA's next generation CUDA compute architecture: Fermi. (2009).
[21] Cray Computer. 2003. U.S. Patent 6,665,774.
[22] David Culler, Jason Hill, Mike Horton, Kris Pister, Robert Szewczyk, and Alec Wood. 2002. Mica: The commercialization of microsensor motes. Sensor Technology and Design, April (2002).
[23] Jack B Dennis. 1980. Data flow supercomputers. Computer 11 (1980), 48–56.
[24] Jack B Dennis and Guang R Gao. 1988. An efficient pipelined dataflow processor architecture. In Proceedings of the 1988 ACM/IEEE conference on Supercomputing. IEEE Computer Society Press, 368–373.
[25] Jack B Dennis and David P Misunas. 1975. A preliminary architecture for a basic data-flow processor. In ACM SIGARCH Computer Architecture News, Vol. 3. ACM, 126–132.
[26] Adwait Dongare, Craig Hesling, Khushboo Bhatia, Artur Balanuta, Ricardo Lopes Pereira, Bob Iannucci, and Anthony Rowe. 2017. OpenChirp: A low-power wide-area networking architecture. In Pervasive Computing and Communications Workshops (PerCom Workshops), 2017 IEEE International Conference on. IEEE, 569–574.
[27] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. 2015. ShiDianNao: Shifting vision processing closer to the sensor. In Proc. of the 42nd annual Intl. Symp. on Computer Architecture (Proc. ISCA-42).
[28] Nadeem Firasta, Mark Buxton, Paula Jinbo, Kaveh Nasri, and Shihjong Kuo. 2008. Intel AVX: New frontiers in performance improvements and energy efficiency. Intel white paper 19, 20 (2008).
[29] HSA Foundation. 2018. HSA Platform System Architecture Specification. http://www.hsafoundation.com/standards/
[30] Karthik Ganesan, Joshua San Miguel, and Natalie Enright Jerger. 2019. The What's Next Intermittent Computing Architecture. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 211–223.
[31] Mark Gebhart, Daniel R Johnson, David Tarjan, Stephen W Keckler, William J Dally, Erik Lindholm, and Kevin Skadron. 2011. Energy-efficient mechanisms for managing thread context in throughput processors. In 2011 38th Annual International Symposium on Computer Architecture (ISCA). IEEE, 235–246.
[32] Graham Gobieski, Brandon Lucia, and Nathan Beckmann. 2019. Intelligence Beyond the Edge: Inference on Intermittent Embedded Systems. In ASPLOS.
[33] Venkatraman Govindaraju, Chen-Han Ho, Tony Nowatzki, Jatin Chhugani, Nadathur Satish, Karthikeyan Sankaralingam, and Changkyu Kim. 2012. DySER: Unifying functionality and parallelism specialization for energy-efficient computing. IEEE Micro 32, 5 (2012), 38–51.
[34] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. In ACM SIGARCH Computer Architecture News, Vol. 38. ACM, 37–47.
[35] Josiah Hester, Lanny Sitanayah, and Jacob Sorber. 2015. Tragedy of the Coulombs: Federating Energy Storage for Tiny, Intermittently-Powered Sensors. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems (SenSys '15). ACM, New York, NY, USA, 5–16. https://doi.org/10.1145/2809695.2809707
[36] Josiah Hester and Jacob Sorber. [n. d.]. Flicker: Rapid Prototyping for the Battery-less Internet of Things. In Proceedings of the 15th ACM Conference on Embedded Network Sensor Systems (SenSys '17).
[37] Josiah Hester, Kevin Storer, and Jacob Sorber. [n. d.]. Timely Execution on Intermittently Powered Batteryless Sensors. In Proceedings of the 15th ACM Conference on Embedded Network Sensor Systems (SenSys '17).
[38] Matthew Hicks. 2017. Clank: Architectural Support for Intermittent Computation. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 228–240. https://doi.org/10.1145/3079856.3080238
[39] Mark Horowitz. 2014. Computing's energy problem (and what we can do about it). In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International. IEEE, 10–14.
[40] Yiming Huai et al. 2008. Spin-transfer torque MRAM (STT-MRAM): Challenges and prospects. AAPPS bulletin 18, 6 (2008), 33–40.
[41] Texas Instruments. 2017. MSP430fr5994 SLA. http://www.ti.com/lit/ds/symlink/msp430fr5994.pdf
[42] Texas Instruments. 2018. Low Energy Accelerator FAQ. http://www.ti.com/lit/an/slaa720/slaa720.pdf
[43] ISOCPP. 2019. CPP Spec. https://isocpp.org/std/the-standard
[44] Neal Jackson. 2019. lab11/permamote. https://github.com/lab11/permamote
[45] H. Jayakumar, A. Raha, and V. Raghunathan. 2014. QuickRecall: A Low Overhead HW/SW Approach for Enabling Computations across Power Cycles in Transiently Powered Computers. In Int'l Conf. on VLSI Design and Int'l Conf. on Embedded Systems.
[46] Hyeran Jeon, Gokul Subramanian Ravi, Nam Sung Kim, and Murali Annavaram. 2015. GPU register file virtualization. In Proceedings of the 48th International Symposium on Microarchitecture. ACM, 420–432.
[47] Timothy M Jones, MFR O'Boyle, Jaume Abella, Antonio Gonzalez, and Oguz Ergin. 2005. Compiler directed early register release. In 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05). IEEE, 110–119.
[48] Ho-Seop Kim and James E Smith. 2002. An instruction set and microarchitecture for instruction level distributed processing. In Proceedings 29th Annual International Symposium on Computer Architecture. IEEE, 71–81.
[49] Christos Kozyrakis and David Patterson. 2003. Overcoming the limitations of conventional vector processors. ACM SIGARCH Computer Architecture News 31, 2 (2003), 399–409.
[50] Gierad Laput, Yang Zhang, and Chris Harrison. 2017. Synthetic sensors: Towards general-purpose sensing. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 3986–3999.
[51] Daofu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, Olivier Temam, Xiaobing Feng, Xuehai Zhou, and Yunji Chen. 2015. PuDianNao: A polyvalent machine learning accelerator. In ACM SIGARCH Computer Architecture News, Vol. 43. ACM, 369–381.
[52] Ting Liu, Christopher M. Sadler, Pei Zhang, and Margaret Martonosi. 2004. Implementing Software on Resource-constrained Mobile Sensors: Experiences with Impala and ZebraNet. In Proceedings of the 2nd International Conference on Mobile Systems, Applications, and Services (MobiSys '04). ACM, New York, NY, USA, 256–269. https://doi.org/10.1145/990064.990095
[53] Brandon Lucia and Benjamin Ransford. 2015. A Simpler, Safer Programming and Execution Model for Intermittent Systems. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2015). ACM, New York, NY, USA, 575–585. https://doi.org/10.1145/2737924.2737978
[54] Kaisheng Ma, Xueqing Li, Jinyang Li, Yongpan Liu, Yuan Xie, Jack Sampson, Mahmut Taylan Kandemir, and Vijaykrishnan Narayanan. 2017. Incidental computing on IoT nonvolatile processors. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 204–218.
[55] Kaisheng Ma, Yang Zheng, Shuangchen Li, Karthik Swaminathan, Xueqing Li, Yongpan Liu, Jack Sampson, Yuan Xie, and Vijaykrishnan Narayanan. 2015. Architecture exploration for ambient energy harvesting nonvolatile processors. In High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on. IEEE, 526–537.
[56] Kiwan Maeng, Alexei Colin, and Brandon Lucia. 2017. Alpaca: Intermittent Execution without Checkpoints. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA). ACM, Vancouver, BC, Canada.
[57] Kiwan Maeng and Brandon Lucia. 2018. Adaptive Dynamic Checkpointing for Safe Efficient Intermittent Computing. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'18). USENIX Association, Berkeley, CA, USA, 129–144. http://dl.acm.org/citation.cfm?id=3291168.3291178
[58] Kiwan Maeng and Brandon Lucia. 2019. Supporting Peripherals in Intermittent Systems with Just-In-Time Checkpoints. In PLDI.
[59] J. San Miguel, K. Ganesan, M. Badr, and N. E. Jerger. 2018. The EH Model: Analytical Exploration of Energy-Harvesting Architectures. IEEE Computer Architecture Letters 17, 1 (Jan 2018), 76–79. https://doi.org/10.1109/LCA.2017.2777834
[60] A. Mirhoseini, E. M. Songhori, and F. Koushanfar. 2013. Idetic: A High-level Synthesis Approach for Enabling Long Computations on Transiently-powered ASICs. In IEEE Pervasive Computing and Communication Conference (PerCom). http://aceslab.org/sites/default/files/Idetic.pdf
[61] Saman Naderiparizi, Mehrdad Hessar, Vamsi Talla, Shyamnath Gollakota, and Joshua R Smith. 2018. Towards battery-free HD video streaming. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18). 233–247.
[62] Rishiyur S Nikhil et al. 1990. Executing a program on the MIT tagged-token dataflow architecture. IEEE Transactions on computers 39, 3 (1990), 300–318.
[63] Tony Nowatzki, Vinay Gangadhar, Newsha Ardalani, and Karthikeyan Sankaralingam. 2017. Stream-dataflow acceleration. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 416–429.
[64] Nvidia. 2019. Nvidia Jetson TX2. https://developer.nvidia.com/embedded/develop/hardware
[65] D. Patterson, T. Anderson, and K. Yelick. 1996. A Case for Intelligent DRAM: IRAM. In Hot Chips VIII Symposium Record.
[66] Vlad Petric, Tingting Sha, and Amir Roth. 2005. Reno: a rename-based instruction optimizer. In 32nd International Symposium on Computer Architecture (ISCA'05). IEEE, 98–109.
[67] Matt Poremba, Sparsh Mittal, Dong Li, Jeffrey S Vetter, and Yuan Xie. 2015. Destiny: A tool for modeling emerging 3D NVM and eDRAM caches. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition. EDA Consortium, 1543–1546.
[68] Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matt Feldman, Tian Zhao, Stefan Hadjis, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. 2017. Plasticine: A reconfigurable architecture for parallel patterns. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on. IEEE, 389–402.
[69] Riscv. 2019. riscv-v-spec. https://github.com/riscv/riscv-v-spec
[70] Anthony Rowe, Mario E Berges, Gaurav Bhatia, Ethan Goldman, Ragunathan Rajkumar, James H Garrett, José MF Moura, and Lucio Soibelman. 2011. Sensor Andrew: Large-scale campus-wide sensing and actuation. IBM Journal of Research and Development 55, 1.2 (2011), 6–1.
[71] Emily Ruppel and Brandon Lucia. 2019. Transactional Concurrency Control for Intermittent, Energy Harvesting, Computing Systems. In PLDI.
[72] Alanson P. Sample, Daniel J. Yeager, Pauline S. Powledge, Alexander V. Mamishev, and Joshua R. Smith. 2008. Design of an RFID-Based Battery-Free Programmable Sensing Platform. IEEE Transactions on Instrumentation and Measurement 57, 11 (Nov. 2008), 2608–2615.
[73] Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Doug Burger, Stephen W Keckler, and Charles R Moore. 2003. Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture. In Computer Architecture, 2003. Proceedings. 30th Annual International Symposium on. IEEE, 422–433.
[74] Peter G Sassone and D Scott Wills. 2004. Dynamic strands: Collapsing speculative dependence chains for reducing pipeline communication. In 37th International Symposium on Microarchitecture (MICRO-37'04). IEEE, 7–17.
[75] Andreas Sembrant, Trevor Carlson, Erik Hagersten, David Black-Shaffer, Arthur Perais, André Seznec, and Pierre Michaud. 2015. Long term parking (LTP): criticality-aware resource allocation in OOO processors. In 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 334–346.
[76] Wilson Snyder. 2004. Verilator and SystemPerl. In North American SystemC Users' Group, Design Automation Conference.
[77] Steven Swanson, Ken Michelson, Andrew Schwerin, and Mark Oskin. 2003. WaveScalar. In Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 291.
[78] Andreas Traber. [n. d.]. PULPino: A small single-core RISC-V SoC.
[79] Joel Van Der Woude and Matthew Hicks. 2016. Intermittent computation without hardware support or programmer intervention. In Proceedings of OSDI'16: 12th USENIX Symposium on Operating Systems Design and Implementation. 17.
[80] Ganesh Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose Lugo-Martinez, Steven Swanson, and Michael Bedford Taylor. 2010. Conservation cores: reducing the energy of mature computations. In ACM SIGARCH Computer Architecture News, Vol. 38. ACM, 205–218.
[81] Nandita Vijaykumar, Kevin Hsieh, Gennady Pekhimenko, Samira Khan, Ashish Shrestha, Saugata Ghose, Adwait Jog, Phillip B Gibbons, and Onur Mutlu. 2016. Zorua: A holistic approach to resource virtualization in GPUs. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 15.
[82] Dani Voitsechov, Arslan Zulfiqar, Mark Stephenson, Mark Gebhart, and Stephen W. Keckler. 2018. Software-Directed Techniques for Improved GPU Register File Utilization. ACM Trans. Archit. Code Optim. 15, 3, Article 38 (Sept. 2018), 23 pages. https://doi.org/10.1145/3243905
[83] John Wawrzynek, Krste Asanovic, Brian Kingsbury, David Johnson, James Beck, and Nelson Morgan. 1996. Spert-II: A vector microprocessor system. Computer 29, 3 (1996), 79–86.
[84] Mark Wyse. [n. d.]. Understanding GPGPU Vector Register File Usage.
[85] Yuan Xie. 2013. Emerging Memory Technologies: Design, Architecture, and Applications. Springer Science & Business Media.
[86] Kasım Sinan Yıldırım, Amjad Yousef Majid, Dimitris Patoukas, Koen Schaper, Przemyslaw Pawelczak, and Josiah Hester. 2018. InK: Reactive kernel for tiny batteryless sensors. In Proceedings of the 16th ACM Conference on Embedded Networked Sensor Systems. ACM, 41–53.
[87] Hong Zhang, Jeremy Gummeson, Benjamin Ransford, and Kevin Fu. 2011. Moo: A batteryless computational RFID and sensing platform. Department of Computer Science, University of Massachusetts Amherst, Tech. Rep. (2011).