GENERAL INTEREST

Cost-Effective and Flexible Asynchronous Interconnect Technology for GALS Systems

Davide Bertozzi, Gabriele Miorandi, and Alberto Ghiribaldi, University of Ferrara, Ferrara, 44122, Italy Wayne Burleson, University of Massachusetts Amherst, Amherst, MA, 01003, USA Greg Sadowski, , Inc., Boxborough, MA, 01719, USA Kshitij Bhardwaj, Weiwei Jiang, and Steven M. Nowick, Columbia University, New York, NY, 10027, USA

In this article, a novel interconnect technology is presented for the cost-effective and flexible design of asynchronous networks-on-chip. It delivers asynchrony in heterogeneous system integration while yielding low-energy on-chip data movement. The approach consists of both a lightweight asynchronous switch architecture (using transition-signaling protocols and bundled-data encoding) and a complete synthesis flow built on top of mainstream industrial CAD tools. For the first time, this article demonstrates compelling area, performance and power benefits when compared to a recent commercial synchronous switch, and the ability of the tool flow to correctly instantiate a complete and competitive network topology.

urrent computing architectures bear less islands of synchronicity that interact over a fully asyn- and less resemblance to the early multicore chronous interconnection network. Cprocessors. On the one hand, critical design Compared to synchronous counterparts, asyn- challenges are being tackled, ranging from the utili- chronous NoCs bring several potential advantages: zation wall and dark silicon issues1 to power/ther- mal management and reliability.2 On the other › No overhead of global clock distribution, tuning, hand, fundamental shifts from traditional von Neu- and management. 3 mann architectures are gaining traction. Among › No need for performance equalization within them, spiking neural-network-based neuromorphic individual unbalanced pipelines, and across 4,5 systems use biological inspiration and obtain different pipelines, in the network, leading to fi energy ef ciency by exploiting asynchronous event- aggregate system-level performance benefits. driven computation. › Support for optimized flit-level performance, From the system design viewpoint, the common tailored to the different timing paths that each challenge to the above trends consists of integrating a flit-type activates, unlike traditional worst-case fi large number of ne-grain computational units while clocked design. decoupling their operating mechanisms and conditions. This challenge motivates the recent surge of inter- However, despite their promise, two main barriers est, in industry and academia, in globally-asynchro- still prevent asynchronous NoCs from fulfilling modern nous locally-synchronous (GALS) architectures, and optimization, scaling and flexibility requirements, thus the design of asynchronous networks-on-chip (NoCs) limiting applicability. to support them.6 In a GALS system, cores are local First, the choice of communication protocols and data-encoding schemes in most state-of-the-art asyn- chronous NoCs aims to simplify hardware design (e.g., using four-phase, or “return-to-zero,” protocols) and to “ ß enforce extreme timing robustness (e.g., using delay- 0272-1732 2020 IEEE ” Digital Object Identifier 10.1109/MM.2020.3002790 insensitive data encoding), at the cost of low through- Date of publication 16 June 2020; date of current version put, high area occupancy, poor coding efficiency, and 20 January 2021. high energy-per-bit.

January/February 2021 Published by the IEEE Computer Society IEEE Micro 69

Authorized licensed use limited to: Columbia University Libraries. Downloaded on January 28,2021 at 20:06:53 UTC from IEEE Xplore. Restrictions apply. GENERAL INTEREST

ASYNCHRONOUS COMMUNICATION CHANNELS: PROTOCOLS AND DATA ENCODING

synchronous components communicate via Alternative data-encoding schemes, such as single- A clockless handshaking, which involves defining rail bundled-data [see Figure S1(d)], use moderate timing both a handshaking protocol and a data-encoding constraints, while offering high coding efficiency and low scheme.1,2 energy-per-bit. This approach uses a standard There are two common handshaking protocols. synchronous-style single-rail data channel with binary Four-phase handshaking (“return-to-zero”) requires two data encoding. An extra request (“req”) wire is then round-trip communications per transaction [see Figure S1 “bundled” with the data, serving as a local strobe on (a)], but potentially leads to simpler hardware, since demand, whenever data are sent, along with a backwards signals return to a baseline value (i.e., 0) between acknowledgment (“ack”) wire. This scheme has the transactions. In contrast, in two-phase handshaking benefit of allowing the use of synchronous-style, i.e., (“non-return-to-zero,” or transition- signaling), each hazardous, computation blocks. Both four-phase and control signal makes a single toggle, with no return-to- two-phase protocols are common.1,2 For correct zero phase, incurring only one round-trip communication implementation, a single one-sided relative timing per transaction [see Figure S1(b)]. Hence, two-phase constraint (RTC) must be satisfied, that the req delay is protocols are preferred for high-performance circuits, always longer than worst case data transmission. This though they may lead to more complex hardware. A key bundling constraint is typically met by inserting a small challenge addressed by the current research is to employ matched delay on the control line, when needed.2 Unlike two-phase handshaking extensively in the NoC switch synchronous timing, however, such constraints are while retaining low hardware overhead. localized: there is no global timing constraint, and The most common data-encoding schemes are delay- unbalanced stages can correctly interact with their own insensitive (DI) codes and single-rail bundled data. DI matched delays. However, current commercial CAD codes support robust communication by explicitly tools offer poor support for these RTCs, since they target encoding both data validity and actual data values. Most min/max delay constraints with absolute timing only. common is dual-rail encoding [see Figure S1(c)], where each bit is encoded with two rails or wires. Independent REFERENCES of transmission time or relative bit skew, the receiver can 1. S. M. Nowick and M. Singh, “Asynchronous design - unambiguously identify when each bit is valid using a part 1: Overview and recent advances,” IEEE Des. completion detector. Overall, these codes provide great Test, vol. 32, no. 3, pp. 5–18, Jun. 2015. resilience to physical and operating variability. However, 2. S. M. Nowick and M. Singh, “High-performance fi most DI schemes have poor coding ef ciency and high asynchronous pipelines: An overview,” IEEE Des. Test energy-per-bit, due to their wiring overhead. Comput., vol. 28, no. 5, pp. 8–22, Sep./Oct. 2011.

FIGURE S1. Asynchronous (a-b) handshaking protocols, and (c-d) data-encoding schemes.

70 IEEE Micro January/February 2021

Authorized licensed use limited to: Columbia University Libraries. Downloaded on January 28,2021 at 20:06:53 UTC from IEEE Xplore. Restrictions apply. GENERAL INTEREST

ASYNCHRONOUS NOC ARCHITECTURES AND SYNTHESIS TOOL FLOWS

ost asynchronous NoC architectures target are a few promising recent exceptions,7 but the early M highly robust design techniques to facilitate stage of the hierarchical tool flow for network synthesis timing closure, composability, and tolerance of physical prevents the bundled-data NoC from keeping up with and operational delay variations. These designs use performance expectations. The goal of our research is to delay-insensitive (DI) codes on data channels, and a overcome these overheads, in both switch design and so-called “quasi-delay-insensitive” (QDI) style for switch tool development. design—whose only timing assumption is that wire forks are “isochronic,” i.e., have roughly equal branches.1,2 While this approach provides ease-of-design, it typically REFERENCES comes at a significant cost in area and power, due to the 1. J. Bainbridge and S. Furber, “Chain: A delay-insensi- use of two wires per bit and return-to-zero protocols. As tive chip area interconnect,” IEEE Micro, vol. 22, an example, the first generation of a mainstream QDI – switch with DI channels, ANoC, reports a 25% energy- no. 5, pp. 16 23, Sep./ Oct. 2002. “ per-flit overhead and 80% greater area compared to a 2. A. Lines, Asynchronous interconnect for synchro- ” – synchronous counterpart.3 However, it still achieves nous SoC design, IEEE Micro, vol. 24, no. 1, pp. 32 41, significant savings in total network power (85%) on low- Jan./ Feb. 2004. traffic telecom benchmarks, due to its inherent 3. Y. Thonnart, P. Vivet, and F. Clermidy, “A fully-asyn- asynchronous ability to exploit sparse activity. chronous low-power framework for GALS NoC Alternatively, several single-rail bundled-data integration,” in Proc. ACM/IEEE Des., Autom. Test Eur. asynchronous NoCs have been proposed, which Conf., 2010, pp. 33–38. incorporate relative timing constraints (RTCs), and show 4. T. Bjerregaard and J. Sparsoe, “A router architec- promise in overall cost metrics: coding efficiency, area, ture for connection-oriented service guarantees in power, and performance.4,5,6,7 However, the the MANGO clockless network-on-chip,” in Proc. development of automated CAD flows for these NoCs is ACM/IEEE Des., Autom. Test Eur. Conf., 2005, pp. especially challenging. Commercial CAD tools typically 1226–1231. support only absolute min/max delay constraints. In 5. R. Dobkin, R. Ginosar, and A. Kolodny, “QNoC asyn- contrast, asynchronous RTCs define the required chronous router,” Integr. VLSI J., vol. 42, no. 2, ordering of pairs of control and/or datapath delays (e.g., pp. 103–115, 2009. a control event must occur only after associated data is 6. M. Gibiluka, M. T. Moreira, F. G. Moraes, and fi valid), whose absolute values need not be de ned and N. L. V. Calazans, “BAT-Hermes: A transition-signal- may depend on later synthesis steps (gate mapping, ing bundled-data NoC router,” in Proc. IEEE Latin physical design). A basic iterative synthesis procedure Amer. Symp. Circuits Syst., 2015, pp. 1–4. 8 has been proposed, using synchronous CAD tools, but 7. M. Imai, T. V. Chu, K. Kise, and T. Yoneda, “The syn- its applicability is currently limited to small subsystems, chronous vs. asynchronous NoC routers: An apple- and no course of action is taken to ensure convergence to-apple comparison between synchronous and or to optimize quality of results in the general case. transition signaling asynchronous designs,” in Proc. Finally, while bundled-data NoCs have demonstrated ACM/IEEE Int. Symp. Netw.-on-Chip, 2016, pp. 1–8. benefits over synchronous NoCs in some cost metrics, 8. M. Gibiluka, M. T. Moreira, and N. L. V. Calazans, their limited optimization currently results in other “A bundled-data asynchronous circuit synthesis substantial overheads, especially in area6 and flow using a commercial EDA framework,” in Proc. performance.6,7 In addition, most bundled-data NoC Euromicro Conf. Digit. Syst. Des., 2015, pp. 79–86. research rarely goes beyond switch-level analysis. There

Second, asynchronous NoCs currently suffer from Main Contributions limited computer-aided design (CAD) tool support, This article aims at an inflection point in asynchronous due to a disconnect in timing models between clocked NoC design, which relies on two pillars. and asynchronous designs.

January/February 2021 IEEE Micro 71

Authorized licensed use limited to: Columbia University Libraries. Downloaded on January 28,2021 at 20:06:53 UTC from IEEE Xplore. Restrictions apply. GENERAL INTEREST

First, it presents a new asynchronous switch archi- can be scaled to connect an arbitrary number of input tecture combining the high performance of two- port modules (IPMs) and output port modules (OPMs). phase, or “transition-signaling,” communication proto- The figure shows an expanded view of an IPM and an cols (i.e., with only one round-trip handshake per OPM. The interconnecting crossbar is enclosed in the transaction) with the coding efficiency of single-rail OPM schematic. The architecture implements worm- bundled-data encoding. In practice, the datapath con- hole routing; hence, packets are processed at the level sists of synchronous-style “bundles” of single wires of individual flow control units (flits). per data bit, along with associated req/ack handshak- ing signals that toggle only once per data transfer, Mousetrap Asynchronous Pipelines thereby enabling higher throughput (see the “Asyn- The switch builds extensively on a high-performance chronous Communication Channels: Protocols and asynchronous pipeline, Mousetrap,8 developed at Data Encoding” sidebar). Columbia University [see structure in Figure 1(b), and Second, we developed an automated synthesis and detailed comparisons and evaluation in the related place-and-route flow for the bottom-up hierarchical paper8]. It uses a two-phase protocol and single-rail implementation of bundled-data asynchronous NoCs, bundled-data encoding, where the latter provides leveraging mainstream industrial synchronous CAD nearly identical coding efficiency as a synchronous tools. datapath and is fully compliant with a standard-cell This combination of transition-signaling asynchro- design methodology. Each data item advances nous communication and single-rail bundled data has through the pipeline “elastically,” based on local condi- only rarely been used for on-chip interconnection net- tions, coordinated by a so-called “capture-pass” hand- works. In fact, the fundamental challenges are 1) to shaking protocol: single-latch registers are normally master the potential area and complexity overhead of transparent, only closing to protect data immediately two-phase asynchronous pipelines within the switch, after it enters a stage. Once data enters the next and 2) efficiently to enforce one-sided relative timing stage, the current register is reopened. Mousetrap constraints (RTCs) on all bundled datapaths (i.e., bun- uses simple control circuits (a single exclusive-NOR dling constraints), using automated commercial gate) and data registers (a single bank of level-sensi- CAD tools and without overloading the synthesis tive D-latches), with low area and delay overheads. engine (see the “Asynchronous NoC Architectures ” and Synthesis Tool Flows sidebar). In particular, for Asynchronous Switch Design correct operation, the data wires on each channel As shown in Figure 1(a), the switch’s buffering must always be valid and stable before the corre- includes: sponding request is observed at the receiver side. Using the proposed architecture and tool flow › to synthesize a complete 44 2-D mesh topology, a single Mousetrap input stage, decoupling the cycle time of the upstream link from the switch; an asynchronous two-phase bundled-data NoC for › the first time is shown to dominate a clock-gated an asynchronous circular FIFO on each output port; synchronous counterpart for ultra-low power sys- › tems7 under most operating conditions. When pro- a single Mousetrap internal stage, decoupling jected to the bandwidth requirements of a full HD the cycle time of the switch from that of the cir- cular FIFO; video playback application, the asynchronous NoC › exhibits latency savings up to 37% and total power an optional output Mousetrap stage, decoupling savingsupto45%. the cycle time of the circular FIFO from the Finally, in a direct comparison with a recent com- downstream link. mercial AMD synchronous router, using identical 14- nm FinFET technology, results confirm substantial Initially, arriving flits of a packet are stored sequen- benefits: 55% lower area, 28% lower latency, and tially in the input Mousetrap stage and transmitted reductions of 88% idle and 58% active power. through the IPM. The request bundling signal and associated head flit are speculatively broadcast ASYNCHRONOUS BUNDLED-DATA through the crossbar to every OPM.9 Concurrently, NOC SWITCH ARCHITECTURE the Routing Logic computes the actual target output Figure 1(a) shows the top-level view of our asynchro- port for the packet, based on its head flit, which is nous switch architecture, instantiated with 5 I/O ports then stored into a Memory Element. The latter asserts for a 2-D mesh topology. The switch is modular and a single Path Allocation Request to the selected OPM

72 IEEE Micro January/February 2021

Authorized licensed use limited to: Columbia University Libraries. Downloaded on January 28,2021 at 20:06:53 UTC from IEEE Xplore. Restrictions apply. GENERAL INTEREST

FIGURE. 1. Proposed asynchronous switch architecture. (a) Schematic view. (b) Mousetrap asynchronous pipeline. (c) Routing logic. (d) Switch instance (2 planes, 2 VCs/plane).

for the entire packet transmission time, while other Hence, there is only a single roundtrip crossbar and link speculative requests are ignored. Once the allocation communication per flit, which enables higher through- request acquires the target OPM arbiter, the reserved put than the more common four-phase protocol. path from input to output channel for the remainder The output circular FIFO is latency-optimized and of the packet is normally transparent and free-flowing, comes with low area footprint. It uses a novel parallel unlike most synchronous designs, with latch registers microarchitecture, with Mousetrap-based single-latch closing only transiently after each flit arrives for flow architectural registers.9 control. Correct switch operation depends on several rela- A key OPM component is the four-way arbiter, tive timing constraints on datapath and control.9 mediating between competing input requests in con- While most constraints are typically satisfied for nor- tinuous time, without any reference clock. This com- mal operating conditions, two critical constraints are ponent, the only non-standard cell in the switch, likely to require additional synthesis effort: bundling incorporates three small analog mutexes that guaran- constraints between consecutive Mousetrap stages tee a correct resolution,9 though digital variants have [see Figure 1(b)], especially on links,8 and one on the been proposed with slightly degraded mean-time- routing logic [see Figure 1(c)], where a matched delay between-failure (MTBF).10 The arbiter has been gener- line ensures hazard-free operation.9 Margins enforced alized into a scalable family of N-way asynchronous on all relative timing constraints are also critical to tree arbiters, enabling the design of fully parameteriz- determining switch latency and throughput. able NM switches for generic topologies.11 The proposed architecture targets realistic switch Almost all communication in the design (e.g., intra- instances, such as the one in Figure 1(d), through the switch and inter-switch) uses two-phase signaling. support of physical planes (e.g., dedicated to different

January/February 2021 IEEE Micro 73

Authorized licensed use limited to: Columbia University Libraries. Downloaded on January 28,2021 at 20:06:53 UTC from IEEE Xplore. Restrictions apply. GENERAL INTEREST

message types) and of virtual channels (VCs) in each immediately set to its final target, but to more relaxed plane (e.g., to relieve head-of-line blocking, or for qual- intermediate values (e.g., no margin, then 5%, 7%, and ity-of-service). finally 10% of the datapath delay to be matched), which For efficient implementation of VCs, control logic the synthesis and the P&R tool can more easily fulfill. overhead is minimized by using a simple replicated Finally, especially during top-level topology (i.e., copy of the crossbar for each VC, then multiplexing link) synthesis, the concurrent convergence of many their outputs onto a single physical stream/link via a control signals on the intermediate or target RTM can small flit-level asynchronous arbiter in each I/O inter- be optionally further accelerated by an engineering face [see Figure 1(d)].12 Buffer availability downstream change order (ECO), which selectively places small is identified on a VC-basis using a new asynchronous delay elements on violating control paths. credit-based scheme that has been optimized for Unlike prior approaches, this procedure not only throughput.13 In practice, this “lazy credit update” pol- guarantees functional correctness (i.e., all RTCs are icy improves performance, deferring unnecessary non- met) but also gains tight control over RTMs, i.e., it critical credit increment updates, which are queued generally prevents the delays on control lines from and take place only with the next credit decrement greatly exceeding the datapath delays to be matched, request. thus avoiding significant latency and cycle time deg- Finally, two useful additional capabilities have radations. Simultaneously, it makes the convergence been developed: 1) a comprehensive built-in self- in satisfying several bundling constraints, in parallel, testing framework,14 and2)anFPGA-basedswitch computationally affordable and effective, without design and CAD framework, targeting the overloading the synthesis engine. Vivado tool set.10 SWITCH-LEVEL ASSESSMENT IMPLEMENTATION TOOL FLOW Our bundled-data asynchronous switch, called TaBuLA fl (Transition-Signaling Bundled-Data Lightweight Asyn- An automated tool ow for bundled-data NoC imple- y mentation using commercial synchronous CAD tools chronous router), is synthesized using the proposed fl has also been developed. It targets synchronous- automated tool ow and compared to a leading syn- 7 ’ equivalent design flexibility, as well as the bottom-up chronous NoC switch, xpipes. The latter sstreamlined fl hierarchical synthesis of complete asynchronous architecture, which combines instantiation-time exibil- fi NoCs with arbitrary topologies.15 ity with silicon ef ciency, makes it a representative Figure 2 provides an overview of the complete tool benchmark for the requirements of the ultra-low power z 16 flow. Synthesis steps are in blue boxes, while place- embedded computing domain. &-route (P&R) steps are in red boxes. It is structured Both switches are instantiated with the same VC- fi into a first-stage flow for switch macros (steps 1–10), less con guration and have homogeneous architec- and a second-stage flow for top-level design of the tures. The synchronous design is synthesized for maxi- network as a whole (steps 11–16). Without lack of gen- mum performance, using a state-of-the-art clock- erality, the Design Compiler is used for logic gating methodology. It reserves 1 clock cycle for synthesis and the IC Compiler is used for P&R. switch traversal and 1 cycle for link traversal. The flow revolves around a detailed optimization Post-layout results are reported for an ultra-low methodology of relative timing margins (RTMs), struc- power 40-nm industrial technology. Since it does not tured as a nested loop. include asynchronous special cells, standard-cell First, all datapath delays are locally and individually equivalent implementations are used for TaBuLA. optimized (in Steps 3 and 8). Then, an inner loop is used as a baseline iterative procedure that fine-tunes control Performance Evaluation path delays associated with such locally-optimized Table 1(a) evaluates basic performance metrics. The datapath delays. In order to handle the scale and opti- leftmost columns show results for switch traversal mization challenge of complex switches and network only, i.e., from input port to arrival at the output buffer topologies, we optionally provide an outer nested loop that hits RTMs gradually. In particular, the RTM is not yThe acronym suggests “tabula rasa,” meaning “blank slate,” denoting a “fresh start,” without pre-existing ideas, which is a goal of considering asynchronous interconnect technology. AMD, Inc., Greg Sadowski and Weiwei Jiang, “Self-Timed zThrough the INoCs startup, recently acquired by Arteris Inc., Router with Virtual Channel Control,” US Patent #10,075,383 the research ideas of the xpipes framework have become (2016). part of one of the largest commercial NoC ventures.

74 IEEE Micro January/February 2021

Authorized licensed use limited to: Columbia University Libraries. Downloaded on January 28,2021 at 20:06:53 UTC from IEEE Xplore. Restrictions apply. GENERAL INTEREST

FIGURE 2. Complete bottom-up hierarchical synthesis flow for bundled-data asynchronous NoCs. Steps enclosed in black boxes indicate timing optimization procedures. For simplicity, the figure shows the use of only the inner optimization loop (i.e., iterative convergence directly to the final RTM) for switch synthesis (steps #4-#5-#5a) and P&R (steps #9-#10-#10a), while nested optimi- zation loops (i.e., gradual convergence through intermediate relaxed margins) are used for timing convergence of the topology links (steps #14-#15-#15a-#15b-#15c), combined with ECO steps (#15b) that yield effective and fast timing optimization despite the large number of RTCs to handle at this stage.

January/February 2021 IEEE Micro 75

Authorized licensed use limited to: Columbia University Libraries. Downloaded on January 28,2021 at 20:06:53 UTC from IEEE Xplore. Restrictions apply. GENERAL INTEREST

TABLE 1. Comparative 5 5 switch evaluation. (a) Performance. (b) Area and energy-per-flit.

inputs. Cycle times are nearly identical, but TaBuLA’s › TaBuLA, the proposed two-phase bundled-data payload latency is nearly half of the synchronous approach; latency, since body and tail flits avoid the control logic › ANOC, a QDI switch using delay-insensitive four- for routing and switch allocation needed for header phase communication channels; flits. However, the synchronous switch exhibits 19% › BAT-Hermes, an alternative two-phase bundled- lower head flit latency than TaBuLA. This penalty is due data framework. to the RTMs, the phase conversion circuit, additional The QDI ANOC,aninfluential design series gates for glitch-free operation of the routing logic, and from CEA-LETI, has gone through several genera- the propagation delay through the decoupling Mouse- tions of technology, implementation, and architecture. trap register. A high-quality recent generation17 is considered in When considering the complete switch and ideal Table 2. (i.e., zero-delay) link traversals together, the synchro- When looking at absolute numbers, a compara- nous switch increases latency at the coarser granular- ble TaBuLA switch, also using two physical channels, ity of full clock cycles while TaBuLA is more adaptive, performs consistently better than ANOC in all met- and only accounts for the actual link delay (in this case, rics: 11% lower latency, 20% higher throughput, 82% only through the Circular FIFO). This amplifies the lower area, and 83% lower energy-per-bit. payload latency overhead of xpipes (from 94% to 108%) – After technology and threshold voltage normal- and reverses its comparative head latency (from 19% x ization, latencies are roughly comparable, but TaBuLA to þ19%). The impact of realistic link delays will be exhibits roughly 10% higher throughput, despite its assessed later in the network-level analysis. earlier stage of tool development. ANOC largely off- sets the performance penalty of its four-phase com- Cost Analysis munication protocol through deep pipelining of its As listed in Table 1(b), despite the architectural homo- completion detection logic, and using the latest gener- fl geneity, xpipes consumes 15% more area than TaBuLA. ation of its timing optimization tool ow, but is none- Table 1(b) also reports total energy-per-flit theless unable to close the gap with the two-phase results (static and dynamic), showing a synchronous TaBuLA. fi overhead ranging from 9% to 21% for continuous Signi cant advantages, however, appear in the injection, and from 54% to 69% for moderate injec- remaining metrics, after normalization, where TaB- tion. TaBuLA’s energy-per-flit is largely insensitive to uLA has 72% lower energy-per-bit and 52% lower the traffic scenario, since TaBuLA burns energy area. These results clearly correlate to its use of mainly for productive switching activity and not for bundled-data encoding and two-phase handshaking idle resources. In contrast, synchronous energy-per- protocols. ’ flit increases as the injection rate decreases, since In contrast to TaBuLA,someofANOC s design- fixed clock-tree power contributions are divided space decisions are oriented to extreme robustness: over a lower trafficvolume. resilience to arbitrary bit skew, and to high physical

xThe fanout-of-4 (FO4) delay metric (estimated from technol- COMPARISON WITH STATE OF ogy databooks) is used for performance normalization. The THE ART scaling ratio of feature size is applied for area and energy- Table 2 compares the performance, area and energy- per-bit, assuming identical supply voltages. When comparing multi-Vth ANOC with standard-Vth TaBuLA technology, per-bit of recent asynchronous NoC switches using scaled area numbers are slightly pessimistic for ANOC, while three state-of-the-art asynchronous design styles: ANOC’s scaled energy-per-bit is optimistic.

76 IEEE Micro January/February 2021

Authorized licensed use limited to: Columbia University Libraries. Downloaded on January 28,2021 at 20:06:53 UTC from IEEE Xplore. Restrictions apply. GENERAL INTEREST

TABLE 2. Comparison of asynchronous NoC switches.

and operational variability, as well as to simplify flit width equalization for direct comparison. Our large-scale physical design. Such an approach can projections indicate that TaBuLA’s savings over BAT- result in over-design for many practical systems. Hermes are still as large as 32% for area, 65% for TaBuLA’s less conservative style leads to major latency, from 18% to 38% for energy-per-bit, with 70% overall cost benefits, as shown above, while still higher throughput. These benefits are attributed to leaving the designer with the flexibility to locally TaBuLA’s lightweight control logic and more efficient fine-tune timing margins for the requirements of the timing optimization tool flow. system at hand. It is also noteworthy that ANOC crossbars, using NETWORK-LEVEL ASSESSMENT delay-insensitive codes, are much more wire-intensive, For network-level evaluation, one complete xpipes- with roughly twice as many wires as in TaBuLA. Hence, based synchronous 44 2-D-mesh NoC (with clock TaBuLA is better suited for environments requiring gating) and two asynchronous 44 2-D-mesh TaBuLA high-radix switches, such as irregular topologies, 3-D NoCs are laid out in identical 40 nm technology. The stacking or neuromorphic computing. Finally, Table 2 presents an unscaled comparison to BAT-Hermes,18 a 16-bit two-phase bundled-data We optimistically assume that latency and throughput are fl fl preserved as it width is increased from 16 to 32 bits; while switch. Despite its doubled it width, TaBuLA has 58% for area, we project a 60% overall increment, from similar flit lower area, 78% lower latency, and 2.8 equivalent width scaling experiments on TaBuLA. Supply voltage is speed (in MFlit/sec), with comparable energy-per-bit. scaled through the popular Alpha-Power Law MOS model. FO4 and dimension-scaling ratios are used to normalize per- To normalize results, in addition to technology formance and area/energy-per-bit to the same process node, scaling, BAT-Hermes also requires supply voltage and respectively.

January/February 2021 IEEE Micro 77

Authorized licensed use limited to: Columbia University Libraries. Downloaded on January 28,2021 at 20:06:53 UTC from IEEE Xplore. Restrictions apply. GENERAL INTEREST

FIGURE 3. Comparative network evaluation. (a) Load curves. (b) Power consumption.

two asynchronous variants are with (pipelined)andwith- better saturation rate in an extreme region that typi- out (unpipelined) an additional output Mousetrap stage cally lacks practical interest. decoupling the cycle time of circular FIFOs from the downstream links. The interswitch link length is set to 1 Power Analysis mm, reflecting typical tile sizes in ultra-low power proc- Total power results are illustrated in Figure 3(b). At low essing platforms. Tiles are modeled as hard non-routable injection rates, the power saving capability of both 15 obstructions. asynchronous NoCs (roughly 40 mW) is out-of-reach even for the clock-gated synchronous counterpart. Performance Analysis With higher traffic, differentiations in design over- The synchronous NoC achieves timing closure at head between xpipes and pipelined-TaBuLA result in 1 GHz without any guardband. In contrast, the asyn- distinct trends for short and long packets. chronous NoCs are conservatively synthesized with a The power of the synchronous switch is sensitive 10% RTM; hence, the asynchronous results are more to its much larger arbitration overhead, clock-tree pessimistic. power and more expensive FSM-based flow control. Load curves are reported in Figure 3(a) for different For the latter, though xpipes handles back-pressure packet lengths with uniform random traffic. When through a standard low-overhead stall/go protocol, it compared with the synchronous design, pipelined- requires at least double buffers to avoid dropping TaBuLA exhibits its full potential when the packet incoming packets in case of a stall from the down- length is long enough, e.g., 20 flits in this study. Here, stream receiver. performance is dominated by the flow-through opera- In contrast, TaBuLA has built-in asynchronous sup- tion capability of payload flits, where the asynchro- port for flow control through its simple req/ack signal- nous switches show substantial benefits, achieving ing protocol, hence a single Mousetrap buffer is 36% lower zero-load latency (45.56 ns for synchro- sufficient for correct operation. TaBuLA’s power over- nous, 28.92 ns for asynchronous) and entering the sat- head is mainly dominated by the four Mousetrap regis- uration region at an injection rate that is 33% higher, ters that each flit must traverse per switch, as opposed over the synchronous network. to two registers in xpipes, while its lightweight arbitra- In contrast, when short (e.g., 3-flit) packets are tion contributes only negligible overhead. used, pipelined-TaBuLA’s performance is mainly deter- On balance, with 3-flit packets, pipelined-TaBuLA mined by head flits. The latency of asynchronous head shows consistent improvement over xpipes, with 28%, flits is greater than in the synchronous design, due to 17%, 12% and 10% lower total power, respectively, at the effects of link handshaking, relative timing mar- injection rates of 100, 200, 300 and 400 MFlit/port/s, gins, and propagation through additional Mousetrap before the onset of saturation effects. stages. Still, the asynchronous NoC provides a 9% With 20-flit packets, however, the overall contribu- lower zero-load latency for delivering an entire packet tion of arbitration is smaller, leading to a more gradual (10.79 ns for synchronous, 9.83 ns for asynchronous), synchronous curve. Here, pipelined-TaBuLA out-per- which improves to 33% lower latency at an injection forms xpipes up to a power break-even point at 65% of rate of 500 MFlit/port/s, after which both solutions the maximum injection rate, beyond which the satura- begin saturating. Finally, the synchronous NoC has a tion of the synchronous NoC begins.

78 IEEE Micro January/February 2021

Authorized licensed use limited to: Columbia University Libraries. Downloaded on January 28,2021 at 20:06:53 UTC from IEEE Xplore. Restrictions apply. GENERAL INTEREST

Asynchronous total power savings are significant, and power/performance monitoring and control in ranging from 17% to 60% over the synchronous ver- recent high-end processor and graphics products. sion, as traffic decreases from near break-even (200 This evaluation is the first reported direct compari- MFlit/port/s) to lower injection rates (100 MFlit/port/ son of asynchronous and commercial synchronous s). Above the break-even point, xpipes saves from 10% NoC switches in identical advanced (14-nm FinFET) to 8% when moving up from 300 MFlit/port/s to the technology. start of its saturation region. Further asynchronous To match CommSw’s microarchitecture, TaBuLA improvements are anticipated, since the current was expanded into a 2-plane, 2-VC-per-plane implemen- design is not fully optimized for switching activity in tation [see Figure 1(d)]. The three-cycle CommSw was yy the routing logic and circular FIFOs. synthesized for the target speed of 1 GHz. Finally, the unpipelined-TaBuLA NoC provides a The asynchronous switch was implemented using radical performance-power tradeoff. While it enters synchronous design and validation tools of the indus- the saturation region at an injection bandwidth that is trial partner, but limiting the reinstrumentation effort roughly one third of that of the synchronous NoC for of the stable existing flow as much as possible (e.g., our both packet lengths, it offers substantial power sav- gradual convergence option could not be used).13 Due ings, ranging from 57% to 45% for 3-flit packets, and to the use of advanced commercial technology, only from 80% to 50% for 20-flit packets, when moving post-synthesis results can be reported, but actual from injection rates of 100 MFlit/port/sec to satura- parasitics of standard cells were nonetheless imported tion. These results make it attractive for low-end for accurate power analysis. embedded systems. The asynchronous router has 55% lower area, 28% lower latency, and 88% and 58% savings in idle and active power, respectively. Most of these savings Realistic Traffic are due to the use of single-latch-based Mousetrap Projected bandwidth requirements are also evaluated registers (with small area footprint, and low energy for a full HD video playback application for high-end and critical-path latency), lack of global clock distribu- 19 mobile devices, with 19201080 pixels at 60 frames/ tion, and on-demand activation. s. The application has 19 communication flows with average bandwidth ranging from 0.1 to 500 MB/s. Even injecting at the maximum data rate of 125 MFlit/s from THIS EVALUATION IS THE FIRST ’ each switch s local port (although demanded by only 5 REPORTED DIRECT COMPARISON OF fl of the 19 ows in the original application), and adding ASYNCHRONOUS AND COMMERCIAL the head flit overhead, the asynchronous NoC is SYNCHRONOUS NOC SWITCHES IN largely dominating. With 3-flit and 20-flit packets, IDENTICAL ADVANCED (14-NM power savings are 18% and 45%, respectively, while latency savings are 22% and 37%. FINFET) TECHNOLOGY. Finally, TaBuLA can handle the communication challenges posed by future high-performance data analytics in Edge computing platforms. In particular, it can sustain the communication bandwidth require- ments of a pool of 16 inference engines, such as the CONCLUSIONS NVDLA deep neural network accelerator,20 with power Emerging computing architectures call for increas- savings of 18% and 10% over xpipes, respectively, as the ing levels of asynchrony during system-level inte- number of MAC units per engine increases from 32 to gration. This article proposes a novel asynchronous 64. In addition, TaBuLA obtains latency savings of 22% interconnect technology, TaBuLA, which can fulfill and 40%, respectively. this requirement with cost metrics (area, energy- per-bit) that are largely out-of-reach for mainstream synchronous counterparts, while preserving or VALIDATION IN AN INDUSTRIAL improving performance depending on the operating ENVIRONMENT conditions. A prototype of the pipelined-TaBuLA switch was imple- mented at AMD Research and compared directly to a yyCommSw also has functionality for error detection and con- commercial synchronous switch (hereafter named figuration, which contributes only a 1%–4% area and power CommSw),13 used to handle system-level configuration increase, with negligible performance impact.

January/February 2021 IEEE Micro 79

Authorized licensed use limited to: Columbia University Libraries. Downloaded on January 28,2021 at 20:06:53 UTC from IEEE Xplore. Restrictions apply. GENERAL INTEREST

ACKNOWLEDGMENTS 11. G. Miorandi, D. Bertozzi, and S. M. Nowick, “ This work was supported in part by the “Fondo Increasing impartiality and robustness in high- ” Giovani” Ph.D. program (Italian Government), in part performance N-way asynchronous arbiters, in Proc. by an FIR Grant (University of Ferrara), in part by the IEEE Int. Symp. Asynchronous Circuits Syst.,2015, National Science Foundation (NSF) under Grants pp. 108–115. CCF-1219013 and CCF-1527796, and in part by AMD 12. G. Miorandi, A. Ghiribaldi, S. M. Nowick, and under a grant from the DOE Exascale Program. This D. Bertozzi, “Crossbar replication vs. sharing for virtual work was completed while Prof. Burleson was at channel flow control in asynchronous NoCs: A Advanced Micro Devices, Inc. comparative study,” in Proc. IFIP/IEEE Conf. Very Large Scale Integr., 2014, pp. 1–6. REFERENCES 13. W. Jiang, D. Bertozzi, G. Miorandi, S. M. Nowick, “ 1. H. Esmaeilzadeh, E. Blem, R. St. Amant, W. Burleson, and G. Sadowski, An asynchronous NoC router in a 14nm FinFET library: Comparison to K. Sankaralingam, and D. Burger, “Dark silicon and an industrial synchronous counterpart,” in Proc. the end of multicore scaling,” IEEE Micro,vol.32, ACM/IEEE Des. Autom. Test Eur. Conf.,2017, no. 3, pp. 122–134, May-Jun. 2012. – 2. R. G. Kim et al., “Imitation learning for dynamic VFI pp. 732 733. “ control in large-scale manycore systems,” IEEE Trans. 14. G. Miorandi, A. Celin, M. Favalli, and D. Bertozzi, A built- Very Large Scale Integr. (VLSI) Syst. , vol. 25, no. 9, in self-testing framework for asynchronous bundled- ” Proc. pp. 2458–2471, Sep. 2017. data NoC switches resilient to delay variations, in ACM/IEEE Int. Symp. Network-on-Chip – 3. J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, K. Gururaj, and , 2016, pp. 1 8. 15. G. Miorandi, M. Balboni, S. M. Nowick, and G. Reinman, “Accelerator-rich architectures: D. Bertozzi, “Accurate assessment of bundled-data Opportunities and progresses,” in Proc. ACM/IEEE Des. asynchronous NoCs enabled by a predictable and Autom. Conf.,2014,pp.1–6. fi fl ” 4. F. Akopyan et al., “TrueNorth: Design and tool flow of a ef cient hierarchical synthesis ow, in Proc. IEEE Int. – 65 mW 1 million neuron programmable neurosynaptic Symp. Asynchronous Circuits Syst., 2017, pp. 10 17. “ chip,” IEEE Trans. Comput.-Aided Des. Integr. Circuits 16. J. J. Lecler and G. Baillieu, Application-driven network- fi Syst., vol. 34, no. 10, pp. 1537–1557, 2015. on-chip architecture exploration and re nement for a ” 5. M. Davies et al., “Loihi: A neuromorphic manycore complex SoC, Springer Des. Autom. Embedded Syst., – processor with on-chip learning,” IEEE Micro, vol. 38, no. vol. 15, no. 2, pp. 133 158, 2011. et al. “ 1, pp. 82–99, Jan.-Feb. 2018. 17. P. Vivet , A 4x4x2 homogeneous scalable 3D 6. M.Krstic,E.Grass,F.K.Gurkaynak,€ and P. Vivet, “Globally network-on-chip circuit with 326 MFlit/s 0.66pJ/bit ” asynchronous, locally synchronous circuits: Overview and robust and fault tolerant asynchronous 3D links, – outlook,” IEEE Des. Test Comput., vol. 24, no. 5, pp. 430–441, IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 33 49, Sep.-Oct. 2007. Jan. 2017. 7. S. Stergiou, F. Angiolini, S. Carta, L. Raffo, D. 18. M. Gibiluka, M. T. Moreira, F. G. Moraes, and “ Bertozzi, and G. De Micheli, “xpipes lite: A synthesis N. L. V. Calazans, BAT-Hermes: A transition-signaling ” oriented design library for networks on chips,” in bundled-data NoC router, in Proc. IEEE Latin Amer. – Proc. ACM/IEEE Des. Autom. Test Eur. Conf.,2005, Symp. Circuits Syst., 2015, pp. 1 4. pp. 1188–1193. 19. H. F. Tatenguem, D. Ludovici, A. Strano, D. Bertozzi, and “ 8. M. Singh and S. M. Nowick, “MOUSETRAP: High-speed H. Reinig, Contrasting multi-synchronous MPSoC design fi transition-signaling asynchronous pipelines,” IEEE styles for ne-grained clock domain partitioning: The full- ” Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 6, HD video playback case study, in Proc. ACM Int. – pp. 684–698, Jun. 2007. Workshop Network-on-Chip Archit., 2011, pp. 37 42. 9. A. Ghiribaldi, D. Bertozzi, and S. M. Nowick, “A 20. Memory bandwidth data, http://nvdla.org/primer.html transition-signaling bundled data NoC switch architecture for cost-effective GALS multicore DAVIDE BERTOZZI is currently a Professor with the Depart- systems,” in Proc. ACM/IEEE Des. Autom. Test Eur. ment of Engineering, University of Ferrara. His research Conf., 2013, pp. 332–337. focuses on chip-scale interconnect technology and on its 10. K. Bhardwaj, P. Mantovani, L. P. Carloni, and capability to enable new system architectures. Bertozzi has a S. M. Nowick, “Towards a complete methodology for Ph.D. in electrical engineering from the University of Bologna. synthesizing bundled-data asynchronous circuits on He is a member of IEEE, ACM, and the HiPEAC European Net- FPGAs,” in Proc. IEEE Int. Symp. Low-Power Electron. work-of-Excellence on High-Performance Embedded Archi- Des., 2019, pp. 1–6. tectures and Compilation. Contact him at [email protected].

80 IEEE Micro January/February 2021

Authorized licensed use limited to: Columbia University Libraries. Downloaded on January 28,2021 at 20:06:53 UTC from IEEE Xplore. Restrictions apply. GENERAL INTEREST

GABRIELE MIORANDI is currently a digital circuit designer KSHITIJ BHARDWAJ is a research staff member at Lawrence with the High-Performance Data Center Solutions Group, Livermore National Laboratory, before which he was a Postdoc- Microchip, Milan, Italy. Miorandi has a Ph.D. in electrical engi- toral Research Fellow with the Computer Architecture and VLSI neering from the University of Ferrara, where he performed the group at Harvard University. His research interests include work for this article. The focus of his research was on architec- asynchronous networks-on-chip, GALS systems, many-acceler- tural optimization and synthesis methods for two-phase bun- ator SoCs for AI applications, and system-level optimization dled-data asynchronous network-on-chip design. Contact him using machine learning. Bhardwaj has a Ph.D. in computer sci- at [email protected]. ence from Columbia University, where he performed the work for this article. Contact him at [email protected]. ALBERTO GHIRIBALDI is a co-founder of ArzaMed s.r.l., a web software company in Rimini, Italy, where he is responsible for product research and development, as well as server cloud WEIWEI JIANG is currently an Architecture and RTL Design infrastructure. Ghiribaldi has a Ph.D. in electrical engineering Engineer for networking chip development with Cloud. from the University of Ferrara, where he performed the work His research interests include asynchronous and mixed- for this article. The focus of his research was on cost-effective timing digital circuits, high-performance and low-power asyn- asynchronous NoC switch architectures using two-phase chronous NoCs and GALS systems, EDA tools for ASICs and bundled data. Contact him at [email protected]. FPGAs. Jiang has a Ph.D. in computer science from Columbia University. He performed the work for this article when intern- WAYNE BURLESON has been a Professor of Electrical ing at AMD Research during his Ph.D. Contact him at and Computer Engineering, University of Massachusetts [email protected]. Amherst, since 1990. From 2012 to 2017, he was a Senior Fellow with AMD Research in Boston, during which he per- formed the work in this article. His primary research is in the general area of VLSI, including circuits and CAD for clocking STEVEN M. NOWICK is currently a Professor of Computer Sci- and low-power, and security engineering. He has authored ence at Columbia University, and a former chair and co-founder more than 200 refereed publications in these areas and is a of the Computer Engineering Program. He was founding chair Fellow of IEEE for contributions in integrated circuit design of the “Computing Systems for Data-Driven Science” Center in and signal processing. Contact him at [email protected]. Columbia’s Data Science Institute and a co-founder of the IEEE “Async” Symposium. His research interests include asynchro- GREG SADOWSKI is currently a Technical Fellow with nous and GALS digital systems, CAD optimization, scalable Advanced Micro Devices, Boxborough, MA, USA, where he is high-performance and low-power networks-on-chip, and involved in the performance optimization of both GPU and ML embedded and ultra-low energy systems. He has collaborated technologies. His research interests include continued perfor- on asynchronous designs with AMD, Boeing, IBM, and NASA mance and power improvements of machine intelligence and Goddard. Nowick has a Ph.D. in computer science from Stan- GPU systems. He holds 48 U.S. patents. He is a senior member ford University. He is a Fellow of IEEE. Contact him at of IEEE. Contact him at [email protected]. [email protected].

January/February 2021 IEEE Micro 81

Authorized licensed use limited to: Columbia University Libraries. Downloaded on January 28,2021 at 20:06:53 UTC from IEEE Xplore. Restrictions apply.