Asynchronous Logic: A Systems Perspective

Rajit Manohar
Computer Systems Lab, Cornell
http://vlsi.cornell.edu/

Approaches to computation

                    Discrete Time                 Continuous Time
Discrete Value      Digital, synchronous logic    Digital, asynchronous logic
Continuous Value    Switched-capacitor analog     General analog computation

Asynchronous logic

[Figure: timing of data transfer. A synchronous receiver samples the data wire within a clock-defined sampling window; an asynchronous sender and receiver instead exchange explicit request (req) and acknowledge (ack) transitions, with the data wires (data 0, data 1) carrying the value.]

Explicit versus implicit synchronization

[Figure: the same events A1–A4, B1–B4, C1–C4 of three processes A, B, C shown on two timelines: synchronous computation aligns all events to common time steps, while asynchronous computation lets them interleave freely in continuous time.]

Synchronous computation versus asynchronous computation

Differences at a glance

Asynchronous                                  Synchronous
Continuous-time computation.                  Discrete-time computation.
Observed delay is the average over            Observed delay is the maximum over
possible circuit paths.                       possible circuit paths.
Local switching from data-driven              Global activity from clock-driven
computation; can lead to low-power            computation; low power requires
operation.                                    careful design.
Throughput determined by device speed         Throughput determined by device speed
(margins needed for some circuit              and additional operating margins.
families).
Data must be encoded, requiring               Unencoded data acceptable due to the
additional wires for signals.                 global clock network.

A little bit of (biased) history…

1940–1970:
• Binary addition (von Neumann)
• Switching theory (Miller)
• Flow tables (Huffman)
• Illiac II (UIUC/Muller)
• Atlas, MU5 (Manchester)
• Macromodular systems (Clarke, Molnar, Stucki, …)
• System timing (Seitz, in Mead/Conway)

1980–2010:
• Micropipelines (Sutherland)
• Trace theory (Snepscheut)
• Syntactic compilation (Martin, Rem)
• General hazard-free synthesis
• AER (Mead lab)
• Asynchronous CPU (Martin)
• ARM with caches (Furber)
• Commercial 80C51 (Philips)
• 17FO4 MIPS CPU (Martin)
• UltraSparc IIIi memory controller (Sun)
• Pipelined FPGAs (Manohar)
• 10G Ethernet switch (Fulcrum)
• Event-driven CPU (Manohar)
• Sensory systems (retina, cochlea, etc.)
• FACETS (Heidelberg)
• SpiNNaker (Furber)
• HiAER (Cauwenberghs)
• Neurogrid (Boahen)
• TrueNorth (IBM/Cornell)

I. Exploiting the gap between average and max delay

    carries:    1 1 0 0 0 0
                1 0 1 1 0 0
              + 0 1 1 0 1 0
              ---------------
              1 0 0 0 1 1 0

• Slow part: computing carries
  ❖ … but only in the worst case!

Theorem [von Neumann, 1946]. The average-case latency for a ripple-carry binary adder is O(log N) for i.i.d. inputs.

A.W. Burks, H.H. Goldstine, J. von Neumann. Preliminary Discussion of the Logical Design of an Electronic Computing Instrument. (1946)
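The average/worst-case gap is easy to see in simulation. The sketch below (illustrative Python; the helper names are my own, not from the talk) measures the longest carry-propagation run when adding random N-bit numbers: the worst case is N, but the mean stays close to log₂ N.

```python
import random

def longest_carry_chain(a: int, b: int, n: int) -> int:
    """Length of the longest run of consecutive positions in which a
    carry ripples when adding two n-bit numbers bit-serially."""
    carry, run, longest = 0, 0, 0
    for i in range(n):
        x = (a >> i) & 1
        y = (b >> i) & 1
        carry_out = (x + y + carry) >= 2
        # The chain continues while the carry keeps rippling.
        run = run + 1 if carry_out else 0
        longest = max(longest, run)
        carry = 1 if carry_out else 0
    return longest

def average_chain(n: int, trials: int = 2000) -> float:
    """Mean longest carry chain over random i.i.d. n-bit operands."""
    random.seed(0)
    total = 0
    for _ in range(trials):
        total += longest_carry_chain(random.getrandbits(n),
                                     random.getrandbits(n), n)
    return total / trials
```

Growing N from 64 to 1024 bits raises the average only by a few bit positions, even though the worst case grows 16x.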

I. Exploiting the gap between average and max delay

• Standard adder tree
  ❖ O(log N) latency
• Hybrid adder
  ❖ Tree adder + ripple-carry adder

Theorem [Winograd, 1965]. The worst-case latency for binary addition is Ω(log N).

Theorem. The average-case latency of the hybrid asynchronous adder is O(log log N) for i.i.d. inputs.

Theorem. The average-case latency of the hybrid asynchronous adder is optimal for any input distribution.

S. Winograd. On the Time Required to Perform Addition. JACM (1965)
R. Manohar and J. Tierno. Asynchronous Parallel Prefix Computation. IEEE Transactions on Computers (1998)

II. Exploiting data-driven power management

• Low-power FIR filter
  ❖ External, synchronous I/O
  ❖ Monitor occupancy of FIFOs
  ❖ Speed up / slow down using an adaptive power supply (DC/DC converter adjusts Vdd)

• Break-point add/multiply
  ❖ Detect when the top half of an operand is required
  ❖ Dynamically switch between full-width and half-width arithmetic

L. Nielsen and J. Sparsø. A Low-Power Asynchronous Datapath for a FIR Filter Bank. Proc. ASYNC (1996)
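A minimal model of the break-point idea (my own sketch, not the Nielsen–Sparsø circuit; the function names and the 0.5/1.0 energy figures are illustrative assumptions): use the narrow datapath whenever the upper halves of both operands and the result are pure sign extension.

```python
def needs_full_width(x: int, y: int, bits: int = 32) -> bool:
    """True when x, y, or x + y does not fit in the low half of the
    datapath as a sign-extended two's-complement value."""
    half = bits // 2
    lo, hi = -(1 << (half - 1)), 1 << (half - 1)
    return not all(lo <= v < hi for v in (x, y, x + y))

def breakpoint_add(x: int, y: int, bits: int = 32, stats=None) -> int:
    """Break-point adder model: pick half-width arithmetic when it
    suffices, charging a lower notional energy cost (0.5 vs 1.0 --
    illustrative numbers only, not measured data)."""
    cost = 1.0 if needs_full_width(x, y, bits) else 0.5
    if stats is not None:
        stats.append(cost)   # record which datapath width was used
    return x + y
```

On data dominated by small values, most additions take the cheap path, which is exactly the data-driven saving the slide describes.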

II. Exploiting data-driven power management

• Width-adaptive numbers
  ❖ Compress leading 0's and 1's
  ❖ 8 bits per VDW block; 128-bit datapath
  ❖ Example: sending 000···0001 + 000···0001 = 000···0010 at full width versus with the leading bits compacted away

[Figure: sender and receiver exchanging full-width versus compacted operands, and cumulative distribution functions of operand widths for SPEC benchmarks 124.m88ksim, 126.gcc, 129.compress, and 132.ijpeg under no compaction, simple compaction, and full compaction; average operand widths fall roughly between 8 and 16 bits.]

R. Manohar. Width-Adaptive Data Word Architectures. Proc. ARVLSI (2001)
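The compaction itself is simple to state. This sketch (hypothetical helper names; a model of the idea, not the VDW hardware encoding) computes how many bits an operand needs once redundant leading 0s or 1s are dropped, and how many 8-bit blocks of the datapath actually carry data.

```python
def significant_width(x: int) -> int:
    """Bits needed to represent x in two's complement once redundant
    leading 0s (for x >= 0) or leading 1s (for x < 0) are compressed."""
    return (x.bit_length() if x >= 0 else (~x).bit_length()) + 1

def vdw_blocks(x: int, block: int = 8) -> int:
    """Number of block-wide datapath slices that carry significant
    bits of x (an illustrative model of width-adaptive transmission)."""
    return -(-significant_width(x) // block)   # ceiling division
```

On the SPEC-style distributions in the figure, typical operands need one or two 8-bit blocks of a 128-bit datapath, so most of the datapath never switches.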

III. Exploiting elasticity of pipelines

[Figure: the same computation f → g → h implemented with two different circuit-level pipeline depths.]

Asynchronous computation is* robust to changes in circuit-level pipelining!

*R. Manohar, A. J. Martin. Slack Elasticity in Concurrent Computing. Proc. Mathematics of Program Construction (1998)
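Slack elasticity can be demonstrated with a token-level model (my own sketch under simple assumptions: pure per-stage functions and FIFO buffers, not the formalism of the cited paper): the output stream is the same no matter how much buffering sits between stages.

```python
from collections import deque

def run_pipeline(stages, inputs, slack):
    """Push tokens through a linear pipeline of pure functions with a
    bounded FIFO of capacity `slack` between neighbouring stages."""
    queues = [deque(inputs)] + [deque() for _ in stages]
    out = []
    moved = True
    while moved:
        moved = False
        for i, f in enumerate(stages):
            # A stage fires when it has an input token and downstream space.
            if queues[i] and len(queues[i + 1]) < slack:
                queues[i + 1].append(f(queues[i].popleft()))
                moved = True
        while queues[-1]:
            # The environment always accepts output tokens.
            out.append(queues[-1].popleft())
            moved = True
    return out
```

Changing `slack` changes throughput and latency, but never the computed values or their order, which is why circuit-level repipelining is safe.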

III. Exploiting elasticity of pipelines

• Asynchronous FIR filter
  ❖ Data arrives at variable rates
  ❖ Data rates are predictable

• Fixed amount of time for the asynchronous computation
• Variable number of clock cycles within the fixed time budget

M. Singh, J.A. Tierno, A. Rylyakov, S. Rylov, S.M. Nowick. An adaptively pipelined mixed synchronous-asynchronous digital FIR filter chip operating at 1.3 GHz. Proc. ASYNC (2002)

III. Exploiting elasticity of pipelines

• FPGA throughput is low due to the overhead of programmable interconnect

• ~3x improvement in peak throughput from transparent interconnect pipelining

[Figure: FPGA fabric of logic blocks (LB), connection boxes (CB), and switch boxes (SB).]

J. Teifel and R. Manohar. Highly pipelined asynchronous FPGAs. Proc. FPGA (2004)

IV. Exploiting timing robustness

Asynchronous FPGA Test Data
• Process: TSMC 0.18um
• Nominal Vdd: 1.8V

[Figure: throughput (MHz, axis up to 1200) versus supply voltage (0.5–2.5 V) for asynchronous FPGA test designs of several sizes (12K, 77K, 294K, 400K), compared against a Xilinx Virtex.]

D. Fang, J. Teifel, and R. Manohar. A High-Performance Asynchronous FPGA: Test Results. Proc. FCCM (2005)

IV. Exploiting timing robustness

• “GasP” logic family
  ❖ 6–10 FO4
  ❖ Single-track control wire with self-reset
• No margins on control signals
• Standard datapath
• Latest 40nm chip operates at >6 GHz

[Figure: GasP FIFO control stage between RA(i) and RA(i+1): the single-track wire is set, held while waiting for L set / R reset, then self-reset; a driver connects it to the datapath.]

I. Sutherland, S. Fairbanks. GasP: a minimal FIFO control. Proc. ASYNC (2001)

V. Exploiting low noise and low EMI

[Figure: supply current of the synchronous 80C51 (Philips) and the asynchronous 80C51 (Philips), shown in the time domain and in the frequency domain (0–400 MHz).]

J. Kessels, T. Kramer, G. den Besten, A. Peeters, V. Timm. Applying Asynchronous Circuits in Contactless Smart Cards. Proc. ASYNC (2000)

VI. Exploiting continuous-time information processing

• Continuous-time processing

• Automatic adaptation to input bandwidth

• No aliasing from the sampling process

[Figure: Nyquist sampling versus level-crossing sampling.]

B. Schell, Y. Tsividis. A clockless adc/dsp/dac system with activity- dependent power dissipation and no aliasing. Proc. ISSCC (2008)
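Level-crossing sampling is easy to sketch. The model below (my own simplification: a quantization grid of width delta crossed by a discretized signal, not the comparator circuit of the cited ADC) emits an event only when the input moves into a new level bin, so a quiet input produces no samples at all.

```python
import math

def level_crossing_sample(signal, delta):
    """Event-driven sampling: return (index, level) events emitted
    whenever the signal crosses into a new quantization bin of width
    `delta`. Sample rate tracks signal activity, not a clock."""
    events = []
    last = math.floor(signal[0] / delta)
    for i, v in enumerate(signal[1:], start=1):
        lvl = math.floor(v / delta)
        if lvl != last:
            events.append((i, lvl * delta))
            last = lvl
    return events
```

A fast-moving input generates proportionally more events than a slow one, which is the automatic bandwidth adaptation the slide refers to.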

VI. Exploiting continuous-time information processing

• Address-event representation in neuromorphic systems
  ❖ “Time represents itself”
  ❖ Spike timing and coincidence for information processing
• Asynchronous logic
  ❖ Temporal resolution is decoupled from throughput
  ❖ Automatic power management
• Projects
  ❖ Silicon retinas, cochleas, …
  ❖ More recent: FACETS, HiAER, Neurogrid, neuroP, SpiNNaker, TrueNorth

A note on oscillations in asynchronous logic

• Well-known result in asynchronous circuit theory
  ❖ Signal transitions: look at the i-th occurrence of transition t
  ❖ time(t, i) = the actual time of this event
  ❖ time₀(t, i) = f(t) + p · i for a fixed period p

• Then: time(t, i) → time₀(t, i) — event timing becomes periodic

• More recently: stronger periodicity under more general conditions

J. Magott. Performance Evaluation of Concurrent Systems Using Petri Nets. IPL (1984)
S.M. Burns, A.J. Martin. Performance Analysis and Optimization of Asynchronous Circuits. Proc. ARVLSI (1990)

W. Hua and R. Manohar. Exact timing analysis for concurrent systems. To appear, work-in-progress session, DAC (2016)
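The periodicity result can be seen in a toy max-plus recurrence (my own example with made-up delays a, b, c, d; not the analysis machinery of the cited papers): two mutually dependent transitions settle, after a transient, into firings whose long-run period is the maximum cycle mean max(a, c, (b+d)/2).

```python
def event_times(n, a=2, b=4, c=3, d=6):
    """Occurrence times of two transitions t and u whose i-th firings
    obey the max-plus recurrences
        t(i) = max(t(i-1) + a, u(i-1) + b)
        u(i) = max(u(i-1) + c, t(i-1) + d).
    With the defaults the critical cycle is t -> u -> t, so the
    long-run period is p = max(2, 3, (4 + 6) / 2) = 5."""
    t, u = [0], [0]
    for _ in range(n):
        tn = max(t[-1] + a, u[-1] + b)
        un = max(u[-1] + c, t[-1] + d)   # use the *old* t(i-1)
        t.append(tn)
        u.append(un)
    return t, u
```

After the transient, 100 successive firings of either transition span exactly 100·p time units, i.e. the execution is periodic.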

Asynchronous logic in TrueNorth

• Spikes trigger activity…
  ❖ … so they should be asynchronous

• Neurons leak periodically…
  ❖ … so add a periodic event to the system (the clock)

At the external timing “tick”:
• Update each neuron, delivering spikes plus any local state update
• If the neuron spikes, send an output spike to its destinations, with a specified delay
• Spikes travel through the asynchronous network
• The receiver buffers spikes until it is time to deliver them

[Figure: TrueNorth core blocks — Token Control, Scheduler, SRAM, Neuron, Router — implemented in digital, asynchronous logic.]

Asynchronous logic in TrueNorth

• Spike communication network ❖ Links are a shared resource ❖ Links might be contended

• Spike delivery times can differ based on traffic ❖ Different runs of the same network might produce different spike patterns

• Instead, we use a spike scheduler ❖ Guarantees deterministic computation ❖ … and hence, repeatability

Summary

• Asynchronous logic has a long history

• Some key properties exploited in projects ❖ Gap between average and maximum delay for a computation ❖ Automatic data-driven power management ❖ Elastic pipelining ❖ Robustness to timing uncertainty/variation ❖ Reduced EMI and noise ❖ Continuous-time information representation

Acknowledgments

• The “async” community
• Students:
  ❖ Ned Bingham, Wenmian Hua, Sean Ogden, Praful Purohit, Tayyar Rzayev, Nitish Srivastava
  ❖ John Teifel, Clinton Kelly, Virantha Ekanayake, Song Peng, David Biermann, David Fang, Chris LaFrieda, Filipp Akopyan, Basit Riaz Sheikh, Benjamin Tang, Nabil Imam, Sandra Jackson, Carlos Otero, Benjamin Hill, Stephen Longfield, Jonathan Tse, Robert Karmazin
• Collaborators:
  ❖ General: David Albonesi, Sunil Bhave, Francois Guimbretiere, Amit Lal, Yoram Moses, Ivan Sutherland, Yannis Tsividis
  ❖ Neuromorphic: John Arthur, Kwabena Boahen, Tobi Delbruck, Chris Eliasmith, Giacomo Indiveri, Paul Merolla, Dharmendra Modha
• Support: AFRL, DARPA, IARPA, NSF, ONR, IBM, Lockheed, Qualcomm, Samsung, SRC