Neuromorphic Computing, Sensing, and Communication in the Post Moore Era

Siddharth Joshi
Intelligent Microsystems Laboratory, 326B Cushing, University of Notre Dame
[email protected]

What is possible?

A microwasp with 7,500 neurons, of which roughly 3,000 process input from the eyes. A single-celled protozoan, about 200 μm across.

How can we get there?

How can we build artificial systems that are robust, intelligent, and energy efficient?

Codesign hardware and algorithms

§ New sensing and acquisition
§ Rethink how we build computers
§ Create new, efficient algorithms that mimic intelligence

How can we get there?

How can we build artificial systems that are robust, intelligent, and energy efficient?

Codesign hardware and algorithms

§ New sensing and signal acquisition
§ Rethink how we build computers
§ Develop hardware-aware algorithms for machine intelligence

Adaptive Low-Power Sensory System

Digital Sensory Processing

Signal chain: Sensor → AFE (analog) → ADC → DSP (digital) → Outputs, with an adaptation feedback path.

• General-purpose
• Precision and dynamic range are limited by the ADC
• Overengineered: at 12 b in 45 nm, E(ADC) ≈ 200 pJ/sample while E(DSP) ≈ 2 pJ/MAC
• ADC energy (E) roughly doubles for every additional bit

ADC Energy Trends

[Figure: historical energy per sample (pJ) vs. ENOB (bits) across published ADCs — ADC energy is exponential with resolution.]

MSP Energy Bounds

[Figure: system energy (J) vs. system dynamic range (dB) for signal acquisition (y = Wx) followed by feature extraction and classification. Curves compare a plain ADC against an ADC preceded by analog MVM (aMVM) with 20, 40, and 60 dB of processing gain; employing MSP can lead to ~1000× energy efficiency.]
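A back-of-envelope model makes the trade concrete. This is an illustrative sketch, not measured chip data: the constant `e0_pj` and the use of the standard ENOB/dynamic-range relation are assumptions; the only claims taken from the slides are that ADC energy doubles per bit and that analog pre-processing gain relaxes the ADC.

```python
# Hedged energy model: ADC energy roughly doubles per bit, E ~ E0 * 2**ENOB,
# and G dB of analog processing gain relaxes the required ENOB by G/6.02 bits.
def adc_energy_per_sample(enob_bits, e0_pj=0.05):
    """Energy per sample in pJ; e0_pj is an assumed per-level constant."""
    return e0_pj * 2 ** enob_bits

def enob_for_dynamic_range(dr_db):
    """Standard relation: ENOB = (DR - 1.76) / 6.02."""
    return (dr_db - 1.76) / 6.02

def system_energy(dr_db, processing_gain_db=0.0):
    """Analog pre-processing with G dB of gain lowers the ADC's burden."""
    enob = enob_for_dynamic_range(dr_db - processing_gain_db)
    return adc_energy_per_sample(enob)

baseline = system_energy(80)       # plain ADC covering 80 dB dynamic range
with_msp = system_energy(80, 60)   # ADC preceded by 60 dB of aMVM gain
print(baseline / with_msp)         # ~2**(60/6.02), i.e. roughly 1000x
```

With 60 dB of processing gain, the model reproduces the ~1000× system-energy saving quoted on the slide.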

Continuous-time, capacitive, mixed-signal processing (MSP) can implement linear transforms with extreme energy efficiency.

Joshi et al. – CICC 2017

Dot Product Unit
The processor is composed of arrayed Dot Product Units (DPUs).

[Figure: chip organized as eight DPU channels (DPU 1 … DPU 8), each holding an 8-element weight row W_i,1 … W_i,8, forming an 8×8 weight array; a single DPU channel is highlighted.]
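A behavioral sketch of the 8×8 DPU array helps fix the computation: each channel j forms y_j = A_j · Σ_i x_i · W_i,j. The 14-bit weight quantization below stands in for the nested-thermometer multiplying DAC; all constants are illustrative, not circuit-accurate.

```python
import numpy as np

# Behavioral model of the 8x8 dot-product-unit (DPU) array.
def quantize(w, bits=14):
    """Quantize weights to the DAC's resolution (assumed mid-tread, [-1, 1])."""
    levels = 2 ** (bits - 1) - 1
    return np.round(np.clip(w, -1, 1) * levels) / levels

def dpu_array(x, W, A):
    """x: 8 inputs, W: 8x8 weights, A: per-channel VGA gain A_j."""
    return A * (x @ quantize(W))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 8)
W = rng.uniform(-1, 1, (8, 8))
y = dpu_array(x, W, np.ones(8))   # eight channel outputs y_1 ... y_8
```

At 14 bits the quantization error of each output stays well below 0.1% of full scale, consistent with treating the array as an effectively linear transform.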

Joshi et al. – ISSCC 2017

Dot Product Unit
Each DPU channel j holds weights W_1,j … W_8,j (14 b), realized as nested-thermometer multiplying DACs, with parallel accumulation at the input of a variable-gain amplifier of gain A_j:

    y_j = A_j · Σ_{i=1}^{8} x_i · W_i,j

Joshi et al. – ISSCC 2017

MIMO Communication

[Figure: constellation (I vs. Q, in volts) of the mixture of a 16-QAM and a 64-QAM signal — two quadrature amplitude modulation (QAM) streams with indistinguishable spectra.]

Joshi et al. – ISSCC 2017

MIMO Communication

Spatial Filtering separates the mixture

[Figure: recovered constellations — the 64-QAM stream resolved with 2.9% RMS EVM and the 16-QAM stream resolved with 3.1% RMS EVM.]
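The separation itself is a linear unmixing. The sketch below mixes a 16-QAM and a 64-QAM stream through an assumed 2×2 channel and separates them with a least-squares unmixing matrix fit from known training symbols; the chip adapts its weights on-line rather than solving in one shot, so this is only a functional stand-in, and the channel, noise level, and training length are all assumptions.

```python
import numpy as np

# Spatial filtering demo: two QAM streams with overlapping spectra, mixed by
# a 2x2 channel, separated by a linear unmixer W so that S_hat = W @ X ~ S.
rng = np.random.default_rng(1)

def qam(levels_per_axis, n):
    """Unit-power square QAM symbols (4 -> 16-QAM, 8 -> 64-QAM)."""
    pts = np.arange(-(levels_per_axis - 1), levels_per_axis, 2)
    s = rng.choice(pts, n) + 1j * rng.choice(pts, n)
    return s / np.sqrt(np.mean(np.abs(s) ** 2))

n = 2000
S = np.vstack([qam(4, n), qam(8, n)])               # 16-QAM and 64-QAM
H = np.array([[1.0, 0.6 + 0.2j], [0.5 - 0.3j, 1.0]])  # assumed mixing channel
noise = 0.01 * (rng.standard_normal((2, n)) + 1j * rng.standard_normal((2, n)))
X = H @ S + noise

W = S[:, :200] @ np.linalg.pinv(X[:, :200])         # train on 200 symbols
S_hat = W @ X
evm = np.sqrt(np.mean(np.abs(S_hat - S) ** 2, axis=1)) * 100  # % RMS EVM
```

With this mild channel and noise floor, both streams recover with EVM in the low single-digit percent range, the same regime as the measured 2.9%/3.1%.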

Joshi et al. – ISSCC 2017

Experimental Setup

Experimental setup: two sinusoids in a non-line-of-sight environment with multipath.

Task: maximize P1 / P2.

[Figure: over-the-air test setup — two signal generators (E4438C, f_RF = 2.4 GHz) transmit FM and ASK signals (f_MOD = 100 kHz, symbol rate 0.2 Mbps, depth of modulation 0.25) past metallic obstacles obstructing the line of sight, with reflecting paths, to a four-antenna (RX0–RX3) RF front-end board and the HRM-MAC IC (f_LO_RF = 2.4 GHz, f_LO = 16.5 MHz). Measured spectra show the signal-to-interferer ratio improving from −24 dB at the input to 41 dB at baseband over ~25 adaptation iterations.]

[TCAS-I 2018]

[Figure: spectra (0.4–0.9 MHz) of the received mixture of ASK and FM with interleaved spectra, the recovered ASK (FM suppressed by 38 dB), and the recovered FM (ASK suppressed by 38 dB).]

Over-the-Air Measurements

Experimental setup: interleaved FM and ASK in a non-line-of-sight environment with multipath (f_RF = 2.4 GHz, symbol rate 200 kbps, depth of modulation 0.25, f_MOD = 100 kHz, f_offset = 17 / 17.15 MHz), with metallic obstacles in the line of sight and reflecting paths.
Tasks: maximize P_FM / P_ASK, and maximize P_ASK / P_FM.

[Figure: baseband spectra — when maximizing the FM signal, the ASK tones are suppressed by 38 dB; when maximizing the ASK signal, the FM tones are suppressed by 38 dB.]

Baseband measurements

[JSSC 2016]

Performance Summary

                                        Zhang et al.   Lee et al.    Buhler et al.  Kim et al.   This work
                                        ISSCC 2015     ISSCC 2016    VLSI 2016      VLSI 2015
Application                             Linear spatial Feature       Sensor         Feature      Spatial
                                        filtering      extraction    classifier     extraction   filtering
CMOS technology (nm)                    180            40            65             65           65
Number of channels                      1^a            1^a           16^a           8            8
Area per MAC (mm^2)                     0.106          0.012         0.0594         0.045        0.021
Power (μW)                              0.663          228           3856           1300         91
Signal bandwidth (kHz)                  10             10^6          100            1500         350
Power/bandwidth (μW/MHz)                66.3           0.228         38560          866          260
Effective analog multiplicand (bit)     4              3             14^b           8^c          14
MAC efficiency (pJ/MAC)                 16^d           0.12          30000^d        6            2
MAC efficiency per multiplicand level
(fJ/MAC/level)                          1000           15            1830           23.4         0.12

^a Serial matrix–vector product. ^b Oversampled, 1 bit per sample. ^c Reported 48 dB signal separation. ^d No analog accumulate.

This work combines parallelism (8 channels), high dynamic range (14-bit effective multiplicand), and energy efficiency (0.12 fJ/MAC/level).

How can we get there?

How can we build artificial systems that are robust, intelligent, and energy efficient?

Codesign hardware and algorithms

§ New sensing and signal acquisition
§ Rethink how we build computers
§ Develop hardware-aware algorithms for machine intelligence

Compute-in-Memory

[Figure: a conventional architecture shuttles data between separate memory blocks and computing blocks (DSP); digital/analog processing in memory merges the two.]

Energy Costs for Memory vs. Compute
In 45 nm CMOS at 0.9 V, compute-in-memory can bring the combined energy of memory access and computation down to ~50 fJ/Op.

“Computing’s Energy Problem (and what we can do about it)”, M. Horowitz, ISSCC 2014

Memory access energy ≫ Computation energy!!
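The disparity is easy to quantify. The figures below are approximate, commonly quoted order-of-magnitude values from the cited Horowitz ISSCC 2014 talk (45 nm, 0.9 V), not exact numbers from these slides; the "MAC" cost is a crude sum of one multiply and one add.

```python
# Approximate 45 nm, 0.9 V per-operation energies (order-of-magnitude values
# commonly quoted from Horowitz, ISSCC 2014):
energy_pj = {
    "int32_add": 0.1,
    "fp32_mult": 3.7,
    "sram_32b_read_8KB": 5.0,
    "dram_32b_read": 640.0,
}

mac = energy_pj["fp32_mult"] + energy_pj["int32_add"]   # crude 1-MAC cost
fetch_two_operands = 2 * energy_pj["dram_32b_read"]      # both from DRAM
print(fetch_two_operands / mac)  # >100x: moving the data dwarfs computing on it
```

This is exactly the ratio compute-in-memory attacks: if the operands never leave the array, the dominant term disappears.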

[ISSCC 2018, paper 31.1: "Conv-RAM: An Energy-Efficient SRAM with Embedded Convolution Computation for Low-Power CNN-Based Machine Learning Applications"]

Compute-in-Resistive Memory

[Figure: resistive crossbar — drivers apply voltages V_1 … V_m to the rows, conductances G_11 … G_mn sit at the crosspoints, and ADCs/sense-amps read the column currents I_1 … I_n.]

    (I_1 ⋯ I_n) = (V_1 ⋯ V_m) · [G],   i.e.   I_j = Σ_i V_i · G_ij

This matrix–vector multiply is the basis of neural network inference and training:
§ Reduced data movement → low energy consumption
§ High density and parallelism
§ High throughput
§ Larger models

Lack of Dataflow Versatility in CIM Architectures
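The crossbar matrix–vector multiply I_j = Σ_i V_i G_ij can be modeled in a few lines. Because physical conductances are non-negative, signed weights are commonly encoded as a differential pair of columns, W = G⁺ − G⁻; that encoding is a standard scheme assumed here, not a detail taken from these slides.

```python
import numpy as np

# Idealized resistive-crossbar MVM: Ohm's law plus Kirchhoff's current law
# sums I_j = sum_i V_i * G[i, j] on each column; a signed weight uses a
# differential conductance pair, W = Gp - Gm.
def crossbar_mvm(V, Gp, Gm):
    return V @ Gp - V @ Gm          # two column reads per signed output

rng = np.random.default_rng(2)
W = rng.uniform(-1, 1, (4, 3))      # target signed weights
Gmax = 1.0                           # assumed full-scale conductance
Gp = np.clip(W, 0, None) * Gmax     # positive part on one column
Gm = np.clip(-W, 0, None) * Gmax    # negative part on its partner
V = rng.uniform(0, 0.3, 4)          # read voltages on the rows
assert np.allclose(crossbar_mvm(V, Gp, Gm), V @ W)
```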

Feed-forward networks (e.g. MLP/CNN) need only the forward MVM; forward-and-backward models (e.g. RBM/auto-encoder) also need the transposed MVM (Wᵀ) for generation and backprop; recurrent models (e.g. RNN/LSTM) feed outputs X(t) back as inputs X(t+1).

[Figure: in conventional CIM implementations each dataflow (INF, GEN, REC) needs its own synapse array, drivers, registers, and ADC/neuron blocks — supporting forward and backward passes requires twice the neurons, and recurrence requires porting data between arrays.]

RRAM-CIM ASIC with Dataflow Flexibility
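A toy model of the dataflow flexibility: one stored weight array serving the forward, transposed, and recurrent MVMs. In hardware the three modes drive different line groups of the transposable array; here they are just matrix operations, and the class/method names are illustrative.

```python
import numpy as np

class TransposableCore:
    """One stored W serves all three dataflows; no second weight copy needed."""
    def __init__(self, W):
        self.W = W
    def forward_mvm(self, x):              # drive one side of the array
        return x @ self.W
    def backward_mvm(self, h):             # drive the other side: W.T for free
        return h @ self.W.T
    def recurrent_mvm(self, x, steps=3):   # feed outputs back as inputs
        for _ in range(steps):
            x = np.tanh(x @ self.W)        # assumed tanh neuron
        return x

core = TransposableCore(np.eye(4) * 0.5)   # toy symmetric weights
```

The point of the transposable array is that `backward_mvm` costs nothing extra: the same devices are simply read in the other direction.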

130 nm CMOS neurons & peripherals + 256×256 HfOx/TEL RRAM weights.

[Figure: sub-core (j, k) — bit lines BL 16j … 16j+15 and word lines WL 16j … 16j+15 cross source lines SL 16k … 16k+15, sharing a single neuron (j, k) that is reused across the INF, GEN, and REC modes.]

RRAM-CIM ASIC with Dataflow Flexibility

130 nm CMOS neurons & peripherals + 256×256 HfOx/TEL RRAM weights.

[Figure: chip micrograph (1.34 mm × 1.34 mm) — a transposable neurosynaptic core with 256 CMOS neurons and 65K RRAMs in 16×16 sub-arrays; BL/WL drivers and registers (REG_BL[0:255], WL[0:255], BL[0:255]); SL drivers with power gating and 256-bit SL registers plus an LFSR PRNG (REG_SL[0:255], SL[0:255]); and an SPI interface. GEN: forward MVM; INF: backward MVM; REC: recurrent MVM.]

Power Breakdown (@ 0.32 mW, 23 GMACS)

74 TMACS/W at 0.32 mW, 23 GMACS:
• MVM input pulses (0 V ↔ 3 V): 54%
• Neuron output & all other dynamic digital (1.8 V VDD): 26%
• WL switching: 15%
• Static (neurons + biasing): 5%

How can we get there?

How can we build artificial systems that are robust, intelligent, and energy efficient?

Codesign hardware and algorithms

§ New sensing and signal acquisition
§ Rethink how we build computers
§ Develop hardware-aware algorithms for machine intelligence

Keyword Spotting

[Figure: a windowed audio waveform drives an LSTM cell whose state produces logits over 12 class labels ("yes", "no", "up", "down", …).]

This workload is not amenable to traditional compute-in-memory.

Mapping Recurrent Networks
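A minimal LSTM step makes the mapping problem concrete: the two matrix–vector products (input-to-gate and recurrent) are the operations that would be placed on the transposable RRAM array, while the gating nonlinearities stay in the peripheral neurons. The feature dimension (40) and hidden size (64) below are placeholders, not the deployed network's sizes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, Wx, Wh, b):
    """One LSTM step; the two MVMs are the crossbar-mappable operations."""
    z = x @ Wx + h @ Wh + b                  # input MVM + recurrent MVM
    i, f, g, o = np.split(z, 4)              # gate pre-activations
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

d_in, d_h = 40, 64                           # assumed feature / hidden sizes
rng = np.random.default_rng(3)
Wx = rng.standard_normal((d_in, 4 * d_h)) * 0.1   # placeholder weights
Wh = rng.standard_normal((d_h, 4 * d_h)) * 0.1
h = c = np.zeros(d_h)
h, c = lstm_step(rng.standard_normal(d_in), h, c, Wx, Wh, np.zeros(4 * d_h))
```

The recurrent product `h @ Wh` is what a feed-forward-only CIM array cannot serve without shuttling `h` off-chip each timestep, which is exactly what the transposable core avoids.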

[Figure: the keyword-spotting LSTM mapped onto the 1.34 mm × 1.34 mm neurosynaptic core (16×16 RRAM sub-arrays, BL/WL drivers and registers, SL drivers, SL registers & LFSR).]

Competitive performance

Energy-efficiency improvement without a major sacrifice in accuracy.

No Free Lunch: Device Imperfections

a) Read variation between devices, assumed to follow a Gaussian distribution
b) Resistance drift over time
c) Stuck-at faults (forming/manufacturing errors)
d) Random telegraph noise (RTN), with RTN ratio ΔR / (R + ΔR)

[Figure: example inference accuracies under each imperfection — 80.2% under (a) and (b), dropping to 58.9% under (c) and (d).]
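A Monte-Carlo sketch shows how two of these imperfections propagate into MVM error: Gaussian read variation, as in (a), and stuck-at faults, as in (c), applied to crossbar conductances. The fault rate, sigma, and array size are illustrative assumptions, not the values behind the accuracy numbers above.

```python
import numpy as np

rng = np.random.default_rng(4)

def corrupt(G, read_sigma=0.05, stuck_rate=0.01):
    """Apply (a) Gaussian read variation and (c) stuck-at-HRS faults."""
    Gn = G * (1 + read_sigma * rng.standard_normal(G.shape))
    stuck = rng.random(G.shape) < stuck_rate
    Gn[stuck] = 0.0                     # stuck cells read as high resistance
    return Gn

G = rng.uniform(0.1, 1.0, (64, 64))     # ideal programmed conductances
V = rng.uniform(0, 0.3, 64)             # read voltages
ideal = V @ G
noisy = V @ corrupt(G)
rel_err = np.linalg.norm(noisy - ideal) / np.linalg.norm(ideal)
```

Column-wise summation averages out the zero-mean read variation, which is why (a) degrades accuracy far less than the systematic, one-sided error of stuck-at faults.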

RRAM Analog Programming

§ RRAM conductance is programmed within a desired range using write-verify
§ Conductance relaxation is observed after programming

[Figure: write-verify waveform — alternating reset pulses and set pulse trains on V_WL steer the cell resistance (10^4–10^7 Ω) into an acceptance range within ~10 pulses.]
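A minimal write-verify loop, in pseudocode-like Python: pulse the cell, read it back, and stop once the conductance lands inside the acceptance range. The pulse-response model (a noisy proportional update) is a stand-in for real device physics; only the pulse/verify structure comes from the slide.

```python
import numpy as np

rng = np.random.default_rng(5)

def write_verify(g_target, tol=0.02, max_pulses=100):
    """Pulse until conductance g is within tol of g_target (assumed model)."""
    g = 0.5 * g_target                                  # arbitrary start state
    for pulse in range(1, max_pulses + 1):
        if abs(g - g_target) <= tol * g_target:         # verify: in range
            return g, pulse
        step = 0.2 * (g_target - g)                     # set or reset pulse
        g += step * (1 + 0.1 * rng.standard_normal())   # noisy device response
    return g, max_pulses

g, pulses = write_verify(1.0)
```

Note that hitting the acceptance range is not the end of the story: the conductance-relaxation effect mentioned above means the cell can drift back out of range after programming.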

X. Zheng et al., IEDM 2018

Crossbar Mapping Methods

Double Element (DE) Bias Column (BC)

For the purpose of this example, we can assume that a high-enough resistance represents a 0.

Adjacent Connection Matrix
Instead of a constant reference, use each column's neighbor as the reference. This leads to a more robust system.

Reexamine Mapping Weights onto CIM Arrays
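The three mappings reduce to how a signed weight is recovered from non-negative conductances. The sketch below is simplified from the slides: bias column (BC) reads every column against one shared reference column, double element (DE) uses a dedicated device pair per weight, and the adjacent connection matrix (ACM) reads each column against its right-hand neighbor; the suffix-sum encoder is one possible construction, labeled as such.

```python
import numpy as np

def bc_decode(G, ref_col):
    return G - ref_col[:, None]            # shared bias column as reference

def de_decode(Gp, Gm):
    return Gp - Gm                         # dedicated pair per weight

def acm_decode(G):
    return G[:, :-1] - G[:, 1:]            # neighbor column as reference

def acm_encode(W):
    """Suffix-sum construction: adjacent differences of G reproduce W."""
    n_in, n_out = W.shape
    G = np.zeros((n_in, n_out + 1))
    for j in range(n_out - 1, -1, -1):     # accumulate from the right
        G[:, j] = G[:, j + 1] + W[:, j]
    return G - G.min()                     # shift so all conductances are >= 0

W = np.array([[0.5, -0.2], [0.1, 0.3]])
assert np.allclose(acm_decode(acm_encode(W)), W)
```

The ACM and BC decoders use the same device count as an unsigned array plus one extra column, which is why their area and read energy match in the evaluation below, while DE doubles the devices.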

Improved training performance; improved inference performance.

Training experiments with limited resolution and non-linearity showed that the ACM consistently improved upon the accuracy of the bias column approach while using the same hardware resources. The largest gain is obtained for non-linear weights, where an effective 2 bits of weight resolution are recovered, leading to an improvement in accuracy of 20%.

B. Effects of Device Variation on Neural Network Inference

Here we evaluate the effects of device-to-device variation on inference accuracy in the context of the three different mapping approaches. We evaluate a VGG-like network trained on CIFAR-10 with different weight precisions. Variation is modeled as a normal distribution with mean zero and sigma values up to 25%. Fig. 8 shows that the bias column approach consistently performs worse than the other two mappings regardless of bit precision. The ACM performs better than the double element for 1- and 3-bit models, and the double element performs better than the ACM for 4- and 6-bit models. The 2- and 5-bit models follow this trend and hence are not presented in Fig. 8. Note that even though the networks trained with different mappings achieved different accuracies, the trends persist when normalized to the achieved accuracies.

[Fig. 8: Effects of variation on the inference accuracy of a VGG-like model trained with different mapping approaches (ACM, DE, BC) and bit precisions (1-, 3-, 4-, 6-bit) on the CIFAR-10 dataset — accuracy (%) vs. sigma of variation (%).]

By expanding Eq. 1 with the periphery matrix of the ACM presented in Fig. 5c, we prove that for any weight matrix M trained with the ACM which represents matrix Y:

    Σ_{i=1}^{N_I} M_{i,0} + Σ_{i=1}^{N_I} M_{i,N_D} + Σ_{i=1}^{N_I} Σ_{j=1}^{N_O} M_{i,j} = Y    (5)

This characteristic acts as a regularizer which enforces tighter restrictions when the weights have lower bit precision, i.e. 1–3 bits, because Eq. 5 will have fewer possible solutions. This in turn strengthens the trained network against variation. However, with higher bit precision, i.e. 4–6 bits, the regularizing factor of the ACM is looser and the redundancy of the double element approach performs better in the face of device variation.

C. System-level Evaluation

Table I shows system-level results for the three approaches. The results are generated using the NeuroSim+ [14] tool. The read energy and latency values are for one epoch of training a multilayer perceptron (MLP) network of two layers with a ReXB-based hardware accelerator. Read energy, area, and read delay values for the bias column and ACM approaches are exactly the same, for there is practically no difference in their hardware resource utilization. The read energy of the double element approach is 7× more than the ACM due to the longer wires for the rows of the ReXB array. The area includes the ReXB arrays and peripherals for training. The double element approach uses 2.3× the area of the ACM, for using twice as many elements and larger peripherals. Furthermore, the double element has a 1.33× higher read delay due to having more columns that need to be multiplexed for using the peripherals.

TABLE I: System-level results of the three mapping approaches for training with a two-layered MLP.

    Mapping            BC      DE       ACM
    Read energy (μJ)   2.402   14.408   2.402
    Area (μm²)         1071    2334     1071
    Read delay (ms)    0.240   0.318    0.240

V. CONCLUSION

We introduced a mapping method, the adjacent connection matrix (ACM), that mitigates the effects of reduced weight resolution and weight-update non-linearity on neural network training while imposing minimal hardware overhead. We demonstrated, both mathematically and by simulation, that training with the ACM has the same capacity as training with the previous mapping approaches and signed networks. Training experiments with limited resolution and non-linearity showed that the ACM consistently improved upon the accuracy of the bias column approach while using the same hardware resources. The largest gain was obtained for non-linear weights, where an effective 2 bits of weight resolution were recovered, leading to an improvement in accuracy of 20%. The ACM is also more resilient to device variation for inference using ReXBs compared to the bias column. Compared to the double element, the ACM can achieve comparable training accuracies and variation resiliency while reducing read-energy consumption by 7× and area by 2.3×.

REFERENCES
[1] S. Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network," in ISCA, 2016.

Performance at Image Reconstruction / Performance at Image Classification

[Figure: network used for evaluation — 1-bit inputs (N = 121, an 11×11 patch), 2-bit hidden units (N = 64), 5-bit weights, and N = 10 output classes.]

Performance Summary

Improvement mainly comes from using voltage-mode sensing → reduces energy consumption from crossbar static current.

(1 TMACS/W = 2 TOPS/W; precision reported as (input, weight, output).)

2× performance improvement.

Computing with Emerging Devices

Resistive memories · Correlated-oxide devices · Beyond-CMOS · Flexible electronics

[Figure: device cross-sections — TiN/TiOx/HfOx resistive memory (~25 nm) and a VO2-gated correlated-oxide transistor (12 nm / 10 nm dimensions).]

Algorithmic research:
§ Learning without negative weights
§ Robustness to device failures
§ Energetically-aware algorithms

Architectural research:
§ Partitioning logical networks to physical arrays
§ Intermittently powered intelligence

Circuits research:
§ Techniques to drive memory arrays
§ Hybrid CMOS/NVM circuits and logic

Research thrusts: neurally inspired machine intelligence; beyond-CMOS and non-Boolean computing; neurally inspired scalable communication and computing; high-dimensional adaptive signal processing; smart sensing & actuation on resource-constrained hardware. Application domains: infrastructure monitoring, healthcare, robotics/automation.

[Figure: crossbar synapse array with integrate-and-fire (I&F) neural units and LFSR pseudo-random number generators; weights are stored as conductance differences w_ij = g⁺_ij − g⁻_ij, driven by signed on/off pulsed input/output events, with learning-phase updates Δw_ij ∝ U_i V_j and inference-phase neuron inputs realized as Σ_j g_ij V_j.]

How can we get there?

How can we build artificial systems that are robust, intelligent, and energy efficient?

Codesign hardware and algorithms

§ New sensing and signal acquisition
§ Rethink how we build computers
§ Create new, efficient algorithms that mimic intelligence

More than one type of device out there

Resistive memories · Correlated-oxide devices · 2D devices · Flexible electronics

[Figure: device cross-sections — TiN/TiOx/HfOx resistive memory (~25 nm) and a VO2-gated correlated-oxide transistor (12 nm / 10 nm dimensions).]

IMT Nano-oscillators

Single IMT nano-oscillator (IMT-NO)

[Figure: a ~200 nm VO2/TiO2 device with Pd/Au contacts, biased from VDD through a transistor and gated by V_gs. The insulator–metal transition (IMT) and metal–insulator transition (MIT) branches intersect the transistor load line, so V_out relaxes back and forth between ~0.5 V and ~2 V, producing a sustained oscillation (waveform shown over 700–780 μs).]

IMT Oscillators for Signal Acquisition
700 720 740 760 780 time ( s) IMT Oscillators for Signal Acquisition

A : 73.8 kHz 180 Linear Fit 1.0 1 160

Voltage(V) 0.5 0 140 C 1 um B2.36: 132 kHz2.38 2.40 Time (ms) 120 1 B 0.0 100 Voltage(V) Max Non-Linear 0 -0.5

C2.36: 172.82.38 kHz 2.40 80 A Error < 0.5% Frequency (kHz) Frequency Time (ms) 1 60 -1.0

Voltage(V) 0 0.8 1.2 1.6 2.36 2.38 2.40 (%) Non-Linearity to due Error Time (ms) Gate Bias Voltage (V) IMT Oscillators for Signal Acquisition

[Figure: three capacitively coupled IMT oscillators (V1, V2, V3) digitizing an input signal. Coupling three oscillators increases SINAD by 5.2 dB over a single oscillator (experiment and SPICE simulation, signal amplitudes 10–400 mV): the oscillator phases provide additional resolution within each clocked count.]

    Metric             IMT-NO ADC      IMT-NO ADC                Ring Oscillator
                       (Experiment)    (Multiphase, Projected)   ADC [1]
    Freq (Hz)          4 kHz           20 MHz                    20 MHz
    ENOB               5.04            5.7                       7.51
    SNR (dB)           39.32           42.8                      52.3
    SINAD (dB)         32.14           36.05                     47.2
    SFDR (dB)          32.36           36.31                     55.7
    FOM (pJ/step)      60.25           0.12                      10.5
    Power              ~120 μW         ~160 μW                   12.6 mW
    Oscillator
    area (mm²)         0.0001          0.0003                    0.0012

But Wait, There's More! Computing With Physics

Applications of the Maximum Independent Set (MIS) problem: coding theory (information encoding and decoding, e.g. data bits d1–d4 protected by parity bits p1–p3), resource allocation (interference co-ordination), and VLSI design (placement and routing).

Coupled Oscillator Chip with Reconfigurable Coupling

[Figure: die photo of the oscillator chip. Each oscillator is a current-starved inverter (header biased by V_BP, footer by V_BN) followed by a Schmitt trigger and output buffer; a switched-capacitor coupling network with an all-to-all coupling MUX connects oscillators O1 … O30 according to an adjacency matrix [A] (A_ij = 1 if there is an edge between nodes i and j, else 0), loaded through a SIPO switch-control register.]
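The ADC metrics in the table are internally consistent: ENOB follows the standard relation ENOB = (SINAD − 1.76)/6.02, which can be checked directly against the reported SINAD values.

```python
# Consistency check of the table's ENOB entries against its SINAD entries
# using the standard relation ENOB = (SINAD - 1.76) / 6.02.
def enob(sinad_db):
    return (sinad_db - 1.76) / 6.02

for sinad, reported in [(32.14, 5.04), (36.05, 5.7), (47.2, 7.51)]:
    assert abs(enob(sinad) - reported) < 0.05
```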

Computing the MIS Solution

[Figure: output waveform and phase plot for a 6-node example graph — the oscillator phase sequence partitions the nodes into independent sets: {1, 4, 6} (size 3, the MIS), {3}, and {2, 5}.]

Fig. 1 | Computing MIS using coupled oscillators. (Top panel) Practical applications of the MIS problem, a computationally hard combinatorial optimization problem; (Middle panel) die photo, oscillator circuit, and schematic of the all-to-all coupling scheme for the coupled oscillator IC implemented; (Bottom panel) experimental time-domain waveform, phase sequence, and the independent sets computed from the coupled oscillator dynamics. The largest independent set approximates the MIS.

Computing the Maximum Independent Set using Coupled Oscillators

The MIS of a graph G(V, E) (V: vertices; E: edges) is defined as the largest subset of nodes having no edges among them [33]. The MIS problem is an archetypal combinatorial optimization problem with extensive applications in coding theory [34], resource allocation [35], molecular biology [36], and VLSI design [37] (Fig. 1).

Fig. 2 | Computational characteristics of the coupled oscillators. (a) MIS solution obtained from the coupled oscillators and its comparison with the optimal solution (computed using the B-K algorithm). (b) Effect of graph size (nodes) and connectivity on the quality of the computed MIS solution, expressed as a deviation (δ) from the optimal solution. (Inset) δ as a function of the number of partitions obtained from the oscillator phases, for different η. Computing the optimal MIS solution is most challenging in sparse graphs having a larger number of partitions.

Larger graph sizes reduce the average phase separation among the oscillators and make the optimal phase ordering increasingly susceptible to noise and non-idealities. Furthermore, this effect is more prominent in sparser graphs (low η), where the average size of the independent set is expected to be larger (see supplementary section S3).

The simulations reveal that for larger graphs (specifically those with high sparsity), the oscillators yield a lower-quality sub-optimal (yet correct) solution (Fig. 4b). We observe empirically that the phase sequence tends to miss a few nodes of the optimal MIS solution. We therefore implement a simple scheme of expanding the largest observed independent set from the sequence to achieve a significantly improved MIS solution. The proposed scheme and the oscillator results so obtained are shown in supplementary section S5. As revealed in Fig. 4(a)(b), we observe near-optimal or optimal solutions for the random graphs, and optimal solutions for all except one of the DIMACS graphs analyzed here.

    Graph      Nodes   Edges   MIS from oscillators     MIS solution
                               (w/ post-processing)     from DIMACS
    1tc_32     32      68      12                       12
    1dc_64     64      543     10                       10
    1tc_64     64      192     20                       20
    1et_64     64      264     18                       18
    2dc_128    128     5173    5                        5
    1dc_128    128     1471    16                       16
    1et_128    128     672     28                       28
    1zc_128    128     2240    18                       18
    2dc_256    256     17183   6                        7

Fig. 4 | Scalability of the coupled oscillator approach. (a) Bubble plots comparing the MIS solution obtained from the coupled oscillators (post expansion step) relative to the optimal solution (from the B-K algorithm) for 64-, 128-, and 160-node graphs at average connectivities η = 0.2, 0.4, 0.6, 0.8. (b) Graph instances from the DIMACS implementation challenge solved using coupled oscillators; the oscillators compute the optimal MIS solution in all except one graph.

Next, we focus on the temporal dynamics of the coupled oscillators and quantify the number of cycles (N_C) required by the system to settle to the desired phase sequence for the MIS solution. This would be a critical consideration in optimizing the system throughput and the performance of a practical coupled oscillator-based computing platform. Fig. 3a shows the evolution of N_C as a function of V and η; N_C is averaged over the three graph instances measured for each (V, η). N_C exhibits a similar trend with V and η as that observed for the quality of the MIS solution: the coupled oscillators require more time to converge to the (near-)optimal solution for sparser graphs than for graphs with higher connectivity.
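The expansion step described above can be sketched in a few lines: take the largest independent set read off the oscillator phases and greedily add any node with no neighbor in the set (the exact criteria are in the paper's supplement S5). The 6-node graph below is an illustrative example consistent with Fig. 1's independent sets, not necessarily the exact measured instance.

```python
# Greedy expansion of an independent set, a minimal version of the
# post-processing scheme applied to the oscillator phase sequence.
def expand_independent_set(adj, seed_set):
    """adj: dict node -> set of neighbors; seed_set: iterable of nodes."""
    mis = set(seed_set)
    for v in adj:                          # scan nodes in a fixed order
        if v not in mis and not (adj[v] & mis):
            mis.add(v)                     # v has no neighbor in the set
    return mis

# Example 6-node graph where {1, 4, 6} is the MIS and {3}, {2, 5} are ISs.
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 5}, 4: {2, 5}, 5: {3, 4, 6}, 6: {5}}
print(sorted(expand_independent_set(adj, {1})))   # → [1, 4, 6]
```

Because the expansion only ever adds non-adjacent nodes, the result is always a valid independent set at least as large as the seed, which is why it can only improve the oscillator solution.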

Computing with Emerging Devices

Resistive memories · Correlated-oxide devices · 2D devices · Flexible electronics

[Figure: device cross-sections — TiN/TiOx/HfOx resistive memory (~25 nm) and a VO2-gated correlated-oxide transistor (12 nm / 10 nm dimensions).]

Opportunities:
§ On-chip magnetics
§ New computational primitives
§ Continuous-time computation
§ Intermittent computation

Challenges:
§ Device variability
§ Device stochasticity
§ Scale
§ Integration

Applications:
§ Smart dust
§ Smart closed-loop implants
§ Computing directly with probabilities
§ Realtime learning and inference
§ Efficient algorithms for swarming
§ Optimization accelerators

Computing with Emerging Devices

Resistive memories · Correlated-oxide devices · 2D devices · Flexible electronics

There's an entire world of possibilities out there!

Algorithmic research:
§ Learning without negative weights
§ Robustness to device failures
§ Energetically-aware algorithms

Architectural research:
§ Partitioning logical networks to physical arrays
§ Intermittently powered intelligence

Circuits research:
§ Techniques to drive memory arrays
§ Hybrid CMOS/NVM circuits and logic

Research thrusts: neurally inspired machine intelligence; beyond-CMOS and non-Boolean computing; neurally inspired scalable communication and computing; high-dimensional adaptive signal processing; smart sensing & actuation on resource-constrained hardware. Application domains: infrastructure monitoring, healthcare, robotics/automation.