Neuromorphic Computing, Sensing, and Communication in the Post Moore Era
Siddharth Joshi
Intelligent Microsystems Laboratory
326B Cushing, University of Notre Dame

What is possible?
• Microwasp with 7500 neurons, of which 3000 process input from the eyes.
• Single-celled protozoa, 200 μm.

How can we get there?
How can we build artificial systems that are robust, intelligent, and energy efficient?
Codesign hardware and algorithms
• New sensing and signal acquisition
• Rethink how we build computers
• Create new efficient algorithms that mimic intelligence

How can we get there?
How can we build artificial systems that are robust, intelligent, and energy efficient?
Codesign hardware and algorithms
• New sensing and signal acquisition
• Rethink how we build computers
• Develop hardware-aware algorithms for machine intelligence

Adaptive Low-Power Sensory System
Digital Sensory Processing
[Figure: signal chain Sensor, AFE, ADC, DSP, Outputs, with an Adaptation path; analog and digital domains marked.]
• General-purpose
• Precision and dynamic range are limited by ADC
• Overengineered
@ 12b, 45nm: E(ADC) ~ 200 pJ/sample
@ 12b, 45nm: E(DSP) ~ 2 pJ/MAC
2x energy (E) for every additional ADC bit

ADC Energy Trends
Historical Energy Trends for ADC
[Figure: ADC energy per sample (pJ/sample) versus ENOB (bits).]
ADC energy is exponential with resolution.

MSP Energy Bounds
[Figure: system energy (J), 10^-20 to 10^-10, versus system dynamic range (dB), 0 to 100. Curves: ADC; ADC+aMVM with processing gain 20 dB, 40 dB, and 60 dB. Inset: input signal x, linear transform W, processed signal y; signal acquisition, feature extraction, classification.]
Employing MSP can lead to 1000X higher energy efficiency.
Continuous-time, capacitive, mixed-signal processing (MSP) can implement linear transforms with extreme energy efficiency.
Joshi et al – CICC 2017

Dot Product Unit
Processor is composed of arrayed Dot Product Units (DPUs).
[Figure: chip with eight DPUs, DPU 1 to DPU 8; each single DPU channel holds eight weights, giving an 8 × 8 array of weights W1 to W8, each indexed 1 to 8.]
Joshi et al – ISSCC 2017
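The energy figures quoted for the digital sensory pipeline can be put side by side with a few lines of arithmetic. The sketch below is illustrative only: it takes the 12b / 45nm anchor points from the slide (about 200 pJ per ADC sample, about 2 pJ per digital MAC) and assumes the "2x per additional bit" rule holds exactly at every resolution, which is an idealization. The function names are hypothetical.

```python
# Anchor values quoted on the slide for 12-bit conversion in 45 nm CMOS.
ADC_ENERGY_12B_PJ = 200.0      # energy per ADC sample at 12 bits
DSP_ENERGY_PJ_PER_MAC = 2.0    # energy per digital multiply-accumulate


def adc_energy_pj(bits: int) -> float:
    """Energy per sample, assuming exact doubling for every extra bit."""
    return ADC_ENERGY_12B_PJ * 2.0 ** (bits - 12)


def macs_per_sample_at_parity(bits: int) -> float:
    """Number of digital MACs that cost as much as one ADC sample."""
    return adc_energy_pj(bits) / DSP_ENERGY_PJ_PER_MAC


if __name__ == "__main__":
    for bits in (8, 10, 12, 14):
        print(f"{bits:2d} b: {adc_energy_pj(bits):7.1f} pJ/sample, "
              f"equal to {macs_per_sample_at_parity(bits):6.1f} MACs")
```

Under these assumptions one 12-bit conversion costs as much as 100 digital MACs, which is the imbalance that motivates moving the linear transform ahead of the converter.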
Dot Product Unit
[Figure: DPU channel j with weights W1,j to W8,j, gain Aj, and output yj; two paths labeled 14b.]
• Parallel accumulate at VGA input.
• Nested thermometer multiplying DAC.
$y_j = A_j \sum_{i=1}^{8} x_i W_{i,j}$
Joshi et al – ISSCC 2017

MIMO Communication
Constellation of the mixture 16-QAM + 64-QAM
[Figure: two I/Q constellation plots, Q (V) versus I (V), axes from -0.5 to 0.5, labeled 64-QAM and 16-QAM.]
Mixture of quadrature amplitude modulation (QAM) signals with indistinguishable spectra.
Joshi et al – ISSCC 2017

MIMO Communication
Spatial filtering separates the mixture.
[Figure: constellations after separation.]
64-QAM resolved | 16-QAM resolved
RMS EVM 2.9% | RMS EVM 3.1%
Joshi et al – ISSCC 2017

Experimental Setup
Experimental setup: two sinusoids, non-line-of-sight environment with multipath.
Task: maximize P1 / P2.
[Figure: test setup with transmitters TX0 and TX1 (signal generators E4438C, FM and ASK), reflecting paths, metallic objects obstructing line of sight, receive antennae RX0–RX3, RF front-end board with MIMO RX, and the HRM-MAC IC as DUT; carrier at 2.4 GHz. Panels: measured mixture, isolated signal, baseband signal, and baseband interferer spectra (amplitude in dBm versus frequency in kHz, 0 to 400), and signal-to-interferer ratio (dB) versus iteration (0 to 25).]
• Signal-to-interferer ratio at baseband input: -24 dB
• Signal-to-interferer ratio increased to 41 dB
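The dot-product relation $y_j = A_j \sum_i x_i W_{i,j}$ is what performs the spatial filtering: each DPU output is one weighted combination of the receive channels. The sketch below is a simplified illustration, not the authors' method. It assumes a known, real-valued mixing matrix `H` (hypothetical values) from two sources to four receivers, and it swaps in zero-forcing (pseudo-inverse) weights, whereas the slides show the signal-to-interferer ratio improving over iterations and do not say how the weights are found. Weights are rounded to a signed 14-bit grid to mirror the 14b multiplicand.

```python
import math

WEIGHT_BITS = 14  # multiplicand resolution quoted on the slides

# Hypothetical mixing matrix: rows are receivers RX0..RX3, columns are sources.
H = [[1.0, 0.9], [0.8, -0.5], [0.3, 0.7], [-0.6, 0.4]]


def dpu(x, w, gain=1.0):
    """One dot product unit: y_j = A_j * sum_i x_i * W_ij."""
    return gain * sum(xi * wi for xi, wi in zip(x, w))


def quantize(weights, bits=WEIGHT_BITS):
    """Round weights to a signed uniform grid (assumed weight coding)."""
    full_scale = max(abs(w) for w in weights)
    lsb = full_scale / (2 ** (bits - 1) - 1)
    return [round(w / lsb) * lsb for w in weights]


def zero_forcing_weights(h):
    """Rows of the pseudo-inverse of a tall N x 2 real matrix h."""
    a = sum(r[0] * r[0] for r in h)
    b = sum(r[0] * r[1] for r in h)
    d = sum(r[1] * r[1] for r in h)
    det = a * d - b * b
    w0 = [(d * r[0] - b * r[1]) / det for r in h]  # passes source 0
    w1 = [(a * r[1] - b * r[0]) / det for r in h]  # passes source 1
    return w0, w1


def sir_db(signal_gain, interferer_gain):
    """Signal-to-interferer ratio in dB from the two amplitude gains."""
    if interferer_gain == 0:
        return math.inf
    return 20.0 * math.log10(abs(signal_gain) / abs(interferer_gain))


def separate_source_0(h=H):
    """Return (SIR at RX0 alone, SIR at the DPU output) for source 0."""
    w0, _ = zero_forcing_weights(h)
    w0q = quantize(w0)
    col0 = [r[0] for r in h]  # how source 0 reaches each receiver
    col1 = [r[1] for r in h]  # how source 1 (the interferer) reaches them
    before = sir_db(h[0][0], h[0][1])
    after = sir_db(dpu(col0, w0q), dpu(col1, w0q))
    return before, after


if __name__ == "__main__":
    before, after = separate_source_0()
    print(f"SIR at RX0: {before:.1f} dB, after spatial filter: {after:.1f} dB")
```

In this idealized model the residual interference comes only from weight rounding, so the achievable separation is set by the multiplicand resolution; a real receiver is also limited by noise, mismatch, and channel estimation error, none of which are modeled here.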
Setup labels (continued): symbol rate = 0.2 Mbps; fLO = 16.5 MHz; depth of modulation = 0.25; receivers RX0–RX3. [TCAS-I 2018]

[Figure: spectra, amplitude (dBm) versus frequency (MHz), 0.4 to 0.9 MHz. Panels: received mixture with interleaved spectra; recovered ASK, with FM suppressed by -38 dB; recovered FM, with ASK suppressed by -38 dB.]

Over-the-Air Measurements
Experimental setup: interleaved FM and ASK in non-line-of-sight environment with multipath.
Tasks:
• Maximize PFM / PASK
• Maximize PASK / PFM
[Figure: setup with reflecting paths and metallic obstacles, FM in LOS; labels: symbol rate = 200 kbps, depth of modulation = 0.25, frf = 2.4 GHz, foffset = 17 MHz and 17.15 MHz, fMOD = 100 kHz. Panels: baseband signals and baseband measurements, amplitude (dBm) versus frequency (kHz); FM tones with ASK suppressed -38 dB; ASK tones, suppressed -38 dB.]
[JSSC 2016]

Performance Summary
Metric | Zhang et al., ISSCC 2015 | Lee et al., ISSCC 2016 | Buhler et al., VLSI 2016 | Kim et al., VLSI 2015 | This work
CMOS technology (nm) | 180 | 40 | 65 | 65 | 65
Number of channels | 1 (a) | 1 (a) | 16 (a) | 8 | 8
Area per MAC (mm^2) | 0.106 | 0.012 | 0.0594 | 0.045 | 0.021
Power (μW) | 0.663 | 228 | 3856 | 1300 | 91
Signal bandwidth (kHz) | 10 | 10^6 | 100 | 1500 | 350
Power/Bandwidth (μW/MHz) | 66.3 | 0.228 | 38560 | 866 | 260
Effective analog multiplicand (bit) | 4 | 3 | 14 (b) | 8 (c) | 14
Multiply-accumulate efficiency (pJ/MAC) | 16 (d) | 0.12 | 30000 (d) | 6 | 2
Multiply-accumulate efficiency per multiplicand level (fJ/MAC/level) | 1000 | 15 | 1830 | 23.4 | 0.12
Application labels as extracted (column assignment unclear): Linear; Feature; Sensor; Feature; Spatial; Spatial; Extraction; Classifier; Extraction; Filtering; Filtering.
Slide callouts: Parallelism; High-dynamic range; Energy-efficient.
(a) Serial matrix-vector product. (b) Oversampled, 1-bit per sample. (c) Reported 48 dB signal separation. (d) No analog accumulate.

How can we get there?
How can we build artificial systems that are robust, intelligent, and energy efficient?
Codesign hardware and algorithms
• New sensing and signal acquisition
• Rethink how we build computers
• Develop hardware-aware algorithms for machine intelligence

Compute-in-Memory
[Figure: memory blocks feeding a DSP, contrasted with memory blocks each paired with computing.]

Energy Costs for Memory vs Compute
Digital/analog processing in memory.
Compute-in-memory can bring the combined energy of memory access and computation down to 50 fJ/Op.
[Figure: energy costs in 45nm CMOS at 0.9V, from "Computing's Energy Problem (and what we can do about it)", M. Horowitz, ISSCC 2014.]
Memory access energy ≫ computation energy.
Source: 31.1: Conv-RAM: An Energy-Efficient SRAM with Embedded Convolution Computation for Low-Power CNN-Based Machine Learning Applications, International Solid-State Circuits Conference, © 2018 IEEE.

Compute-in-Resistive Memory
$(V_1 \;\cdots\; V_m) \begin{pmatrix} G_{11} & \cdots & G_{1n} \\ \vdots & \ddots & \vdots \\ G_{m1} & \cdots & G_{mn} \end{pmatrix} = (I_1 \;\cdots\; I_n)$
[Figure: crossbar with drivers applying V1 to Vm, conductances G11 to Gmn, and output currents I1 to In read by ADCs / sense-amps.]
→ Basis of neural network inference and training
• Reduce data movement
→ Low energy consumption
• High density & parallelism
• High throughput
• Larger model

Lack of Dataflow Versatility in CIM Architectures
Feed-forward, e.g. MLP / CNN
Forward & backward, e.g. RBM / Auto-encoder
Recurrent, e.g.
RNN / LSTM
[Figure: dataflow for each case. Feed-forward: layer Ln to Ln+1 through W, INFerence. Forward & backward: visible, hidden, visible through W and W^T, GENerate / INFerence, backprop. Recurrent: X(t) to X(t+1) through W, RECurrent.]
Implementation
[Figure: synapse arrays with drivers, registers, and ADC / neuron blocks on the INF, GEN, and REC paths.]
• Requires twice the neurons
• Requires porting data

RRAM-CIM ASIC with Dataflow Flexibility
130nm CMOS neurons & peripherals + 256×256 HfOx/TEL RRAM weights
[Figure: sub-core (j, k) with bit lines BL 16j … BL 16j+k … BL 16j+15, word lines WL 16j … WL 16j+k … WL 16j+15, and source lines SL 16k … SL 16k+j … SL 16k+15; neuron (j, k) with INF, REC, and GEN paths.]

RRAM-CIM ASIC with Dataflow Flexibility
130nm CMOS neurons & peripherals + 256×256 HfOx/TEL RRAM weights
[Figure: chip micrograph, 1.34 mm × 1.34 mm, with SPI. Transposable Neurosynaptic Core: 256 CMOS neurons & 65K RRAMs, 16x16 RRAM; BL/WL drivers and BL/WL registers (256-bit), REG_BL[0:255], WL[0:255], BL[0:255], SL[0:255]; SL drivers and SL registers (256-bit), REG_SL[0:255]; LFSR PRNG, PRN[0:255]; PG.]
• GEN: Forward MVM
• INF: Backward MVM
• REC: Recurrent MVM

Power Breakdown (@ 0.32 mW, 23 GMACS)
74 TMACS/W
• Static (neurons + biasing): 5%
• Neuron output & all other digital dynamic (1.8V VDD): 26%
• WL switching and MVM input pulses (0V <-> 3V): 15% and 54%
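The crossbar relation behind compute-in-resistive-memory, and the forward and backward MVM directions that a transposable core provides, can be sketched in a few lines. This is an idealized illustration: it assumes each output current is the sum of voltage times conductance along a line (Ohm's and Kirchhoff's laws, no wire resistance, no device nonlinearity). Which physical lines are driven and which are sensed in each mode is an assumption here; the function names and the conductance values are hypothetical.

```python
def forward_mvm(g, v_rows):
    """Drive the m rows, sense the n columns: I_j = sum_i V_i * G_ij."""
    n_rows, n_cols = len(g), len(g[0])
    return [sum(v_rows[i] * g[i][j] for i in range(n_rows))
            for j in range(n_cols)]


def backward_mvm(g, v_cols):
    """Drive the n columns, sense the m rows: I_i = sum_j G_ij * V_j.

    This uses the same stored conductances as the forward pass, so the
    transpose product needs no second copy of the weights.
    """
    n_rows, n_cols = len(g), len(g[0])
    return [sum(g[i][j] * v_cols[j] for j in range(n_cols))
            for i in range(n_rows)]


if __name__ == "__main__":
    # 2 x 3 conductance matrix in siemens (illustrative values only).
    G = [[1e-6, 2e-6, 3e-6],
         [4e-6, 5e-6, 6e-6]]
    print(forward_mvm(G, [0.1, 0.2]))        # three column currents, in amperes
    print(backward_mvm(G, [0.1, 0.1, 0.1]))  # two row currents, in amperes
```

Reading the same array in both directions is what removes the need to duplicate weights or port data between cores when a network needs both W and its transpose.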
How can we get there?
How can we build artificial systems that are robust, intelligent, and energy efficient?
Codesign hardware and algorithms
• New sensing and signal acquisition
• Rethink how we build computers
• Develop hardware-aware algorithms for machine intelligence

Keyword Spotting
[Figure: windowed audio waveform into an LSTM cell (state), producing logits over 12 class labels such as "yes, no, up, down".]
Not amenable to traditional compute-in-memory.

Keyword Spotting
Mapping Recurrent Networks
[Figure: chip micrograph, 1.34 mm × 1.34 mm: Neurosynaptic Core, 16x16 RRAM, BL/WL drivers, BL/WL registers, SL drivers, SL registers & LFSR.]
• Competitive performance
• Energy efficiency improvement without a major sacrifice in accuracy

No Free Lunch: Device Imperfections
a) Read variation between devices, assumed to follow a Gaussian distribution
b) Resistance drift over time
c) Stuck-at fault: forming/manufacturing errors
d) Random telegraph noise, with $\text{Ratio of RTN} = \frac{R}{R + \Delta R}$
[Figure: four panels (a) to (d), labeled 80.2%, 80.2%, 80.2%, and 58.9% in that order.]

RRAM Analog Programming
• RRAM conductance programmed within desired range using write-verify
• Conductance relaxation observed after programming
[Figure: VWL versus time, alternating reset pulse train and set pulse train; resistance (Ω), 10^4 to 10^7, versus pulse number (0 to 10), with the acceptance range marked.]
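The write-verify loop named above can be sketched as follows. This is a toy illustration under stated assumptions, not the programming scheme used on the chip: `ToyRram` is a hypothetical device model in which every set pulse raises and every reset pulse lowers the conductance by a random step, and the step range, tolerance, and pulse budget are made-up values. Conductance relaxation after programming is not modeled.

```python
import random


class ToyRram:
    """Hypothetical cell: each pulse moves conductance by a random step (in μS)."""

    def __init__(self, g_us, seed=0):
        self.g_us = g_us
        self._rng = random.Random(seed)  # seeded so runs are repeatable

    def _step(self):
        return self._rng.uniform(0.2, 1.0)

    def set_pulse(self):
        self.g_us += self._step()

    def reset_pulse(self):
        self.g_us = max(0.0, self.g_us - self._step())

    def read(self):
        return self.g_us


def write_verify(device, target_us, tolerance_us=1.0, max_pulses=200):
    """Pulse until the read value lies inside the acceptance range.

    Returns (success, pulses_used).
    """
    for pulses in range(max_pulses + 1):
        g = device.read()                       # verify step
        if abs(g - target_us) <= tolerance_us:
            return True, pulses
        if pulses == max_pulses:
            break
        if g < target_us:
            device.set_pulse()                  # conductance too low
        else:
            device.reset_pulse()                # conductance too high
    return False, max_pulses


if __name__ == "__main__":
    cell = ToyRram(g_us=5.0)
    ok, used = write_verify(cell, target_us=20.0)
    print(ok, used, round(cell.read(), 2))
```

Because the largest step in this model is smaller than the width of the acceptance range, the loop cannot jump over the target; a real device with larger or asymmetric steps would need adaptive pulse amplitudes or a wider range.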