A FeFET Based Processing-In-Memory Architecture for Solving Distributed Least-Square Optimizations

Insik Yoon1, Muya Chang1, Kai Ni2, Matthew Jerry2, Samantak Gangopadhyay1, Gus Smith3, Tomer Hamam1, Vijaykrishnan Narayanan3, Justin Romberg1, Shih-Lien Lu4, Suman Datta2 and Arijit Raychowdhury1
1Georgia Institute of Technology, Atlanta, GA 30318, USA, 2University of Notre Dame, Notre Dame, IN 46556 USA, 3Pennsylvania State University, State College, PA 16801 USA, 4TSMC, Taiwan
Email: [email protected] / Phone: (678)643-5264

Introduction: The HfO2 based ferroelectric FET (FeFET) has recently received great interest for its application in nonvolatile memory (NVM) [1]. Unlike conventional perovskite based ferroelectric materials, HfO2 is CMOS compatible and retains ferroelectricity in thin films around 10 nm thick. Successful integration of ferroelectric HfO2 into advanced CMOS technology therefore makes the FeFET highly promising for NVM [1]. Moreover, by tuning the fraction of switched ferroelectric domains, a FeFET can exhibit multiple intermediate states, which enables its use as an analog conductance in mixed-signal in-memory computing. So far, such architectures have been applied to neuromorphic computing [2,3]. In this paper, we present a processing-in-memory (PIM) architecture with FeFETs and demonstrate how it can be used to solve a new class of optimization problems, in particular distributed least-square minimization.

Device Operation and Modeling: The operation of the FeFET is illustrated in Fig. 1. A positive/negative gate pulse erases/programs the device by setting the ferroelectric polarization to point toward the channel/gate, which sets the device in the low/high VTH state, respectively. The corresponding IDS-VGS characteristics for the two states are shown in Fig. 1(c). The separation of the two curves is the memory window (MW), which is around 1.0 V for a 10 nm HfO2 FeFET [1]. When a read pulse, VR, is applied, the two states can be differentiated by sensing the read current.
The operation of the FeFET as an analog conductance differs from that of the binary memory in that a series of weak pulses is applied to set the device in the desired state [2]. Of the various pulse schemes proposed in [3] to tune the state, we use the pulse-amplitude modulation scheme (Fig. 2). Fig. 2(e) shows the applied pulse waveform. Each pulse modifies the percentage of switched ferroelectric domains. The device states are shown in Fig. 2(a)-(d). The device IDS-VGS characteristics corresponding to the different states, including the intermediate states, are shown in Fig. 2(f). Each state can be sensed by applying a read pulse, VR, and measuring the corresponding drain-to-source conductance, GDS. Fig. 2(g) shows the ideal GDS as a function of the number of applied pulses: GDS increases/decreases linearly and symmetrically with pulse number during potentiation/depression, respectively.

The modeling framework for the FeFET is shown in Fig. 3(a). It is composed of two sub-components: (1) a baseline 45 nm MOSFET, which models QMOS(VMOS) in BSIM 4, and (2) the ferroelectric, which models QFE(VFE) using the dynamic Preisach model. Fig. 4(a) and (b) show the simulated and measured IDS-VGS characteristics under potentiation, and Fig. 5 shows the FeFET channel conductance vs. the number of pulses from both simulations and experiments.

Architecture and Application: In Fig. 6, we present the cell schematic of a differential FeFET that computes multiply-accumulate (MAC) operations on positive and negative numbers. One number is stored as a differential conductance value in two FeFETs, and the other number is applied as a differential input voltage to BL1 and BL2. When WL is asserted, the differential current flows to the SL and an ADC digitizes the result. For a matrix-vector product, the vector entries are mapped to the BL input voltages and the matrix entries are mapped to conductances in the FeFET array. Fig. 7 describes the end-to-end, vertically integrated model from device to architecture.
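The differential mapping described above can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the authors' simulator: it assumes ideal, linearly spaced conductance levels (16 per device, chosen for illustration), a hypothetical conductance range `G_MIN` to `G_MAX`, a hypothetical input scale `V_SCALE`, and one plausible reading of the differential drive in which BL1 is at +v and BL2 at -v relative to the line that sums the currents. All function names are made up for this example.

```python
import numpy as np

G_MIN, G_MAX = 1e-6, 1e-4   # hypothetical conductance range in siemens
LEVELS = 16                 # assumed number of programmable states per device
V_SCALE = 0.1               # hypothetical volts per unit of input
STEP = (G_MAX - G_MIN) / (LEVELS - 1)

def weights_to_conductance_pairs(w, w_max):
    """Map signed weights in [-w_max, w_max] to (G_plus, G_minus) pairs."""
    # Quantize |w| to one of LEVELS evenly spaced levels.
    level = np.rint(np.abs(w) / w_max * (LEVELS - 1))
    g_hi = G_MIN + level * STEP
    # The sign of the weight decides which device of the pair is raised.
    g_plus = np.where(w >= 0, g_hi, G_MIN)
    g_minus = np.where(w >= 0, G_MIN, g_hi)
    return g_plus, g_minus

def differential_mac(w, x, w_max):
    """Estimate w @ x from the summed differential currents."""
    g_plus, g_minus = weights_to_conductance_pairs(np.asarray(w, float), w_max)
    v = np.asarray(x, float) * V_SCALE
    # Assumed drive: +v on one bit line, -v on the other; currents add.
    current = g_plus @ v + g_minus @ (-v)
    # Undo the conductance and voltage scaling to return to weight units.
    return current / (STEP * V_SCALE) * w_max / (LEVELS - 1)
```

In this sketch the common offset `G_MIN` cancels in the differential current, so only the quantization of the weights to 16 levels separates the result from the exact dot product.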
We use this architecture to demonstrate distributed least-square optimization (with wide applications in signal processing, machine learning, etc.), which solves an inverse problem of the form Az = B, where A is the Grammian (Gram) matrix of the basis and B is the observation vector [4]. We show that this problem can be mapped to an iterative, distributed PIM in which each unit is as shown in Fig. 8. In Fig. 9, we compare the compute latency and energy dissipation of a compute unit with FeFET memory vs. DAC resolution and number of bits per FeFET cell. The figure also shows the design-space exploration of ADC and DAC bit resolution and number of bits per FeFET cell in terms of latency and energy dissipation. At the system level, this results in a 21X improvement in energy efficiency and a 3X improvement in performance compared to an SRAM+ALU architecture. We use the proposed algorithm and architecture to solve a classic problem, signal reconstruction from non-uniform samples, with two examples shown in Fig. 10: (1) signal reconstruction from 1D EEG signals and (2) recovery of CT images used in medical imaging. We observe a PSNR of more than 35 dB after signal reconstruction.

Conclusion: In this paper, we present a FeFET based processing-in-memory architecture and demonstrate its application to solving distributed least-squares optimizations.

References:
[1] H. Mulaosmanovic et al., VLSI Tech. Dig., 2017.
[2] S. Oh et al., IEEE Electron Device Letters, pp. 732-735, 2017.
[3] M. Jerry et al., IEDM, 2017.
[4] Ferreira et al., Signal Processing, 1994.

978-1-5386-3028-0/18/$31.00 ©2018 IEEE

Authorized licensed use limited to: Georgia Institute of Technology. Downloaded on June 28,2021 at 14:04:12 UTC from IEEE Xplore. Restrictions apply.

Fig. 1. (a) Erase and (b) program operation of binary FeFET memory. (c) Device IDS-VGS characteristics after the program/erase. The separation of the two curves is the memory window.

Fig. 2. (a)-(d) FeFET states w.r.t. ferroelectric domain switching. (e) Pulse-amplitude modulation scheme. (f) IDS-VGS characteristics after each pulse. (g) GDS measured w.r.t. applied pulse number. In (a)-(d), the yellow arrows indicate the polarization direction and the blue/red circles represent electrons/holes, respectively. The initial state is assumed to have all polarizations pointing toward the gate. In (g) the ideal case is presented, which shows linear and symmetrical potentiation and depression.

Fig. 3. (a) Components of the compact model for FeFET and (b) the dynamic Preisach model components, which is composed of a delay unit and the static Preisach model.
Panel (a): $V_G = V_{FE} + V_{MOS}$ and $Q_{FE}(V_{FE}) = Q_{MOS}(V_{MOS})$, with $Q_{FE}(V_{FE})$ from the Preisach model and $Q_{MOS}(V_{MOS})$ from BSIM 4.
Panel (b): delay unit, $\frac{dV_{eff}}{dt} = \frac{V_{FE} - V_{eff}}{\tau}$; static Preisach model, saturated loop $F(V_{eff}) = Q_o \tanh[a(V_{eff} - V_C)]$ and inner loop $P_{FE}(V_{eff}) = m F(V_{eff}) + b$.

Fig. 4. (a) Simulated and (b) measured IDS-VGS characteristics of FerroFET w.r.t. increasing number of potentiation gate pulses. Axes: drain current (A), logarithmic, vs. VG (V) from -0.5 to 1.0.

Fig. 5. FerroFET channel conductance (GDS) as a function of pulse number from (a) simulation and (b) experiment. Axes: conductance (S) vs. # pulses (0 to 64), covering potentiation and depression. Plot annotations: 16 states with linearity; VERS: 1.0 V to 4.0 V, step = 0.1 V; VERS: 2.85 V to 4.45 V, step = 0.05 V. A symmetrical potentiation/depression is necessary for high accuracy computation. In our current exploration we use 16 states (4 bits), which exhibit high linearity, to encode information.

Fig. 6. FerroFET cell schematic and FerroFET memory array schematic. Array labels: 12-bit inputs Z0[11:0] to Z31[11:0], each driving a DAC with differential outputs (Z-, Z+); lines GL0+/GL0- to GL31+/GL31-; word lines WL0 to WLN; cells Bi,j+/Bi,j- for rows 0 to N and columns 0 to 31; row outputs IR,0 to IR,N.

Fig. 7. Flow-chart of design hierarchy from device to system. Levels, from system to device:
System level: below as an accelerator (GEM5, C/C++)
Processor: 64 cores, near neighbor connections (synthesized, verilog, C/C++)
Core: FeFET memory array, peripherals and communication unit (synthesized, verilog)
FeFET memory array (spice, verilog, verilogA)
FerroFETs (verilogA, experimental data)

Here b is the sample vector, z is the coefficient vector obtained by stacking the coefficients $\alpha(k_x, \omega_x, k_y, \omega_y)$, and A is referred to as the Grammian (Gram) matrix of the basis. When the A matrix is large (as in most applications), a direct solution is not possible. Therefore, we follow an iterative approach, the Jacobi method. A general update of $z_j$, the j-th component, at the k-th iteration is given as (4), where $B = A^T A$ and $c = A^T b$.

$$z_j^{k} = B_{jj}^{-1}\Big(c_j - \sum_{i \neq j} B_{ji}\, z_i^{k-1}\Big) \qquad (4)$$

Some observations are worth emphasizing: (1) To update $z_j^k$, only values from the previous iteration are needed. (2) Columns of A are coupled only with neighboring frames, which leads to simpler computation of $B_{ji}$. Such a system maps naturally to a systolic PIM architecture with (1) near neighbor connections and (2) embedded linear algebraic operators on the periphery of the subarray.
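The Jacobi update in (4) can be sketched in a few lines. This is an illustrative NumPy version using the symbols of that passage ($B = A^T A$, $c = A^T b$), not the PIM implementation: `jacobi_least_squares` and its parameters are hypothetical names, the products are computed in floating point rather than by a FeFET array, and convergence is only assured for suitable B (for example, when B is strictly diagonally dominant).

```python
import numpy as np

def jacobi_least_squares(A, b, num_iters=200):
    """Approximate the least-squares solution of A z = b with Jacobi sweeps."""
    B = A.T @ A                       # normal-equation matrix, B = A^T A
    c = A.T @ b                       # right-hand side, c = A^T b
    diag = np.diag(B)                 # the B_jj terms
    off_diag = B - np.diag(diag)      # B with its diagonal zeroed (the B_ji terms)
    z = np.zeros_like(c, dtype=float)
    for _ in range(num_iters):
        # Every component is updated from the previous iterate only,
        # so all components could be computed in parallel.
        z = (c - off_diag @ z) / diag
    return z
```

Because each sweep is one matrix-vector product followed by an elementwise scale, the product `off_diag @ z` is the step that an in-memory MAC array would take over in a PIM setting.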

Fig. 8. Compute unit schematic for non-uniform sampling algorithm. Blocks: FerroFET sub-array w/ B coefficients, ADC, shift and add, adder, and data from other subarrays.

Fig. 9. Compute time and energy behavior of the compute unit versus DAC resolution and storage per FerroFET memory cell: (a) 1 bit/cell, (b) 2 bits/cell, (c) 3 bits/cell and (d) 4 bits/cell.

Fig. 10. Reconstruction steps. (a) 1D example: recovery of EEG signal profile. (b) 2D example: brain computed tomography recovery.
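Reconstruction quality in the text is reported as PSNR. As a reminder of what that figure measures, here is a minimal sketch of the standard definition, PSNR = 10 log10(peak^2 / MSE). The function name `psnr` and the default choice of the reference signal's dynamic range as `peak` are assumptions, since the source does not state which peak convention was used.

```python
import numpy as np

def psnr(reference, reconstructed, peak=None):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape."""
    reference = np.asarray(reference, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    if peak is None:
        # Assumed convention: dynamic range of the reference signal.
        peak = reference.max() - reference.min()
    mse = np.mean((reference - reconstructed) ** 2)
    if mse == 0:
        return float("inf")           # identical signals
    return 10.0 * np.log10(peak ** 2 / mse)
```

Under this convention, a uniform error of 1% of the signal range corresponds to 40 dB.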
