A FeFET Based Processing-In-Memory Architecture for Solving Distributed Least-Square Optimizations
Insik Yoon1, Muya Chang1, Kai Ni2, Matthew Jerry2, Samantak Gangopadhyay1, Gus Smith3, Tomer Hamam1, Vijaykrishnan Narayanan3, Justin Romberg1, Shih-Lien Lu4, Suman Datta2 and Arijit Raychowdhury1
1Georgia Institute of Technology, Atlanta, GA 30318, USA; 2University of Notre Dame, Notre Dame, IN 46556, USA; 3Pennsylvania State University, State College, PA 16801, USA; 4TSMC, Taiwan
Email: [email protected] / Phone: (678)643-5264

Introduction: HfO2-based ferroelectric FETs (FeFETs) have recently received great interest for their application in nonvolatile memory (NVM) [1]. Unlike conventional perovskite-based ferroelectric materials, HfO2 is CMOS compatible and retains ferroelectricity in thin films with thickness around 10 nm. Successful integration of ferroelectric HfO2 into advanced CMOS technology therefore makes this technology highly promising for NVM [1]. Moreover, by tuning the portion of switched ferroelectric domains, a FeFET can exhibit multiple intermediate states, which enables its use as an analog conductance in mixed-signal in-memory computing. So far, such architectures have been applied to neuromorphic computing [2,3]. In this paper, we present a processing-in-memory (PIM) architecture with FeFETs and demonstrate how it can be used to solve a new class of optimization problems, in particular distributed least-square minimization.

Device Operation and Modeling: The operation of a FeFET is illustrated in Fig. 1. A positive/negative gate pulse erases/programs the device by setting the ferroelectric polarization to point toward the channel/gate, which sets the device in the low/high VTH state, respectively. The corresponding IDS-VGS characteristics for the two states are shown in Fig. 1(c). The separation of the two states is the memory window (MW), which is around 1.0 V for a 10 nm HfO2 FeFET [1].
By applying a read pulse, VR, the two states can be differentiated by sensing the read current. The operation of a FeFET as an analog conductance differs from that of the binary memory in that a series of weak pulses is applied to set the device in the desired state [2]. Of the various pulse schemes proposed in [3] to tune the state, we use the pulse-amplitude modulation scheme (Fig. 2). Fig. 2(e) shows the applied pulse waveform. After each pulse, the percentage of switched ferroelectric domains is modified. The device states are shown in Fig. 2(a)-(d). The device IDS-VGS characteristics corresponding to the different states are shown in Fig. 2(f), which shows the intermediate states. The different states can be sensed by applying a read pulse, VR, and measuring the corresponding drain-to-source conductance, GDS. Fig. 2(g) shows the ideal GDS as a function of the number of applied pulses. GDS increases/decreases linearly and symmetrically with pulse number during potentiation/depression, respectively. The modeling framework for the FeFET is shown in Fig. 3(a). It is composed of two sub-components: (1) a baseline 45 nm MOSFET, which models QMOS(VMOS) in BSIM 4, and (2) the ferroelectric, which models QFE(VFE) using the dynamic Preisach model. Fig. 4(a) and (b) show the measured and simulated ID-VG characteristics under potentiation, and Fig. 5 shows the FeFET channel conductance vs. the number of pulses from both simulations and experiments.

Architecture and Application: In Fig. 6, we present the schematic of a differential FeFET memory cell that computes multiply-accumulate (MAC) operations on positive and negative numbers. One number is stored as a differential conductance in two FeFETs, and the other number is applied as a differential input voltage to BL1 and BL2. When WL is asserted, a differential current flows to the SL and an ADC digitizes the result. The vector is mapped to the input voltages on the BLs, and the matrix entries are mapped to conductances in the FeFET array.
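The differential MAC described above can be illustrated with a small numerical sketch. This is not the authors' circuit model: the function names, the conductance range (`g_min`, `g_max`), the 16-level quantization, and the convention that BL1 carries +v while BL2 carries -v are all assumptions made for illustration.

```python
import numpy as np

def to_conductance_pairs(W, g_min=1e-6, g_max=1e-4, n_states=16):
    """Map signed weights to (G_plus, G_minus) pairs on n_states linear levels.

    g_min, g_max and n_states are hypothetical values, not taken from the paper.
    """
    W = np.asarray(W, dtype=float)
    w_max = max(np.max(np.abs(W)), 1e-30)            # guard against all-zero W
    step = (g_max - g_min) / (n_states - 1)          # conductance per level
    k = np.round(np.abs(W) / w_max * (n_states - 1)) # quantized level index
    g_plus = g_min + step * np.where(W > 0, k, 0.0)  # holds positive weights
    g_minus = g_min + step * np.where(W < 0, k, 0.0) # holds negative weights
    scale = w_max / (g_max - g_min)                  # current -> weight units
    return g_plus, g_minus, scale

def differential_mac(g_plus, g_minus, v):
    """Source-line current per row, assuming BL1 = +v and BL2 = -v."""
    v = np.asarray(v, dtype=float)
    return g_plus @ v + g_minus @ (-v)               # equals (G+ - G-) @ v

W = np.array([[0.8, -0.4, 0.1],
              [-1.0, 0.5, 0.0]])
v = np.array([0.2, -0.5, 1.0])
g_plus, g_minus, scale = to_conductance_pairs(W)
approx = scale * differential_mac(g_plus, g_minus, v)
print(approx, W @ v)
```

The common offset `g_min` cancels in the difference of the two branches, so only the quantization of each weight to a finite number of states contributes error in this idealized picture.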
Fig. 7 describes the end-to-end, vertically integrated model from device to architecture. We use this architecture to demonstrate distributed least-square optimization (with wide applications in signal processing, machine learning, etc.), which solves an inverse problem of the form Az=B, where A is the Grammian matrix of the basis and B is the observation vector [4]. We show that this problem can be mapped to an iterative, distributed PIM architecture, each unit of which is shown in Fig. 8. In Fig. 9, we compare the compute latency and energy dissipation of a compute unit with FeFET memory vs. DAC resolution and number of bits per FeFET cell. The figure also shows the design-space exploration of ADC and DAC bit resolution and number of bits per FeFET cell in terms of latency and energy dissipation. At the system level, this results in a 21X (3X) improvement in energy efficiency (performance) compared to an SRAM+ALU architecture. We use the proposed algorithm and architecture to solve a classic problem, signal reconstruction from non-uniform samples, in two cases: (1) signal reconstruction from 1D EEG signals and (2) recovery of CT images used in medical imaging (Fig. 10). We note more than 35 dB PSNR after signal reconstruction.

Conclusion: In this paper, we present a FeFET-based processing-in-memory architecture and demonstrate its application to solving distributed least-squares optimizations.

References:
[1] H. Mulaosmanovic et al., VLSI Tech. Dig., 2017.
[2] S. Oh et al., IEEE Electron Device Letters, pp. 732-735, 2017.
[3] M. Jerry et al., IEDM, 2017.
[4] Ferreira et al., Signal Processing, 1994.

978-1-5386-3028-0/18/$31.00 ©2018 IEEE
Fig. 1. (a) Erase and (b) program operation of binary FeFET memory. (c) Device IDS-VGS characteristics after the program/erase. The separation of the two curves is the memory window.

Fig. 2. (a)-(d) FeFET states w.r.t. ferroelectric domain switching. (e) Pulse-amplitude modulation scheme. (f) IDS-VGS characteristics after each pulse. (g) Measured GDS w.r.t. applied pulse number.

Fig. 3. (a) Components of the compact model for FeFET and (b) the dynamic Preisach model components, which are composed of a delay unit and the static Preisach model.
Panel equations, model components: $V_G = V_{FE} + V_{MOS}$ and $Q_{FE}(V_{FE}) = Q_{MOS}(V_{MOS})$.
Panel equations, Preisach model: $\frac{dV_{eff}}{dt} = \frac{V_{FE} - V_{eff}}{\tau}$ (delay unit); $P(V_{eff}) = m F(V_{eff}) + b$ (inner loop); $F(V_{eff}) = Q_o \tanh[a(V_{eff} - V_C)]$ (saturated loop).

Fig. 3: (a) Simulated FerroFET channel conductance. (b) Measured FerroFET channel conductance (GDS) as a function of pulse number.
[Plot panels: potentiation and depression, simulation and experimental data; 16 states with linearity; VERS: 1.0 V to 4.0 V with step = 0.1 V; VERS: 2.85 V to 4.45 V with step = 0.05 V.]

Here b is the sample vector, z is the coefficient vector obtained by stacking the coefficients $\alpha(k_x, \omega_x, k_y, \omega_y)$, and A is referred to as the Grammian (Gram) matrix of the basis.

When the size of A matrix is large (as in most applications) a direct solution is not possible. Therefore, alternatively we follow an iterative approach, the Jacobi method. A general update of $z_j$, the $j$th component at the $k$th iteration, is given as (4), where $B = A^T A$ and $c = A^T b$.

$$z_j^{k} = B_{jj}^{-1}\Big(c_j - \sum_{i \neq j} B_{ji}\, z_i^{k-1}\Big) \qquad (4)$$

Some observations are worth emphasizing: (1) To update $z_j^{k}$, only values from the previous iteration are needed. (2) Columns of A are coupled only with neighboring frames, which leads to simpler computation of $B_{ji}$. Such a system maps naturally to a systolic PIM architecture with (1) near neighbor connections and (2) embedded linear algebraic operators on the periphery of the subarray, as will be described in the following sections.

III. FERROFETS - MODELING AND EXPERIMENTAL VERIFICATION

HfO2 based ferroelectric FET (FerroFET) has recently received great interest for its application in nonvolatile memory (NVM) [8]. It is CMOS compatible and retains ferroelectricity for thin film with thickness around 10 nm. By tuning the portion of switched ferroelectric domain, a FerroFET can exhibit multiple intermediate states, which has been used in neuromorphic [...] and pulse-amplitude modulation scheme [10]. For illustration, Fig. 2 illustrates the operation with pulse-amplitude modulation scheme, which is used in this paper. Fig. 2(e) shows the applied pulse waveform. After each pulse, the percentage of switched ferroelectric domains is modified. The device states are shown in Fig. 2(a)-(d). The device IDS-VGS corresponding to different states are shown in Fig. 2(f), which shows the intermediate states. The different states could be sensed by applying a read pulse, VR, the corresponding drain-to-source conductance, GDS, can be sensed. Fig. 2(g) shows the ideal GDS as a function of applied pulse numbers. GDS increases/decreases linearly with pulse number during potentiation/depression, respectively. A symmetrical potentiation/depression is necessary for high accuracy computation. The experimental procedure is outside the scope of this paper and is described in [11].
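As an illustration of the Jacobi update in (4), the following is a minimal software sketch, not the PIM implementation itself. The function name `jacobi_least_squares`, the iteration count, and the small banded test matrix are assumptions chosen so that B = A^T A is strictly diagonally dominant, a sufficient condition for the iteration to converge.

```python
import numpy as np

def jacobi_least_squares(A, b, n_iter=200):
    """Solve min ||A z - b||^2 by Jacobi iteration on the normal equations.

    Implements z_j^k = B_jj^{-1} (c_j - sum_{i != j} B_ji z_i^{k-1})
    with B = A^T A and c = A^T b. n_iter is a hypothetical fixed budget.
    """
    B = A.T @ A
    c = A.T @ b
    d = np.diag(B)              # diagonal entries B_jj
    R = B - np.diag(d)          # off-diagonal coupling terms B_ji, i != j
    z = np.zeros_like(c, dtype=float)
    for _ in range(n_iter):
        # Every component uses only values from the previous iteration,
        # so all updates within one sweep can be computed in parallel.
        z = (c - R @ z) / d
    return z

# Hypothetical example: each column of A couples only to its neighbor,
# which makes B tridiagonal and diagonally dominant.
A = np.eye(6) + 0.1 * np.eye(6, k=1)
b = np.arange(1.0, 7.0)
z = jacobi_least_squares(A, b)
z_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.max(np.abs(z - z_ref)))
```

Because R is banded when columns of A couple only to their neighbors, each component of the product `R @ z` needs values from adjacent components only, which is the property the text relies on for near-neighbor connections.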