Which is the Best Dual-Port SRAM in 45-nm Process Technology? – 8T, 10T Single End, and 10T Differential – Hiroki Noguchi†, Shunsuke Okumura†, Yusuke Iguchi†, Hidehiro Fujiwara†, Yasuhiro Morita†, Koji Nii†,††, Hiroshi Kawaguchi†, and Masahiko Yoshimoto† † Kobe University, Kobe, 657-8501 Japan. †† Renesas Technology Corporation, Itami, 664-0005 Japan. Phone: +81-78-803-6234, E-mail: [email protected]

read ports. The next section describes their topologies. Abstract— This paper compares readout powers and operating frequencies among dual-port SRAMs: an 8T SRAM, 10T II. CELL TOPOLOGIES single-end SRAM, and 10T differential SRAM. The conventional 8T SRAM has the least count, and is the most area A. 8T SRAM efficient. However, the readout power becomes large and the (a) cycle time increases due to peripheral circuits. The 10T Precharge Precharge circuit single-end SRAM is our proposed SRAM, in which a dedicated signal MC inverter and transmission gate are appended as a single-end read Bitline leakage port. The readout power of the 10T single-end SRAM is reduced by 75% and the operating frequency is increased by 95%, over the 8T SRAM. On the other hand the 10T differential SRAM can MC Memory cell (MC) operate fastest, because its small differential voltage of 50 mV RWL achieves the high-speed operation. In terms of the power WWL Readout current efficiency, however, the sense amplifier and precharge circuits lead to the power overhead. As a result, the 10T single-end P1 P2 Bitline keeper SRAM always consumes lowest readout power compared to the 8T and the 10T differential SRAM. N3 N4 N5 Compensation N6 N1 N2 Index Terms—SRAM, low power, non-precharge-type SRAM, two-port SRAM, differential port, video processing Readout WBL /WBL RBL

I. INTRODUCTION (b) “0” read “0” read “0” read “1” read As the ITRS Roadmap predicts, a memory area is becoming larger, and will occupy 90% of a by 2013 [1]. CLK For example, an H.264 encoder for a high-definition television requires, at least, a 500-kb memory as a search-window buffer, RWL which consumes 40% of its total power [2]. As process No power Precharge technology is scaled down, a large-capacity SRAM will be consumed adopted as a frame buffer and/or a restructured-image memory RBL on a video chip. The large-capacity SRAM potentially dissipates a larger portion of a total power, and dominates a circuit speed. Therefore, low-power and high-speed dual-port Power consumed on RBL Fig. 1. The conventional 8T dual-port SRAM. (a) A schematic and (b) SRAM is strongly required for video processing. In particular, waveforms in read operation. the power and operating frequency in a read operation is crucial since the readout takes place more frequently than The conventional dual-port SRAM cell comprised of eight write-in in a video codec. For instance in a motion estimation, (8T SRAM) [3] is shown in Fig. 1 (a). The 8T once picture data are written in memory, full-search SRAM frees a static noise margin (SNM) in a read operation algorithms or other motion compensation algorithms read out because it has a separated read port. Meanwhile, a precharge the data many times. circuit must be implemented on a read bitline (RBL) so that This paper compares three kinds of dual-port SRAMs in a the two nMOS transistors at the read port can sink a bitline 45-nm process technology. A dual-port SRAM is well utilized charge to the ground. Thus, a certain power is dissipated by for video processing because a read and write accesses are precharging (see Fig. 1 (b)). In addition to the precharge possible at the same time. As the three kinds of dual-port circuit, we have to prepare a bitline keeper on the RBL [4], SRAMs, we handle the conventional 8T SRAM, 10T SRAM which imparts negative influence on a readout time. To make with a single-end read port, and 10T SRAM with differential the matters worse, the delay overhead becomes larger as a

978-1-4244-1811-4/08/$25.00 © 2008 IEEE

supply voltage (VDD) decreases because of the bitline keeper. differential read bitlines (RBL and /RBL) [6]. Two nMOS transistors (N5 and N7) for the RBL and the other additional B. 10T Single End SRAM (10T-S SRAM) nMOS transistors (N6 and N8) for /RBL are appended to the To improve the 8T SRAM, we have proposed a 10T conventional 6T SRAM. As well as the 8T SRAM, precharge non-precharge SRAM with a single-end read bitline [5], as circuits must be implemented on the RBL and /RBL. depicted in Fig. 2 (hereafter, we call “10T-S SRAM”). Two (a) Precharge pMOS transistors are appended to the 8T SRAM cell, which signal Precharge results in the combination of the 6T conventional cell, an circuit MC inverter and transmission gate. The additional signal (/RWL) Bitline is an inversion signal of a read wordline (RWL); it controls the leakage additional pMOS transistor (P4) at the transmission gate. MC Memory While the RWL and /RWL are asserted and the transmission cell (MC) gate is on, a stored node is connected to an RBL through the RWL WWL Readout inverter. It is not necessary to prepare a precharge circuit current because the inverter fully charges/discharges the RBL. P1 P2 (a) N5 N3 N4 N6 MC N7 N8 N1 N2 Readout /RBL Memory cell (MC) MC RBL Inverter (P3 and N5) WBL /WBL Sense Amp. RWL WWL Transmission gate “0” read “0” read “0” read “1” read (P4 and N6) (b) P1 P2 P3 N6 CLK N3 N4 P4 N1 N2 N5 RBL RWL /RWL Readout 50mV WBL /WBL Precharge (b) “0” read “0” read “0” read “1” read RBL

Power consumed on RBL CLK Fig. 3. 10T SRAM with differential read bitlines (10T-D SRAM). (a) A schematic and (b) waveforms in read operation.

RWL Fig. 3 (b) depicts operation waveforms in the 10T-D SRAM Power consumed on RBL in read cycles. The differential bitlines must be precharged to VDD by the start time of a clock cycle. To correctly sense a RBL difference voltage between the RBL and /RBL, the difference No power consumed voltage must be, at least, more than 50 mV [7-9]. Fig. 2. The proposed 10T SRAM with a single-end read bitline (10T-S SRAM). (a) A schematic and (b) waveforms in read operation. III. SIMULATION RESULTS

Fig. 2 (b) illustrates operation waveforms in the 10T-S A. Cell and Macro Layouts SRAM in read cycles. A charge/discharge power on the RBL Fig. 4 portrays the layouts of the three kinds of dual-port is consumed only when the RBL is changed. Consequently, no SRAMs in a 45-nm process technology. The schematics were power is dissipated on the RBL if an upcoming datum is the already shown in the previous figures. The areas of the 8T, same as the previous state. The 10T-S SRAM is suitable for a 10T-S, and 10T-D SRAM cells are 1.55 × 0.41 µm2, 1.97 × real-time video image that has statistical similarity [5]. 0.41 µm2, and 1.95 × 0.41 µm2, respectively. The 10T-S SRAM reduces a bitline power in both cases that We also designed 64-kb SRAM macros in the 45-nm the consecutive “0”s and consecutive “1”s are read out. The process technology, for gate-level simulations. Fig. 5 shows charge and discharge power are consumed, only when a the macro layouts. The core sizes of the 8T, 10T-S, and 10T-D readout datum is changed. In the 10T-S SRAM, the transient SRAM macros are 260 × 443 µm2, 255 × 550 µm2, and 261 × probability on the RBL is 50% in a sequence of random data. 547 µm2, respectively. The 8T SRAM macro is the most area In contrast, in an image, adjacent pixels have strong efficient because of the least . The 10T-D correlation to one-another, so the transient probability is more SRAM macro has, compared to the 10T-S SRAM, a 2% area decreased than random data [5]. overhead due to differential sense amplifiers and precharge C. 10T Differential SRAM (10T-D SRAM) circuits. Fig. 3 (a) shows a schematic of a 10T SRAM with

978-1-4244-1811-4/08/$25.00 © 2008 IEEE

(a) L (a) VDD RBL GND GND WBL

/WB 443µm Address decoder N2 P2 N3 N5

µm Memory cell

41 block N4 P1 N1 N6 0. Read

260µm circuit 1.55µm Transistors N1 – N5, P1, P2 N6 Width / Length 0.1µm / 0.04µm 0.2µm / 0.04µm (b) Write VDD RBL RBL GND VDD GND WBL driver /WBL

N2 P2 N3 N6 P4

µm (b)

41 N4 P1 N1 N5 P3 550µm 0. Address decoder Memory 1.97µm cell Transistors N1 – N6, P1 – P4 block Width / Length 0.1µm / 0.04µm Read circuit (c) 255µm VDD RBL GND GND WBL /RBL /WBL

N7 N2 P2 N3 N6 µm Write

41 driver N5 N4 P1 N1 N8 0. 1.95µm (c) Transistors N1 – N6, P1, P2 N7, N8 547µm Address Width / Length 0.1µm / 0.04µm 0.2µm / 0.04µm decoder Fig. 4. Cell layouts of (a) 8T, (b) 10T-S, and (c) 10T-D SRAMs, in a 45-nm Memory process technology. cell block Sense amp.

B. Operating Frequency versus Supply Voltage 261µm To obtain an operating frequency, we carried out Monte Carlo simulations, considering threshold voltage variation of each transistor. The number of Monte Carlo samples is Write 20,000. driver Fig. 6 shows operating waveforms in the three kinds of

SRAMs. In the figure, we adopt the SS corner model to Fig. 5. Macro layouts of (a) 8T, (b) 10T-S, and (c) 10T-D SRAMs, in a simulate the worst-case delay. The followings are the criteria 45-nm process technology. The total memory capacity of each macro is 64 kb. to calculate the access times: (a) 1.0 • In the 8T SRAM, an access time is a period from a time RBL Access RWL at which an RWL rises to VDD/2 to a time at which an time Readout output of the sense amplifier is charged up to VDD/2. 50% of VDD Voltage [V] Voltage • In the 10T-S SRAM, an access time is a longer one: a 0 periods from a time at which an RWL rises to VDD/2 1.0 (b) RBL Access RWL to a time at which an output of the sense amplifier is time Readout charged up to VDD/2, or a period from a time at which 50% an RWL rises up to VDD/2 to a time that an output of of VDD Voltage [V] Voltage 0 Min the sense amplifier is discharged down to VDD/2. swing • In the 10T-D SRAM, an access time is a period from a (c) 1.0 width Access RBL time RWL Max time at which an RWL rises to VDD/2 to a time at /RBL swing width which a differential voltage between an RBL and /RBL 50% of VDD is expanded to 50 mV, 100 mV or 200 mV. [V] Voltage 0 In all the SRAMs, the worst cell with the worst 0.2 1.0 2.0 3.0 4.0 Time [ns] threshold-voltage variation determines the delay. In particular, Fig. 6. Operation waveforms of (a) 8T, (b) 10T-S, and (c) 10T-D SRAMs at in the 10T-D SRAM, even if the sense point is set to 50 mV, the SS corner (temperature = 25C°). most cells sink the bitline more than 50 mV.

978-1-4244-1811-4/08/$25.00 © 2008 IEEE

Fig. 7 shows the characteristics of the operating frequency The readout power in the 10T-S SRAM is 25 % lower than when VDD is changed. The operating frequency is calculated the 10T-D SRAM at the operating frequency of 294 MHz, as an inverse of a cycle time, which is a sum of a bitline when random data is considered. The saving factor is charge/discharge time plus propagation delays in decoder maximized to 63% if readout data has statistical similarly like circuits and a wordline. In this simulation of the operating H.264 reconstructed image data [5]. frequency, the precharge periods in the 8T and 10T-D SRAMs are not taken into account because it can be completely IV. SUMMARY overlapped with the decoder operation. In this paper, we clarified that the 10T SRAM with a At a supply voltage of 1.0 V, the 8T, 10T-S and 10T-D single-end read port is the best as a dual-port SRAM in a SRAMs can run at 294 MHz, 572 MHz and 755 MHz, 45-nm process technology. Although the conventional 8T respectively. The 10T-D SRAM can operate fastest. The small SRAM has the least transistor count, and is the most area differential voltage of 50 mV achieves the high-speed efficient, the readout power becomes large and the cycle time operation. The second fastest is the 10T-S SRAM. Although increases due to the keeper circuits on the read bitlines. The the additional transistor (P4) is appended in the 10T-S SRAM 10T differential-port SRAM can operate fastest. In terms of (see Fig. 2 (a)) and increases an RBL capacitance, the 10T-S the power efficiency, however, even if the sense point is set to SRAM is faster than the 8T SRAM because neither precharge 50 mV, most cells sink the bitline more than 50 mV, which circuit nor keeper circuit is needed. leads to the power overhead. As a result, the 10T single-end 800 8T SRAM 183MHz 755MHz SRAM always consumes the lowest readout power of the 10T-S SRAM faster three. 10T-D SRAM 600 572MHz ACKNOWLEDGMENT 461MHz faster 278MHz This work has been supported by Renesas Technology 400 faster Corporation. 294MHz

200 REFERENCES [1] International Technology Roadmap for 2005 Edition, http://www.itrs.net/Links/2005ITRS/Home2005.htm. Operating frequency frequency Operating [MHz] 0 [2] J. Miyakoshi, Y. Murachi, K. Hamano, T. Matsuno, M. Miyama, and M. 0.78V 0.89V 0.5 0.6 0.7 0.8 0.9 1 Yoshimoto, “A Low-Power Systolic Array Architecture for Supply voltage [V] Block-Matching Motion Estimation,” IEICE Transactions on Fig. 7. Operating frequencies when a supply voltage is changed. Electronics, vol.E88-C, no.4, pp.559-569, April 2005. [3] L. Chang, D.M. Fried, J. Hergenrother, J.W. Sleight, R.H. Dennard, R.K. Montoye, L. Sekaric, S.J. McNab, A.W. Topol, C.D. Adams, K.W. C. Power Guarini, and W. Haensch, “Stable SRAM cell design for the 32 nm node and beyond,” IEEE Symposium on VLSI Technology Digest of Technical Fig. 8 compares the readout powers in the 8T, 10T-S, and Papers, pp.128-129, June 2005. 10T-D SRAMs. Note that VDD is changed in the lines, [4] R. K. Krishnamurthy, A. Alvandpour, G. Balamurugan, N. R. Shanbhag, according to Fig. 7. The 10T-S SRAM is not the fastest but the K. Soumyanath, and S. Y. Borkar, “A 130-nm 6-GHz 256 × 32 bit lowest power. This is because the transition possibility of the leakage-tolerant register file,” IEEE Journal of Solid-State Circuits, vol.37, no.5, pp.624-632, May 2002. RBL is 50% when a sequence of random data is considered. [5] H. Noguchi, Y. Iguchi, H. Fujiwara, Y. Morita, K. Nii, H. Kawaguchi, On the other hand, in the 10T-D SRAM, the average voltage and M. Yoshimoto, “A 10T Non-Precharge Two-Port SRAM for 74% difference between the RBL and /RBL is 80% of VDD (= 0.8 Power Reduction in Video Processing,” Proceedings of IEEE Computer Society Annual Symposium on VLSI, pp.107-112, May 2007. V) as mentioned in the previous subsection, even if the sense [6] N. Shibata, H. Kiya, S. Kurita, H. Okamoto, M. Tan'no, T. Douseki, “A point in the worst-case is set to 50 mV. 0.5-V 25-MHz 1-mW 256-kb MTCMOS/SOI SRAM for solar-power-operated portable personal digital equipment - sure write 1 0.97mW 8T SRAM (1V) operation by using step-down negatively overdriven bitline scheme,” Random IEEE Journal of Solid-State Circuits, vol.41, no.3, pp.728-742, March 10T-S SRAM Image 25% 63% 0.8 lower lower 2006. 10T-D SRAM [7] N. Verma, A. P. Chandrakasan, “A 256 kb 65 nm 8T Subthreshold 0.64mW (0.78V) SRAM Employing Sense-Amplifier Redundancy,” IEEE Journal of 0.6 Solid-State Circuits, vol.43, no.1, pp.141-149, January 2008. [8] R.E. Aly, M.A. Bayoumi, M. Elgamel, “Dual sense amplified bit lines 0.48mW 0.4 (0.89V) (DSABL) architecture for low-power SRAM design,” Proceedings of IEEE International Symposium on Circuits and Systems 2005, vol.2, 0.24mW pp.1650-1653, May 2005. 0.2 (0.89V) [9] S. Ohbayashi, M. Yabuuchi, K. Nii, Y. Tsukamoto, S. Imaoka, Y. Oda, Readout[mW] power T. Yoshihara, M. Igarashi, M. Takeuchi, H. Kawashima, Y. Yamaguchi, 10T-S SRAM 294MHz 0 K. Tsukamoto, M. Inuishi, H. Makino, K. Ishibashi, H. Shinohara, “A 0 50 100 150 200 250 300 65-nm SoC Embedded 6T-SRAM Designed for Manufacturability With Operating frequency [MHz] Read and Write Operation Stabilizing Circuits,” IEEE Journal of Fig. 8. Readout power versus operating frequencies in a 45-nm process Solid-State Circuits, vol.42, no.4, pp.820-829, April 2007. technology.

978-1-4244-1811-4/08/$25.00 © 2008 IEEE