
Power Performance of IGZO DRAM memory

A Thesis

Presented to the Faculty of the School of Engineering and Applied Science

University of Virginia

In Partial Fulfillment

of the requirements for the Degree

Master of Science (Electrical Engineering)

by

Mandi Das

May, 2019

APPROVAL SHEET

This Thesis is submitted in partial fulfillment of the requirements for the degree of Master of Science

Author Signature:

This Thesis has been read and approved by the examining committee:

Advisor: Mircea R. Stan

Committee Member: Kyusang Lee

Committee Member: Steven M. Bowers

Accepted for the School of Engineering and Applied Science:

Craig H. Benson, School of Engineering and Applied Science

May 2019

© Copyright by Mandi Das 2019

Abstract

C-axis aligned crystalline In-Ga-Zn oxide (CAAC-IGZO) is a novel crystal morphology discovered by Semiconductor Energy Laboratory that differs from the single-crystal, polycrystalline and amorphous morphologies. It is neither single-crystal nor amorphous and has no distinct grain boundaries, which makes this material a viable candidate for thin-film transistors. One property of particular interest is its high on-current to off-current ratio, which enables extremely low power consumption and allows it to be implemented as a virtually non-volatile memory for emerging memory technologies. In this thesis we work on a design strategy to replace SRAM with IGZO 1T1C DRAM as the last-level on-chip cache for microprocessors, and study both the circuit-level simulation and the power performance at the system level.

Acknowledgments

I would like to thank Dr. Mircea R. Stan for providing me with an opportunity to work on this fascinating project. I am able to reach this point because of his tremendous support and guidance. I’ll cherish the experiences and memories I had during my stay at UVA. I’m thankful for the honourable presence of my committee members (Dr. Mircea R. Stan, Dr. Steven M. Bowers, Dr. Kyusang Lee) for making my defense a wonderful experience and shaping me as a researcher and engineer. Lastly, I would like to thank my fellow members of HPLP, who have always supported and motivated me as friends.

Contents

1 Introduction
1.1 Motivation
1.2 Contribution
1.3 Organization

2 Background Information
2.1 Memory
2.2 SRAM
2.3 DRAM
2.4 Flash memory
2.5 CAAC-IGZO
2.6 Retention Latch
2.7 NOSRAM and DOSRAM
2.7.1 NOSRAM
2.7.2 DOSRAM
2.8 Summary

3 Circuit analysis
3.1 Device model
3.2 Circuit implementation of the device model
3.3 Results

4 Cache Architecture Study
4.1 CACTI
4.2 Implementation
4.3 Results

5 Multi-core Full-system simulation
5.1 GEM5
5.2 Implementation
5.3 Benchmark
5.4 McPAT
5.5 Implementation
5.6 Power model analysis

6 Conclusion

7 Future Works

List of Figures

2-1 Memory Hierarchy [4]
2-2 A six-transistor CMOS SRAM cell
2-3 1T1C DRAM cell
2-4 CAAC morphology [13]
2-5 Stacked structure of OS FET and Si FET [5]
2-6 I-V characteristics of CAAC-IGZO and Si transistors [1]
2-7 I-V curve of IGZO below subthreshold [1]
2-8 Typical latch to latch Synchronous design
2-9 Retention latch with 1T1C IGZO element
2-10 1T1C circuit for the backup element
2-11 Retention period for various values of Vg [1]
2-12 (a) Circuit diagram. (b) Cross-sectional view. [5]
2-13 SRAM with IGZO back-up circuit [13]
2-14 Topology of the SRAM memory cell with backup circuit [13]
2-15 1T1C IGZO DRAM [13]
2-16 (a) Without stack structure. (b) With the stack structure. [11]

3-1 IGZO DUT for simulations [13]
3-2 I-V graph of the IGZO verilog-a model

4-1 Cache model used in CACTI [14]

5-1 A system configuration with a two-level cache hierarchy [8]
5-2 Mobile full-system structure [9]
5-3 Block diagram of the McPAT framework [6]
5-4 Implementing to the McPAT framework
5-5 Power components during sleep
5-6 Power components during active

List of Tables

3.1 Retention period calculation with C = 10fF
3.2 Retention period calculation with C = 1fF
3.3 Retention period calculation with C = 0.2fF

4.1 IGZO-FET parameters for CACTI
4.2 Cache organization parameters
4.3 Timing and power for different memory types
4.4 Timing and refresh power

5.1 Mobile full-system configuration [9]

Chapter 1

Introduction

With emerging technologies, the scaling of devices down to the nanometer regime has made its way into handheld high-performance gadgets. Today the majority of the global population accesses internet services, and many users are connected through battery-operated devices. A 2015 GlobalWebIndex survey showed that 80% of internet users own a smartphone, almost half own a tablet, and the majority of these almost two trillion mobile devices run the Android Operating System. Consumer devices, which include smartphones, tablets and wearables, have seen tremendous growth in the last decade because of their portability and performance. For this class of devices, energy efficiency is a first-class concern due to the limited battery capacity and thermal power budget. Data movement is a major contributor to the total system energy and execution time in consumer devices: the energy and performance costs of moving data between the memory system and the compute units are significantly higher than the costs of computation.

Designing low-power devices is one of the foremost challenges in current technologies. In modern SoCs, the on-chip cache memories are the components that consume most of the power and area. SRAM consumes a lot of power due to both static and dynamic power dissipation. DRAM additionally needs to refresh its data because the charge on the storage capacitor degrades. Non-volatile memories are an active area of research and offer advantages in terms of low-voltage operation and high-speed data writing and reading, but these memories have limited endurance and a low on/off ratio.

Semiconductor Energy Laboratory developed a new crystal structure for oxide semiconductors known as the C-Axis Aligned Crystalline In-Ga-Zn Oxide thin-film transistor (CAAC-IGZO TFT) [5]. This device has very low parasitic leakage and low channel leakage. Because of the nature of the crystalline structure, there is no distinct boundary in the junction, making it a high-endurance device. Since it has very low leakage, it can be designed to be used as a nonvolatile memory.

In this thesis, we study the structures of Nonvolatile Oxide Semiconductor Random Access Memory (NOSRAM) [5] and Dynamic Oxide Semiconductor Random Access Memory (DOSRAM) [1]. We evaluate the parameters of the CAAC-IGZO FET that enable us to select a feasible type of memory with a retention period of hours. We then use the open-source tools CACTI, gem5 and McPAT to study the timing and power savings of using the IGZO device as cache memory.

1.1 Motivation

Dynamic Random Access Memory (DRAM) is a volatile memory that needs to be periodically refreshed (typically every 64 ms [12]) to retain the data in the storage element. If it is not refreshed, the stored bit is lost because charge leaks from the storage element through the access transistor. This frequent refresh is a source of power consumption in DRAM. The CAAC-IGZO TFT has a very low off-state current, making it an extremely low-leakage device. Using this device as the access transistor of a bit cell allows a much longer refresh period, thereby reducing the power consumption of the DRAM.

1.2 Contribution

∙ HSPICE was used to simulate the IGZO-TFT Verilog-A model and evaluate the optimum range of operating voltage with low static power dissipation.

∙ The IGZO-DRAM was implemented as last-level cache memory in a full-system architecture to evaluate the timing and the power performance.

1.3 Organization

The rest of the thesis is organized as follows:

∙ Chapter 2 provides the background information of different memory types and the motivation to use IGZO based memory.

∙ Chapter 3 studies the characteristics of the IGZO Verilog-A model using HSPICE and finds the range of operating voltage for a longer retention period.

∙ Chapter 4 implements the cache structure with the IGZO-DRAM as last-level cache memory in CACTI and studies the access and cycle timings.

∙ Chapter 5 implements the IGZO-DRAM as last-level cache in a full-system architecture using GEM5 and studies the power performance using McPAT.

∙ Chapter 6 summarizes this thesis.

∙ Chapter 7 outlines future research work following this thesis.

Chapter 2

Background Information

2.1 Memory

Memory is essential for the operation of a computer system, and intense research and development is devoted to the performance of the fastest component, the cost per bit of the cheapest component and the energy consumption of the most energy-efficient component. For years, the use of a memory hierarchy has been very convenient, in that it has simplified the process of designing memory systems. The use of a hierarchy has allowed designers to treat system design as a modularized process comprising individual subsystems (caches, DRAMs, flash) [4].

Figure 2-1: Memory Hierarchy [4]

Fig. 2-1 shows the memory hierarchy, where level 1 and level 2 are usually cache memories. They are faster memories closer to the processor core, which store copies of the data from frequently used main memory locations. The L1 cache is split into L1I (instruction) and L1D (data), and the L2 cache is usually shared in a multi-core processor. On-chip cache memory cells are larger, so the cost per bit is high. The next level is the main memory, which is slower than cache memory but has a high packing density, reducing the cost per bit. The last level is generally non-volatile storage such as disks and flash drives.

2.2 SRAM

SRAM stands for Static Random Access Memory. It is typically used for CPU caches because it is faster than other memories and does not need to be refreshed periodically. It is volatile in the conventional sense that data is eventually lost when the memory is not powered. Fig. 2-2 shows the schematic of a 6T SRAM memory cell. The cost per bit is high since the cell occupies a larger area to store a single bit, and the power consumption is higher because the cross-coupled inverters are always on in order to retain the data.

Figure 2-2: A six-transistor CMOS SRAM cell

2.3 DRAM

DRAM stands for dynamic random access memory. Dynamic refers to the need to periodically refresh DRAM cells so that they can continue to retain the stored bit. Because of the small footprint of a DRAM cell, DRAM can be produced in large capacities thereby reducing the cost per bit. By packaging DRAM cells judiciously, DRAM memory can sustain large data rates. For these reasons, DRAM is used to implement the bulk of main memory. Fig. 2-3 shows a 1T1C DRAM memory cell. The data bit is stored in the capacitor as charge and the transistor is used to access the data.

Figure 2-3: 1T1C DRAM cell

2.4 Flash memory

Flash memories are non-volatile and can hold data even without power. They are significantly slower than RAM and are primarily used for storage. They have a limited number of erase and write cycles.

2.5 CAAC-IGZO

CAAC-IGZO [10] is a novel crystal morphology discovered by Semiconductor Energy Laboratory that differs from the single-crystal, polycrystalline and amorphous morphologies. It is neither single-crystal nor amorphous and has no distinct grain boundaries. These characteristics make the material a viable candidate for thin-film transistors. It has a low short-channel effect, a high on-current to off-current ratio that enables extremely low power consumption, a high frequency of operation, better high-temperature and drain-voltage tolerance and several other features that make it an ideal nMOS device. CMOS can be implemented in conjunction with Si pMOS.

Figure 2-4: CAAC morphology [13]

One important feature of CAAC-IGZO is its extremely low off-state current. Fig. 2-6 compares the I-V curve of this device against that of Si below the subthreshold region. This feature is utilized in the design of Non-volatile Oxide Semiconductor Random Access Memory (NOSRAM) [13] and Dynamic Oxide Semiconductor Random Access Memory (DOSRAM) [13]. Crystalline oxide semiconductors have been successfully implemented in OLED and LCD displays, where they allow the display to be refreshed less often when showing a static image, resulting in lower power consumption. Research is also ongoing in the field of LSI devices such as image sensors, memories, CPUs and AI. The CAAC-IGZO FET can be fabricated in the BEOL, resulting in a smaller area footprint [1]. The IGZO FET has a low off-state current ($I_{off}$) of around $10^{-24}$ A/µm, which is 15 times lower than a Si FET, making it a very low-leakage device that is almost non-volatile. This property is exploited in the design of retention latches, SRAM and DRAM [10].

Figure 2-5: Stacked structure of OS FET and Si FET [5]

Figure 2-6: I-V characteristics of CAAC-IGZO and Si transistors [1]

2.6 Retention Latch

A 1T1C backup element with CAAC-IGZO was implemented in a retention latch to study the power savings. In a typical latch-to-latch synchronous design, as shown in Fig. 2-8, power is saved through clock gating and power gating. However, leakage power is still consumed when the combinatorial logic path is turned off and the latches hold on to a state.

Figure 2-7: I-V curve of IGZO below subthreshold [1]

Figure 2-8: Typical latch to latch Synchronous design.

In order to save power, a 1T1C backup element was connected to the latches as shown in Fig. 2-9. Here the latches can be completely turned off and can regain their state when the backup element is turned on. Because of the low leakage of the IGZO FET, the storage capacitance can have a lower value and the backup element can have a long retention period. With this, fast save and reset latches can be obtained, resulting in fine power-gating granularity. This enables normally-off/normally-on architectures. The retention time for the 1T1C can be approximated by:

$$T = \frac{V_i \cdot C_s}{I_{off}} \cdot 10^{-\frac{V_g}{SS}}$$

where $V_i$ is the storage voltage across the capacitor, $C_s$ is the storage capacitance, $I_{off}$ is the off-current of the IGZO FET, $V_g$ is the gate under-drive voltage and $SS$ is the subthreshold slope of the device.

Figure 2-9: Retention latch with 1T1C backup IGZO element.

Figure 2-10: 1T1C circuit for the backup element.

The retention period grows exponentially with gate under-drive and linearly with storage capacitance size.
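Both scaling statements follow directly from the retention-time expression. As a short check, holding $I_{off}$ fixed and taking a subthreshold slope of 80 mV/dec (the value used in Chapter 3; another slope would change only the numbers), lowering the gate voltage by $\Delta V$ = 0.1 V or scaling the capacitor by a factor $k$ (a symbol introduced here for illustration) gives:

```latex
\begin{align}
\frac{T(V_g - \Delta V)}{T(V_g)} &= 10^{\Delta V / SS}
  \quad\Rightarrow\quad 10^{0.1/0.08} = 10^{1.25} \approx 17.8, \\
\frac{T(k\,C_s)}{T(C_s)} &= k .
\end{align}
```

Under these assumptions, each additional 100 mV of under-drive lengthens the retention by roughly 18 times, whereas doubling $C_s$ only doubles it.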

2.7 NOSRAM and DOSRAM

2.7.1 NOSRAM

Non-volatile Oxide Semiconductor Random Access Memory [13] was implemented with an IGZO FET, utilizing its low off-state current. The resulting circuit had high endurance, faster read/write speeds and several days of retention [5]. NOSRAM has the potential to be a nonvolatile memory that surpasses other nonvolatile memories under development in terms of on/off ratio and endurance.

Figure 2-11: Retention period for various values of Vg [1]

The other design is a 6T SRAM with a back-up circuit built from IGZO FET devices. The backup circuit is connected to the inverter loop and has two IGZO FETs and two storage capacitors, as shown in Fig. 2-13. The backup circuit is stacked on top of the memory cell (Fig. 2-14), resulting in no area overhead. This design can be used to replace L1 cache memories. The storage element can be put to sleep with the data retained in the back-up circuit, so the cache does not need to be repopulated after the wake-up signal, which saves overall power.

2.7.2 DOSRAM

Dynamic Oxide Semiconductor Random Access Memory [13] is conceptually similar to 1T1C Si DRAM. The access transistor is replaced with the IGZO device, which has extremely low leakage, so the storage capacitor retains its state for a longer period of time. The long retention period results in a larger refresh interval.

Figure 2-12: (a) Circuit diagram. (b) Cross-sectional view. [5]

Figure 2-13: SRAM with IGZO back-up circuit [13]

Figure 2-14: Topology of the SRAM memory cell with backup circuit [13]

Here, a metal-insulator-metal (MIM) capacitor is used rather than the traditional substrate trench capacitor.

Figure 2-15: 1T1C IGZO DRAM [13]

Since the memory element is in BEOL, the peripheries can be stacked under the arrays which reduces the area overhead. Furthermore, this shortens the bitline which reduces the capacitance of the BL resulting in faster access.

Figure 2-16: (a) Without stack structure. (b) With the stack structure. [11]

2.8 Summary

The CAAC-IGZO FET has very low leakage, which can be utilized in a back-up circuit for SRAM to save power when the data in the storage element is not being accessed. When the CAAC-IGZO FET is used as the access transistor in a 1T1C DRAM cell, the cell becomes a virtually nonvolatile memory. Considering architecture-based research studies, implementing DOSRAM, or IGZO-DRAM, appears more feasible with respect to power and area savings. Moreover, there is no change in the area of the SRAM after implementing the backup circuit, whereas in DRAM there is a negative area overhead because the peripherals can be stacked under the memory array, resulting in a denser and larger array. This makes it more cost-effective.

Chapter 3

Circuit analysis

3.1 Device model

HSPICE was used as the simulation tool because of its versatile circuit simulation capabilities and its ability to optimize device model parameters to fit measurement data. The Verilog-A device model used for our simulations is from Samsung. The device model is similar to that of an nFET, modified to accommodate the AC and DC characteristics of the CAAC-IGZO FET, making it feasible for simulation and fabrication.

3.2 Circuit implementation of the device model

SPICE simulations were done to validate the electrical characteristics of the CAAC-IGZO FET Verilog-A model. The design under test was the CAAC-IGZO FET, and a DC analysis was done by sweeping the gate under-drive voltage. From previous work [1], the off-state current of the transistor is around $100 \times 10^{-24}$ A/µm and the storage capacitance was 32 fF. In order to achieve that off-state current, the gate of the transistor needs to be at a negative voltage. Fig. 3-2 shows the I-V characteristic of the CAAC-IGZO FET. The FET used for the measurement has a channel length of 10 nm and a channel width of 10 nm. The DC sweep of the gate under-drive was from -1 V to 1 V. This gives us an $I_{off}$ current of $2.34 \times 10^{-24}$ A. Calculating the retention period for the simulated off-state current gives us years of retention. We observed that the retention period was linearly related to the storage capacitance value. With the use of BEOL capacitors, we could have used a higher capacitance for a longer retention period, but we used a nominal value of 10 fF to keep the operation of the device faster.

Figure 3-1: IGZO DUT for simulations [13]

Figure 3-2: I-V graph of the IGZO verilog-a model.

Based on the current battery technology of IoT/mobile devices, a retention period of days was chosen for the design and simulations. The corresponding gate voltages of the access transistor are 0.8 V for $I_{on}$ and -0.4 V for $I_{off}$.

$$T = \frac{V_i C_s}{I_{off}} \cdot 10^{-\frac{V_g}{SS}}$$

where $V_i$ = 0.8 V, $C_s$ = 0.2 fF, $I_{off}$ = $4.17 \times 10^{-17}$ A, $V_g$ = -0.4 V and $SS$ = 80 mV/dec. The calculated retention time from the above equation is around 4.5 days, which is a reasonable refresh interval for the memory designs, as opposed to the current technology with a refresh interval of around 64 ms.

3.3 Results

In the above section, we used HSPICE to study the I-V characteristics of the CAAC-IGZO FET Verilog-A model. In order to find a suitable voltage range for the gate under-drive, we ran several simulations, as shown in Tables 3.1, 3.2 and 3.3. Each table corresponds to a different value of the storage capacitance.

| $V_{DD(on)}$ | $V_{DD(off)}$ | $I_{on}$ (A) | $I_{off}$ (A) | Ratio | Retention period |
|---|---|---|---|---|---|
| 0.8V | 0V | 1.77E-06 | 4.13E-12 | 4.3E+05 | 1.93 ms |
| 1.6V | 0V | 9.03E-06 | 4.04E-11 | 2.2E+05 | 0.4 ms |
| 0.8V | -0.1V | 1.77E-06 | 2.34E-13 | 7.53E+6 | 607 ms |
| 0.8V | -0.2V | 1.77E-06 | 1.32E-14 | 1.34E+8 | 191 s |
| 0.8V | -0.3V | 1.77E-06 | 7.4E-16 | 2.39E+9 | 17 Hours |
| 0.8V | -0.4V | 1.77E-06 | 4.17E-17 | 4.17E+10 | 7 Months |
| 0.8V | -0.5V | 1.77E-06 | 2.34E-18 | 7.6E+11 | 192 years |

Table 3.1: Retention period calculation with C = 10fF

| $V_{DD(on)}$ | $V_{DD(off)}$ | Retention period |
|---|---|---|
| 0.8V | -0.3V | 1.68 Hours |
| 0.8V | -0.4V | 22 Days |

Table 3.2: Retention period calculation with C = 1fF

| $V_{DD(on)}$ | $V_{DD(off)}$ | Retention period |
|---|---|---|
| 0.8V | -0.3V | 20 mins |
| 0.8V | -0.4V | 4.5 Days |

Table 3.3: Retention period calculation with C = 0.2fF

From the simulations, we observe that the optimal gate voltage range for the 1T1C IGZO-DRAM is -0.4 V to 0.8 V, which gives us a retention period of 4.5 days compared to the 64 ms refresh interval of a traditional DRAM.
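As a cross-check, the retention periods in Tables 3.1 to 3.3 can be recomputed from the tabulated off-currents. The sketch below assumes that the $I_{off}$ term of the retention equation is the simulated off-current at each gate bias, with $V_i$ = 0.8 V and $SS$ = 80 mV/dec; the function and variable names are illustrative and not part of any tool used in this work.

```python
# Sketch: recompute the retention periods of Tables 3.1-3.3 from the
# tabulated off-currents. Function and variable names are illustrative.

SS = 0.080  # subthreshold slope in V/decade (80 mV/dec, from the text)

# Gate under-drive (V) -> simulated off-current (A), copied from Table 3.1
I_OFF = {
    0.0: 4.13e-12,
    -0.1: 2.34e-13,
    -0.2: 1.32e-14,
    -0.3: 7.4e-16,
    -0.4: 4.17e-17,
    -0.5: 2.34e-18,
}


def retention_seconds(v_i, c_s, v_g, ss=SS):
    """T = (V_i * C_s / I_off) * 10**(-V_g / SS), with I_off taken at v_g."""
    return v_i * c_s / I_OFF[v_g] * 10 ** (-v_g / ss)


def pretty(seconds):
    """Format a duration with the largest unit that keeps the value >= 1."""
    units = [("years", 365 * 86400), ("days", 86400), ("hours", 3600),
             ("min", 60), ("s", 1), ("ms", 1e-3)]
    for name, size in units:
        if seconds >= size:
            return f"{seconds / size:.2f} {name}"
    return f"{seconds:.3e} s"


if __name__ == "__main__":
    for c_s in (10e-15, 1e-15, 0.2e-15):
        print(f"C_s = {c_s * 1e15:g} fF")
        for v_g in sorted(I_OFF, reverse=True):
            t = retention_seconds(0.8, c_s, v_g)
            print(f"  V_g = {v_g:+.1f} V -> {pretty(t)}")
    # How much longer than a 64 ms refresh interval is the chosen point?
    t_chosen = retention_seconds(0.8, 0.2e-15, -0.4)
    print(f"ratio to 64 ms: {t_chosen / 64e-3:.2e}")
```

Under these assumptions the script gives about 4.4 days for 0.2 fF at -0.4 V, in line with the 4.5 days quoted above, which is roughly six million times the 64 ms interval.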

Chapter 4

Cache Architecture Study

4.1 CACTI

CACTI [14] by HP Labs is an open-source tool used for high-level modeling of caches and memories. It calculates the access time and cycle time of hardware caches, accurate to within 10% when compared with an HSPICE model. The tool uses an analytical model to estimate the delay down both the tag and data paths to determine the best configuration for a given cache size, block size and associativity. For a given set of input parameters, the tool performs a detailed design space exploration across different array organizations and on-chip interconnects, and outputs a design that meets the input constraints. Fig. 4-1 shows the basic memory architecture; the model comes with all the basic peripheral circuits such as decoders, sense amplifiers, wire models and SRAM/DRAM cells. CACTI models both Uniform Cache Access (UCA) and Non-Uniform Cache Access (NUCA) using SRAM and DRAM, for which it can compute the delay, power and area. The tool uses the ITRS model parameters for all the devices used in the analysis.

4.2 Implementation

The IGZO-DRAM has a memory cell structure similar to DRAM, i.e., each memory cell simply contains one storage element and one nMOS transistor for access control. Moreover, since both IGZO-DRAM and DRAM cells have a three-terminal cell interface, the IGZO-DRAM cell array has a structure very similar to that of a DRAM cell array. Therefore, we modified the CACTI DRAM model to obtain the IGZO-DRAM cache model. The peripheral circuits such as the decoders, sense amps and comparators remain the same.

Figure 4-1: Cache model used in CACTI [14]

In order to accommodate the new IGZO-DRAM in the CACTI tool, the technology parameters of the access transistor of the regular DRAM are updated first. Table 4.1 shows all the parameters that are updated to match the transistor model of the IGZO device.

Then, the part of the chip area occupied by the peripheral circuits can be decreased by stacking the memory cell array on the peripheral circuits, resulting in a negative area overhead. As a result, $C_{bitline}$ is scaled by a factor of $\frac{1}{5}$ [1], resulting in a faster access time. The capacitance of the bitline of the IGZO-DRAM is given by the following equation [14]:

$$C_{bitline} = \frac{1}{5} \cdot \frac{N_{subarr-rows}}{2} \, C_{drain-cap-acc-transistor} + N_{subarr-rows} \cdot C_{bit-metal}$$

where $C_{bitline}$ is the capacitance of the bitline, $N_{subarr-rows}$ is the number of subarray rows, $C_{drain-cap-acc-transistor}$ is the drain capacitance of the IGZO TFT and $C_{bit-metal}$ is the capacitance of the metal.

Lastly, the refresh interval for the cell is changed from the default period of 64 ms to 4.5 days, based on our HSPICE simulations. The tool was used to evaluate the access time and cycle time of the IGZO-DRAM used as an L2/L3 last-level cache replacement. The retention period of the IGZO-DRAM is given by [14]:

$$T_{retention} = \frac{C_{dram-cell} \cdot \Delta V_{cell-worst}}{I_{worst-leak}}$$

$$T_{refresh} = 0.9 \cdot T_{retention}$$
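The bitline-capacitance and refresh-interval expressions above can be sketched as small helper functions. This is a minimal illustration and not CACTI source code: the function names are hypothetical, the numeric inputs in the demo are placeholders, and the $\frac{1}{5}$ factor is applied exactly as the equation is written (to the drain-capacitance term), since the text does not state whether it is meant to scale the whole sum.

```python
# Sketch of the bitline-capacitance and refresh-interval expressions.
# All numeric inputs in the demo are placeholders, not CACTI defaults.


def bitline_capacitance(n_rows, c_drain, c_bit_metal, stack_factor=1 / 5):
    """C_bitline as written in the text: the stacking factor multiplies the
    drain-capacitance term; the metal term scales with the number of rows."""
    return stack_factor * (n_rows / 2) * c_drain + n_rows * c_bit_metal


def retention_time(c_cell, delta_v_worst, i_leak_worst):
    """T_retention = C_dram-cell * dV_cell-worst / I_worst-leak (seconds)."""
    return c_cell * delta_v_worst / i_leak_worst


def refresh_interval(t_retention, guard=0.9):
    """T_refresh = 0.9 * T_retention: refresh slightly before data is lost."""
    return guard * t_retention


if __name__ == "__main__":
    # Hypothetical subarray: 256 rows, 0.1 fF drain cap, 0.05 fF metal per row
    c_bl = bitline_capacitance(256, 0.1e-15, 0.05e-15)
    print(f"C_bitline = {c_bl * 1e15:.2f} fF")
    # Hypothetical cell: 10 fF, 0.4 V worst-case droop, 1e-17 A leakage
    t_ret = retention_time(10e-15, 0.4, 1e-17)
    print(f"T_refresh = {refresh_interval(t_ret):.1f} s")
```

Keeping the stacking factor as a parameter makes it easy to compare the stacked and unstacked cases by passing `stack_factor=1`.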

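| Parameter | Value |
|---|---|
| $C_{gideal}$ (F/µm) | 1.73E-12 |
| $C_{fringe}$ (F/µm) | 1.00E-16 |
| $C_{junc}$ (F/µm²) | 1.00E-11 |
| $C_{junc\_sw}$ (F/µm²) | 2.50E-12 |
| $L_{phy}$ (µm) | 0.01 |
| $V_{dd}$ (V) | 0.8 |
| $V_{th}$ (V) | 0.4 |
| $I_{on}$ (A/µm) | 1.77E-06 |
| $I_{off}$ (A/µm) | 1.32E-19 |
| $C_{ox}$ (F/µm²) | 1.48E-14 |
| $t_{ox}$ (µm) | 0.002 |
| $C_{dram\_cell}$ (F) | 10E-15 |

Table 4.1: IGZO-FET parameters for CACTI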

Table 4.2 shows the configuration of the memory structure used for the simulations. Since the area of the IGZO-DRAM cell is smaller than that of an SRAM cell, the memory cell array occupies a smaller footprint. Therefore, we can have an IGZO-DRAM cache memory of the same capacity in a smaller area, or a higher-capacity memory in the same area.

| Parameter | Value |
|---|---|
| Memory size | 2 MB |
| Block size | 64 bytes |
| Associativity | 8/0 |
| Page size | 8192 bytes |
| I/O bus width | 512 bytes |
| Technology | 32 nm |
| Cache level | L2 |
| Cache model | UCA |

Table 4.2: Cache organization parameters

The data array and the tag array were modified to reflect the new technology. For comparison among the different memory types, the structure of the memory remains the same.

4.3 Results

We integrated the IGZO-DRAM technology into the CACTI tool, ran simulations across all the traditional types of memory used as last-level cache, and compared those with our IGZO-DRAM in Table 4.3. From the simulated results we observe that the access and cycle times of the IGZO-DRAM are comparable to those of SRAM, with significant savings in leakage power over SRAM.

| Cache Parameters | SRAM | LP-DRAM | COMM-DRAM | IGZO-DRAM |
|---|---|---|---|---|
| Access time (ns) | 1.29 | 2.474 | 4.46 | 1.57 |
| Cycle time (ns) | 1.34 | 4.056 | 9.71 | 2.99 |
| Total dynamic read energy per access (nJ) | 0.572 | 0.432 | 0.424 | 0.607 |
| Total leakage power (mW) | 675.40 | 70.075 | 26.36 | 37.20 |

Table 4.3: Timing and power for different memory types

Then we simulated the IGZO-DRAM as main memory and compared the read/write energy and refresh power with commodity DRAM.

| Cache Parameters | COMM-DRAM | IGZO-DRAM |
|---|---|---|
| Read energy (nJ) | 0.588 | 0.288 |
| Write energy (nJ) | 0.585 | 0.269 |
| Refresh power (mW) | 0.031 | 0.01 |

Table 4.4: Timing and refresh power

From the cache structure study using the IGZO-DRAM, we see that the new technology is a viable candidate for last-level cache memory. The IGZO-DRAM is 21% slower than SRAM, but there are major savings in leakage power of around 94%. Because the IGZO-DRAM occupies less area than an SRAM cache, we can have a larger cache size with a high packing density.
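The two headline percentages can be derived from the Table 4.3 entries with simple ratios. The sketch below is only an arithmetic check; the dictionary layout and function names are illustrative, and the values are copied from the table.

```python
# Sketch: derive the headline percentages from the Table 4.3 entries.
# Dictionary keys are illustrative names, values are copied from the table.

TABLE_4_3 = {
    "SRAM":      {"access_ns": 1.29, "cycle_ns": 1.34, "leak_mw": 675.40},
    "LP-DRAM":   {"access_ns": 2.474, "cycle_ns": 4.056, "leak_mw": 70.075},
    "COMM-DRAM": {"access_ns": 4.46, "cycle_ns": 9.71, "leak_mw": 26.36},
    "IGZO-DRAM": {"access_ns": 1.57, "cycle_ns": 2.99, "leak_mw": 37.20},
}


def slowdown_pct(mem, ref="SRAM", key="access_ns"):
    """Percentage by which `mem` is slower than `ref` for the given metric."""
    return 100.0 * (TABLE_4_3[mem][key] / TABLE_4_3[ref][key] - 1.0)


def leakage_saving_pct(mem, ref="SRAM"):
    """Percentage reduction in leakage power of `mem` relative to `ref`."""
    return 100.0 * (1.0 - TABLE_4_3[mem]["leak_mw"] / TABLE_4_3[ref]["leak_mw"])


if __name__ == "__main__":
    print(f"access-time slowdown: {slowdown_pct('IGZO-DRAM'):.1f}%")
    print(f"leakage saving:       {leakage_saving_pct('IGZO-DRAM'):.1f}%")
```

The access-time ratio gives roughly 21.7% and the leakage ratio roughly 94.5%, matching the figures quoted above. Applying the same helper with `key="cycle_ns"` shows a much larger gap in cycle time, so the 21% figure refers to access time specifically.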

Chapter 5

Multi-core Full-system simulation

5.1 GEM5

The gem5 [3] simulator is an open-source tool for computer-system architecture simulation, encompassing system-level architecture as well as processor microarchitecture. It supports most of the commercial instruction set architectures and diverse CPU models. The gem5 simulator provides a flexible, modular simulation system that is capable of evaluating a broad range of systems. This infrastructure provides flexibility by offering a diverse set of CPU models, system execution modes and memory system models.

A full-system simulation provides detailed performance metrics of an entire system by executing both user-level and kernel-level instructions, and models a complete system including the OS and devices. In full-system mode, gem5 simulates all of the hardware from the CPU to the I/O devices. Fig. 5-1 shows a full-system structure for gem5 simulation. The full-system config file can be modified as per user specifications.

5.2 Implementation

The gem5 simulator was used for full-system simulation of an ARM-based architecture running Android for mobile systems [9]. As modern mobile systems adopt high-performance multi-core processors, users suffer from limited battery life. Heavy data movement between CPUs and memories contributes significantly to the power consumption of a mobile device. We used gem5 to simulate a full-system multi-core platform that runs Android with a Linux kernel on the ARM architecture.

Figure 5-1: A system configuration with a two-level cache hierarchy [8]

First, a full system is configured in a config file to replicate a mobile system [9], as shown in Fig. 5-2. The Linux kernel (3.14) was cross-compiled for the ARM instruction set architecture to be used as the platform for running Android, and an Android KitKat 4.4 image was created to be mounted on the full system [7]. We then created a script that passes the system options shown in Table 5.1 to the full-system config file along with the benchmark. The script runs the system in the gem5 environment, booting Android on the Linux kernel, and generates the required output files. The output files consist of stats.txt and config.ini. The stats file shows the simulated time for the system, the number of instructions committed by the CPU, the instruction rate and other timing-related information. The config file contains a list of all the objects created for the full-system simulation along with their values.
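To illustrate how the stats file can be consumed, the sketch below reads scalar entries from a stats.txt dump. It assumes the usual line layout of a name, a value and a `#` comment; the stat names `sim_seconds`, `sim_insts` and `host_inst_rate` follow gem5 releases of that period and may differ in other versions, and the sample values are made up.

```python
# Sketch: read scalar statistics from a gem5 stats.txt dump.
# The sample text and its values are made up for illustration.
import re

# A stats line looks like: "<name>  <value>  # <description>"
_STAT_LINE = re.compile(r"^(\S+)\s+(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s")


def parse_stats(text):
    """Return {stat_name: float} for every scalar line in the dump."""
    stats = {}
    for line in text.splitlines():
        match = _STAT_LINE.match(line)
        if match:
            stats[match.group(1)] = float(match.group(2))
    return stats


SAMPLE = """\
---------- Begin Simulation Statistics ----------
sim_seconds                                  0.250000   # Number of seconds simulated
sim_insts                                   120000000   # Number of instructions simulated
host_inst_rate                                 450000   # Simulator instruction rate (inst/s)
---------- End Simulation Statistics   ----------
"""

if __name__ == "__main__":
    stats = parse_stats(SAMPLE)
    ips = stats["sim_insts"] / stats["sim_seconds"]
    print(f"simulated instructions per second: {ips:.3e}")
```

Lines whose second field is not a plain number (for example the begin and end markers) are skipped, so the same function can be pointed at a full dump without preprocessing.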

Figure 5-2: Mobile full-system structure [9]

Next we ran a simulation with the traditional SRAM as L2 cache memory and generated the stats file for it. Then to implement the IGZO-DRAM, we updated the power models and latency in the cache config files and ran the simulation to generate the stats file with IGZO-DRAM as L2 cache memory.

5.3 Benchmark

There are several open-source benchmarks for processors and memories that report power and performance. For our design, the main objective was to study the power figures of the system and memory while running the benchmark and then during the period after the sleep signal. We used BBench, an automated benchmark that tests a browser’s page rendering performance. It comprises a sequence of snapshots of a varied selection of the most popular sites on the web. Because its goal is to test only rendering (as opposed to network) performance while minimizing run-to-run variance, BBench renders pages offline or from a local source. Running the benchmark on a mobile-based full system gives us the approximate performance of a real-world mobile device. We modified the benchmark as per our requirements: it first runs a long sleep interval, which exposes the advantage of using the IGZO-DRAM over SRAM, and then runs the web-based code.

| Component | Configuration |
|---|---|
| Platform | Android KitKat 4.4, Linux kernel 3.14 |
| Core | 1 GHz quad core CMP, ARM ISAs, out-of-order, tournament branch predictor, 4096 BTB entries, 64 reorder buffer, 32 fetch queue |
| L1 I/D Cache | 16kB, 4-way associativity, private, 64B cache line, LRU, 4 mshrs, 6 hit latency |
| L2 Cache | 2MB, 8-way associativity, shared, 64B cache line, random, 20 mshrs, 44 hit latency |
| DRAM | LPDDR3 800MHz, 2GB, Dual-channel, 32-bit bus width, 8 banks per rank, 4kB row buffer size, 1 rank per channel |
| HPM | 32nm technology |

Table 5.1: Mobile full-system configuration [9]

5.4 McPAT

McPAT [6] (Multicore Power, Area, and Timing) is an integrated power, area and timing modeling framework for multithreaded, multicore and manycore architectures. It lets architects use new metrics that combine performance with both power and area, which are useful for evaluating the cost of new architectural ideas. It models all three types of power dissipation (dynamic, static and short-circuit) to give a complete view of the power envelope of multicore processors. McPAT uses technology projections from ITRS for power dissipation. The tool gives the power and area for all the components of a full system. Fig. 5-3 shows the block diagram of the McPAT framework. It uses an XML-based interface with the performance simulator. The XML file contains all the low-level configuration details and the high-level architectural parameters.
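For reference, the three power components named above are commonly written in first-order textbook form as follows. This is a generic sketch and not McPAT's internal model; $\alpha$ (activity factor), $C_L$ (switched capacitance), $f$ (clock frequency), $I_{leak}$ (leakage current) and $\overline{I_{sc}}$ (mean short-circuit current) are symbols introduced here for illustration.

```latex
\begin{align}
P_{dynamic} &= \alpha \, C_L \, V_{dd}^{2} \, f \\
P_{static} &= I_{leak} \, V_{dd} \\
P_{short\text{-}circuit} &= \overline{I_{sc}} \, V_{dd} \\
P_{total} &= P_{dynamic} + P_{static} + P_{short\text{-}circuit}
\end{align}
```

Since the IGZO access transistor mainly lowers $I_{leak}$, the static term is where the savings of an IGZO-DRAM cache would be expected to appear.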

Figure 5-3: Block diagram of the McPAT framework [6].

5.5 Implementation

The stats.txt and config.ini files from the gem5 simulator, which contain time-based and system-level information, are used as the input files for the McPAT framework. We used a parser [2] to convert the stats and config files to an XML file. To implement the IGZO-DRAM in the L2 cache memory, we modified the technology parameters file in McPAT, similar to what we did for CACTI. First we ran the simulation for the sleep period, generating output files for both the SRAM and the IGZO-DRAM cache memory. We then did the same for the active period while running the workloads.
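The conversion step can be pictured as filling a template XML with counters taken from the gem5 output. The sketch below is not the parser from [2]; it only illustrates the idea, assuming a McPAT-style layout of nested `component` elements holding `stat` entries with `name` and `value` attributes. The component ids, the gem5 stat keys in the mapping and all numbers are placeholders.

```python
# Sketch: copy a few gem5 counters into a McPAT-style XML description.
# Component ids, stat names and all numbers below are placeholders.
import xml.etree.ElementTree as ET

TEMPLATE = """\
<component id="root" name="root">
  <component id="system" name="system">
    <stat name="total_cycles" value="0"/>
    <component id="system.L20" name="L20">
      <stat name="read_accesses" value="0"/>
      <stat name="read_misses" value="0"/>
    </component>
  </component>
</component>
"""

# Hypothetical mapping: (McPAT component id, stat name) -> gem5 stat key
MAPPING = {
    ("system", "total_cycles"): "system.cpu.numCycles",
    ("system.L20", "read_accesses"): "system.l2.ReadReq_accesses::total",
    ("system.L20", "read_misses"): "system.l2.ReadReq_misses::total",
}


def fill_template(template_xml, gem5_stats, mapping=MAPPING):
    """Return the XML string with mapped <stat> values replaced."""
    root = ET.fromstring(template_xml)
    for component in root.iter("component"):
        for stat in component.findall("stat"):
            key = mapping.get((component.get("id"), stat.get("name")))
            if key in gem5_stats:
                stat.set("value", str(int(gem5_stats[key])))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    demo_stats = {
        "system.cpu.numCycles": 1000000,
        "system.l2.ReadReq_accesses::total": 52000,
        "system.l2.ReadReq_misses::total": 4100,
    }
    print(fill_template(TEMPLATE, demo_stats))
```

Stats without a mapping entry, or missing from the gem5 output, keep their template value, which makes gaps in the mapping easy to spot in the generated file.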

5.6 Power model analysis

We used McPAT to analyze the power related to the L2 cache from the gem5 stats and config files. The tool breaks down the power consumption of all the components in the full system.

Figure 5-4: Implementing to the McPAT framework.

Fig. 5-5 shows the power of the system components when the system is put to sleep. Here we see that during sleep mode, the IGZO-DRAM L2 cache memory has around 98% lower L2 leakage than the traditional SRAM cache memory.

Figure 5-5: Power components during sleep.

Fig. 5-6 shows the power of the system components when the system was running the workloads. Here we see that even during active mode, the IGZO-DRAM has around 81% lower leakage than the traditional SRAM.

Figure 5-6: Power components during active mode.

From the simulation results, it can be deduced that implementing IGZO-DRAM as the last-level cache in a full-system mobile architecture gave us around 90% lower overall L2 leakage power than SRAM memory.
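The percentages quoted in this section are relative reductions with respect to the SRAM baseline. The sketch below shows that calculation, together with one possible way to combine a sleep phase and an active phase by weighting each leakage power with its duration. The helper names and all numeric values are hypothetical placeholders, not results from this work, and the duration weighting is an assumption rather than the procedure used to obtain the figures above.

```python
def percent_reduction(baseline_w, candidate_w):
    """Relative reduction of candidate versus baseline, in percent."""
    if baseline_w <= 0:
        raise ValueError("baseline power must be positive")
    return 100.0 * (baseline_w - candidate_w) / baseline_w


def leakage_energy_j(phases):
    """Sum leakage energy over (power_in_watts, duration_in_seconds) phases."""
    return sum(power_w * duration_s for power_w, duration_s in phases)


# Hypothetical placeholder values, NOT results from this work.
# Each list holds (leakage power, duration) for the sleep and active phases.
sram_phases = [(0.040, 10.0), (0.050, 2.0)]
igzo_phases = [(0.004, 10.0), (0.010, 2.0)]

sram_energy = leakage_energy_j(sram_phases)
igzo_energy = leakage_energy_j(igzo_phases)
print(percent_reduction(sram_energy, igzo_energy))
```

Weighting by duration makes the combined figure depend on how long the device sleeps, which is why a long sleep interval favors the lower-leakage memory.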

Chapter 6

Conclusion

In this thesis, it can be seen that using a CAAC-IGZO FET, which has an extremely low off-state current and therefore very low leakage, as the access transistor in a 1T1C DRAM cell provides almost non-volatile operation of the IGZO-DRAM, given appropriate operating voltages and capacitor size. Due to its extremely low leakage, IGZO-DRAM is a viable candidate for the last-level cache memory in battery-operated devices, where power consumption is the primary metric of concern.

From the circuit-level simulations using HSPICE, it can be seen that the voltage range from -0.4 V to 0.8 V for the gate-under-drive of the IGZO-DRAM gives a retention period of 4.5 days, which is much longer than the 64 ms refresh interval of a traditional 1T1C DRAM. Furthermore, due to the low leakage of the CAAC-IGZO FET, a smaller capacitance can be used for the storage element.

From the cache-level simulations using CACTI, it can be seen that IGZO-DRAM as last-level cache is 21% slower than a traditional SRAM cache memory. The low latency is due to the folded structure of the IGZO-DRAM, which decreases the capacitance of the bitline and thereby the access time. In terms of power, IGZO-DRAM has around 94% lower leakage than SRAM.

From the system-level simulations using gem5 and McPAT, it can be seen that implementing IGZO-DRAM as L2 cache memory in a full system resulted in around 90% lower L2 leakage power and an overall positive power performance at the system level.
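To put the retention figure in perspective, the short calculation below compares the 4.5-day retention period with the 64 ms refresh interval quoted above. It is plain arithmetic on those two numbers; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

retention_s = 4.5 * SECONDS_PER_DAY   # 4.5-day retention period
refresh_interval_s = 64e-3            # 64 ms refresh interval

# How many conventional refresh intervals fit in one retention period.
ratio = retention_s / refresh_interval_s
# How many refreshes a 64 ms interval implies per day.
conventional_refreshes_per_day = SECONDS_PER_DAY / refresh_interval_s

print(f"retention / refresh interval = {ratio:.3g}")
print(f"refreshes per day at 64 ms = {conventional_refreshes_per_day:.3g}")
```

The retention period is roughly six million times longer than the conventional refresh interval, which is what makes the cell behave as almost non-volatile over a typical sleep interval.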

Chapter 7

Future Works

∙ We observed substantial power savings after implementing IGZO-DRAM as the last-level cache. This work was based on the device model provided by SAMSUNG. The next step would be to fabricate the IGZO-DRAM as a last-level cache. Doing so could provide large benefits in power consumption for battery-operated consumer devices such as smartphones, wearables, and health monitors.

∙ The power savings due to the IGZO-DRAM as last-level cache were obtained without implementing power-gating techniques. The next step would be to apply power-gating methodologies at the system level, which could further benefit the power performance.

Bibliography

[1] T. Atsumi, S. Nagatsuka, H. Inoue, T. Onuki, T. Saito, Y. Ieda, Y. Okazaki, A. Isobe, Y. Shionoiri, K. Kato, T. Okuda, J. Koyama, and S. Yamazaki. DRAM using crystalline oxide semiconductor for access transistors and not requiring refresh for more than ten days. In 2012 4th IEEE International Memory Workshop, pages 1–4, May 2012.

[2] Ayymoose. gem5-mcpat-parser. https://github.com/Ayymoose/gem5-mcpat-parser.

[3] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. The gem5 simulator. SIGARCH Comput. Archit. News, 39(2):1–7, August 2011.

[4] Bruce Jacob, Spencer W. Ng, and David T. Wang. Memory Systems: Cache, DRAM, Disk. 2009.

[5] H. Inoue, T. Matsuzaki, S. Nagatsuka, Y. Okazaki, T. Sasaki, K. Noda, D. Matsubayashi, T. Ishizu, T. Onuki, A. Isobe, Y. Shionoiri, K. Kato, T. Okuda, J. Koyama, and S. Yamazaki. Nonvolatile memory with extremely low-leakage indium-gallium-zinc-oxide thin-film transistor. IEEE Journal of Solid-State Circuits, 47(9):2258–2265, Sep. 2012.

[6] Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, pages 469–480, New York, NY, USA, 2009. ACM.

[7] Jason Lowe-Power. Building Android KitKat for gem5. http://www.gem5.org/Android_KitKat.

[8] Jason Lowe-Power. gem5 Tutorial documentation. http://learning.gem5.org/book/part1/cache_config.html#advanced-config-fig.

[9] Minho Ju, Hyeonggyu Kim, and S. Kim. MofySim: A mobile full-system simulation framework for energy consumption and performance analysis. In 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 245–254, April 2016.

[10] D. Ohgarane, M. Konishi, K. Dairiki, M. Oota, T. Hirohashi, M. Takahashi, M. Tsubuku, S. Yamazaki, Y. Kanzaki, H. Matsukizono, S. Kaneko, S. Mori, and T. Matsuo. Crystallography of In-Ga-Zn-O thin film having CAAC structure. In 2013 Twentieth International Workshop on Active-Matrix Flatpanel Displays and Devices (AM-FPD), pages 235–238, July 2013.

[11] T. Onuki, W. Uesugi, H. Tamura, A. Isobe, Y. Ando, S. Okamoto, K. Kato, T. R. Yew, , J. Y. Wu, , , J. Myers, K. Doppler, M. Fujita, and S. Yamazaki. Embedded memory and ARM Cortex-M0 core using 60-nm C-axis aligned crystalline indium-gallium-zinc oxide FET integrated with 65-nm Si CMOS. In 2016 IEEE Symposium on VLSI Circuits (VLSI-Circuits), pages 1–2, June 2016.

[12] JEDEC Standard. DDR2 SDRAM Specification. 2009.

[13] H. Tamura, K. Kato, T. Ishizu, W. Uesugi, A. Isobe, N. Tsutsui, Y. Suzuki, Y. Okazaki, Y. Maehashi, J. Koyama, Y. Yamamoto, S. Yamazaki, M. Fujita, J. Myers, and P. Korpinen. Embedded SRAM and Cortex-M0 core using a 60-nm crystalline oxide semiconductor. IEEE Micro, 34(6):42–53, Nov 2014.

[14] S. J. E. Wilton and N. P. Jouppi. CACTI: An enhanced cache access and cycle time model. IEEE Journal of Solid-State Circuits, 31(5):677–688, May 1996.
