micromachines

Article Retention-Aware DRAM Auto-Refresh Scheme for Energy and Performance Efficiency

Wei-Kai Cheng * , Po-Yuan Shen and Xin-Lun Li Department of Information and Computer Engineering, Chung Yuan Christian University, Taoyuan 32023, Taiwan * Correspondence: [email protected]; Tel.: +886-3265-4714

 Received: 30 July 2019; Accepted: 6 September 2019; Published: 8 September 2019 

Abstract: Dynamic random access memory (DRAM) circuits require periodic refresh operations to prevent data loss. As DRAM density increases, DRAM refresh overhead is even worse due to the increase of the refresh cycle time. However, because of few the cells in memory that have lower retention time, DRAM has to raise the refresh frequency to keep the data integrity, and hence produce unnecessary refreshes for the other normal cells, which results in a large refresh energy and performance delay of memory access. In this paper, we propose an integration scheme for DRAM refresh based on the retention-aware auto-refresh (RAAR) method and 2x granularity auto-refresh simultaneously. We also explain the corresponding modification need on memory controllers to support the proposed integration refresh scheme. With the given profile of weak cells distribution in memory banks, our integration scheme can choose the most appropriate refresh technique in each refresh time. Experimental results on different refresh cycle times show that the retention-aware refresh scheme can properly improve the system performance and have a great reduction in refresh energy. Especially when the number of weak cells increased due to the thermal effect of 3D-stacked architecture, our methodology still keeps the same performance and energy efficiency.

Keywords: DRAM refresh; retention time; refresh interval; refresh cycle time; auto-refresh

1. Introduction Dynamic random access memory (DRAM) is widely used in electronic devices. DRAM has a high performance and low cost, and a DRAM cell is composed of an access and one leaky , each DRAM cell stores one-bit of data as an electrical charge in a capacitor. DRAM cells leak charge over time, causing stored data lost. To prevent this situation, DRAM cells require recharge periodically. These recharge operations are also known as DRAM refresh. There are two important parameters for DRAM refresh, refresh interval (tREFI) and refresh cycle time (tRFC). The sends a refresh command to DRAM devices every tREFI (7.8 us as usual), and the duration of a refresh command is referred to as tRFC. However, the value of tRFC worsens when the DRAM capacity increases. As described in Table1[ 1], we calculate refresh overhead by tRFC/tREFI. This value is 890 ns/7.8 us in a 32 Gb DRAM, it takes about 12.49% of refresh time to refresh all its cells. The larger the memory size, it will have more serious performance degradation and energy consumption. In the JEDEC standard [2], DRAM cells are refreshed every 64 ms at normal temperature (<85 ◦C) and 32 ms at high temperature (>85 ◦C). However, most of the DRAM cells have a longer retention time (tRET) than 64 ms. Research [3] indicated that in a 32 GB DDR3 DRAM device, only about 30 cells’ retention time is less than 128 ms, and only about 1000 cells require a refresh interval shorter than 256 ms. We define these low retention cells as weak cells, and a memory row that contains weak cells is defined as a weak row. It is obvious that a refresh of all rows every 64 ms is unnecessary since only a

Micromachines 2019, 10, 590; doi:10.3390/mi10090590 www.mdpi.com/journal/micromachines Micromachines 2019, 10, 590 2 of 19 few memory rows are weak rows. Because of the few low retention time cells, this fixed minimum refresh interval scheme leads to a large amount of unnecessary refresh power consumption and also degrades the service efficiency of memory requests [3].

Table 1. Refresh cycle time among different density.

Density tRFC 1 Gb 110 ns 2 Gb 160 ns 4 Gb 260 ns 8 Gb 350 ns 16 Gb 530 ns 32 Gb 890 ns

Because of the restriction of cell retention time, except the fixed refresh interval problem, another cause of a degrading memory access performance comes from the fact that DRAM refresh operation must have higher priority than memory request service. While in the refresh period, the memory request service of the rank being refreshed is forced to wait. In general, the memory controller issues an auto-refresh command at a fixed interval. It cannot ensure that the memory refresh and memory access request does not collide. Commodity DRAM device performs refresh operations at rank-level and leads to the idling of all banks in the rank during the refresh period. There has been some research on reducing the overhead of DRAM by retention-aware refresh or refresh scheduling [3–12]. Research [4] proposed a RAAR technique to reduce refresh energy and performance degradation, only part of refresh bundles need a refresh in each refresh interval, while other refresh bundles can skip at least 50% of refresh operations. Although this methodology could reduce a lot of redundant refreshes, because it chose a unique refresh scheme all over the RAAR period, it still wastes quite a number of refresh time on normal cells. Research [5] improved this phenomenon by integrating all the RAAR refresh schemes and chose the optimal refresh scheme for different refresh bundles to minimize refresh time and refresh energy. Research [6] integrated the RAAR scheme with the bank reordering technique to gather up weak rows in adjacent banks, and hence less refresh bundles need refresh in each refresh interval. As observed in research [7], a refreshing bundle that contains weak rows still contain lots of normal cells even if we choose the optimal refresh scheme for it. Therefore, research [8] proposed an integration scheme to integrate RAAR and built-in self-repair (BISR) technique, selected weak rows are repaired with the BISR technique to minimize refresh overhead of refresh bundles. Other related works to improve refresh efficiency are introduced below. Research [3] grouped DRAM rows into retention time bins and applied a different refresh rate to each bin. Research [7] presented a DRAM refresh technique that simultaneously leveraged bank-level and subarray-level concurrency to reduce the overhead of distributed refresh operations in the . A bundle of DRAM rows in a refresh operation was composed of two subgroups, both subgroups of DRAM rows were refreshed concurrently during a refresh command to reduce refresh cycle time. Research [9] proposed to issue per-bank refreshes to idle banks in an out-of-order manner instead of round-robin order. Research [10] applied a refresh interval longer than the conventional refresh interval (64 ms), and memory controller issued read operations to the weak rows every required refresh interval in order to retain the data in weak rows. Research [11] proposed an extension to the existing DRAM control register access protocol to include the internal refresh counter and introduced a new dummy refresh command hat skipped refresh operations and simply incremented the internal counter. Research [12] proposed a set of techniques to coordinate the scheduling of low power mode transitions and refresh commands such that most of the required refreshes were scheduled in the low power self-refresh mode. Research [13] mitigated refresh overhead by a caching scheme based on the fact that ranks in the same channel were refreshed in a staggered manner. Micromachines 2019, 10, x 3 of 20 transitions and refresh commands such that most of the required refreshes were scheduled in the low power self-refresh mode. Research [13] mitigated refresh overhead by a caching scheme based on the Micromachines 2019, 10, 590 3 of 19 fact that ranks in the same channel were refreshed in a staggered manner. In summary of the related works that addressed the RAAR technique, although research [5] had proposedIn summary an integration of the relatedrefresh worksscheme, that it did addressed not integrate the RAAR the 2x technique, granularity although auto-refresh research and [ 5hence] had proposedthe performance an integration improvement refresh was scheme, less itthan did 10%. not integrate Research the [6] 2x and granularity research auto-refresh[8] also reveal and the hence fact thethat performance although they improvement integrated wasRAAR less scheme than 10%. with Research bank ordering [6] and researchand BISR [8 technique,] also reveal respectively, the fact that althoughthere was theystill very integrated little performance RAAR scheme improvement with bank orderingachieved. and All BISRthese technique,were because respectively, that weak therecells wasonly stilloccupy very a little low performancepercentage of improvement all memory cells, achieved. if the All2x thesegranularity were becausewas not that integrated weak cells into only the occupyrefresh ascheme, low percentage all other of optimization all memory cells,techniques if the 2x can granularity only obtain was notlimited integrated reduction into on the refresh scheme,overhead. all For other other optimization related RAAR techniques related cantechniques only obtain [3,9–11], limited they reduction also did onnot refresh take into overhead. account Forthe other2x granularity related RAAR issue related in their techniques works. Based [3,9–11 on], theythese also observations, did not take in into this account paper, the we 2x propose granularity an issueintegration in their scheme works. for Based DRAM on refresh these observations,based on the retention-aware in this paper, we auto-refresh propose an (RAAR) integration method scheme and for2x granularity DRAM refresh auto-refresh based on simultaneously, the retention-aware and auto-refreshthe corresponding (RAAR) memory method controller and 2x granularity needs to auto-refreshsupport the simultaneously,proposed integration and the refresh corresponding scheme. With memory a given controller profile needs of weak to support cells distribution the proposed in integrationmemory banks, refresh our scheme. integration With scheme a given can profile choose of the weak most cells appropri distributionate refresh in memory technique banks, in each our integrationrefresh time. scheme can choose the most appropriate refresh technique in each refresh time. The rest of thisthis paperpaper isis organizedorganized asas follows.follows. Section 22 showsshows thethe backgroundbackground ofof DRAMDRAM organization, refresh interval, refresh cycle time,time, andand refreshrefresh command.command. Section 33 illustratesillustrates thethe motivation of our integration refresh scheme. In Section Section 44,, we propose the 2x granularity auto-refresh, integration refresh scheme, and correspondingcorresponding memory controller. Section 55 showsshows thethe evaluationevaluation results ofof ourour methodology methodology on on di differentfferent weak weak row ro percentagew percentage and and refresh refresh cycle cycle time. time. Finally, Finally, we draw we concludingdraw concluding remarks remarks in Section in Section6. 6.

2. Background The modern DRAM system is a hierarchical organizationorganization [9]. [9]. The hierarchies from high to low levels are rank and bank as shown in Figure 1 1.. EachEach channelchannel connectsconnects oneone oror moremore ranksranks toto aa memorymemory controller. Furthermore,Furthermore, banksbanks are are an an array array structure structure containing containing rows rows and and columns, columns, in whichin which each each cell storescell stores one one bit of bit data. of data. However, However, cells cells leak leak charge charge over over time time which which causes causes data data loss. loss.

Rank 1 Channel Rank 0 Memory Controller Bank Bank

Figure 1. DynamicDynamic random access memory (D (DRAM)RAM) hierarchical organization.

Multiple ranks sharing one channel is common in aa modernmodern memorymemory system.system. In a multi-rank system, allall ranks’ranks’ refreshrefresh operations operations are are staggered, staggered, there there is nois no overlap overlap among among each each rank’s rank’s refresh. refresh. The advantageThe advantage of a multi-rankof a multi-rank memory memory system system is that is while that onewhile rank one is rank refreshing, is refreshing, the other the ranks other can ranks still Micromachinesservecan still memory serve 2019 memory requests., 10, x requests. Figure2 showsFigure a2 staggeredshows a staggered refresh in refresh a four-rank in a four-rank memory system.memory system.4 of 20

tREFI

Rank0

Rank1

Rank2

Rank3

Memory Refresh access Figure 2. StaggeredStaggered auto-refresh in a four-rank system.

There are three major parts that contribute to the power consumption of DRAM, background power, active power, and refresh power. Background power comes from peripheral circuitry energy consumption. Low power mode disables some peripheral circuitry to reduce background power consumption. Active power is only consumed while servicing a memory request. To service a memory request, the target row must be first activated and precharged. Since a DRAM cell is composed of an access transistor and a leaky capacitor, the DRAM is required to be refreshed periodically and thereby dissipating its refresh power. According to our evaluation, refresh power accounts up to 25%–27% of the total energy consumption in a 32 GB DRAM device, and this percentage will even increase as the DRAM size increased in the future [3]. Therefore, refresh power reduction becomes a crucial issue in the future. Refreshing all rows at the same time leads to long refresh latency. Therefore, to avoid long refresh latency, the total rows in a bank are divided into 8K groups [13]. It is evenly distributing the refresh latency among runtime. That is to say, 8K auto-refresh commands are issued to refresh a subset of rows in 64 ms interval. As the example shown in Figure 3, 8K auto-refresh commands are issued every 64 ms/8K = 7.8 us.

tREFT = 64ms

REF0 REF1 REF2 REF8191

tREFI = 7.8us tREFI = 7.8us

Figure 3. All rows are refreshed by 8K refresh commands.

Commodity DRAM device supports all-bank auto-refresh. All banks are unavailable for tRFC cycle times while all-bank auto-refresh commands are issued. In contrast, per-bank auto-refresh [14] is also a supported refresh technique in a standard DRAM device. Per-bank refresh divides one all- bank refresh into eight per-bank refreshes. Per-bank refreshes only refresh one bank at a time. In other words, only one bank is idle for refresh and the other banks can still serve memory requests. The advantage of per-bank refresh is that it is more efficient for one bank refresh than all-bank refresh due to shorter tRFC. Figure 4 illustrates the comparisons of all-bank refresh and per-bank refresh, tRFCpb is about 2.3x shorter than tRFCab.

Micromachines 2019, 10, x 4 of 20

tREFI

Rank0

Rank1

Rank2

Rank3 Micromachines 2019, 10, 590 4 of 19 Memory Refresh access There are three major partsFigure that 2. Staggered contribute auto-refresh to the in power a four-rank consumption system. of DRAM, background power, active power, and refresh power. Background power comes from peripheral circuitry energy consumption.There Loware three power major mode parts disables that contribute some to peripheral the power circuitryconsumption to reduceof DRAM, background background power consumption.power, active Active power, power and is refresh only consumedpower. Backgrou whilend servicing power comes a memory from peripheral request. Tocircuitry service energy a memory consumption. Low power mode disables some peripheral circuitry to reduce background power request, the target row must be first activated and precharged. Since a DRAM cell is composed of an consumption. Active power is only consumed while servicing a memory request. To service a accessmemory transistor request, and a the leaky target capacitor, row must the be DRAM first activated is required and to precharged. be refreshed Since periodically a DRAM andcell therebyis dissipatingcomposed its refresh of an access power. transistor According and toa leaky our evaluation, capacitor, the refresh DRAM power is required accounts to be up refreshed to 25–27% of the totalperiodically energy consumption and thereby dissipating in a 32 GB its DRAM refresh device,power. According and this percentageto our evaluation, will even refresh increase power as the DRAMaccounts size increased up to 25%–27% in the future of the [ 3total]. Therefore, energy consumption refresh power in a reduction 32 GB DRAM becomes device, a crucial and this issue in the future.percentage will even increase as the DRAM size increased in the future [3]. Therefore, refresh power Refreshingreduction becomes all rows a atcrucial the same issue timein the leads future. to long refresh latency. Therefore, to avoid long refresh Refreshing all rows at the same time leads to long refresh latency. Therefore, to avoid long latency, the total rows in a bank are divided into 8K groups [13]. It is evenly distributing the refresh refresh latency, the total rows in a bank are divided into 8K groups [13]. It is evenly distributing the latencyrefresh among latency runtime. among That runtime. is to That say, 8Kis to auto-refresh say, 8K auto-refresh commands commands are issued are issued to refresh to refresh a subset a of rows insubset 64 ms of interval.rows in 64 As ms the interval. example As the shown exampl in Figuree shown3, in 8K Figure auto-refresh 3, 8K auto-refresh commands commands are issued are every 64 ms/issued8K = 7.8every us. 64 ms/8K = 7.8 us.

tREFT = 64ms

REF0 REF1 REF2 REF8191

tREFI = 7.8us tREFI = 7.8us

FigureFigure 3. All3. All rows rows are are refreshed refreshed by by 8K 8K refresh refresh commands. commands.

CommodityCommodity DRAM DRAM device device supports supports all-bank all-bank auto-refresh. All All banks banks are areunavailable unavailable for tRFC for tRFC cycle timescycle times while while all-bank all-bank auto-refresh auto-refresh commands commands are are issued. issued. In contrast, contrast, per-bank per-bank auto-refresh auto-refresh [14] [14] is also ais supported also a supported refresh refresh technique technique in astandard in a standa DRAMrd DRAM device. device. Per-bank Per-bank refresh refresh dividesdivides one one all- all-bank bank refresh into eight per-bank refreshes. Per-bank refreshes only refresh one bank at a time. In refresh into eight per-bank refreshes. Per-bank refreshes only refresh one bank at a time. In other other words, only one bank is idle for refresh and the other banks can still serve memory requests. words,The only advantage one bank of per-bank is idle forrefresh refresh is that and it is themore other efficient banks for one can bank still refresh serve than memory all-bank requests. refresh The advantagedue to of shorter per-bank tRFC. refresh Figure is 4 thatillustrates it is more the comparisons efficient for of one all-bank bank refresh and than per-bank all-bank refresh, refresh due to shortertRFCpb tRFC. is about Figure 2.3x4 illustratesshorter than the tRFCab. comparisons of all-bank refresh and per-bank refresh, tRFCpb is aboutMicromachines 2.3x shorter 2019, 10 than, x tRFCab. 5 of 20

tRFCab Bank0 REF Bank1 REF

Bank7 REF (a) an all-bank refresh command is issued

tRFCpb

Bank0 REF Bank1 REF

Bank7 REF (b) 8 per-bank refresh commands are issued FigureFigure 4. Comparisons 4. Comparisons of of anan all-bankall-bank re refreshfresh and and per-bank per-bank refresh. refresh.

As describedAs described previously, previously, the the total total number number ofof rowsrows in in a a bank bank is isdivided divided into into 8K groups 8K groups for refresh for refresh operations.operations. In addition In addition to dividingto dividing rows rows into into 8K8K groups, JEDEC JEDEC DDR4 DDR4 standard standard [15] [also15] support also support 2x 2x granularity auto-refresh. The 2x granularity auto-refresh scheme divides the total number of rows in granularity auto-refresh. The 2x granularity auto-refresh scheme divides the total number of rows a bank into 16K groups. Therefore, only half of the rows are refreshed per auto-refresh command. in a bank into 16K groups. Therefore, only half of the rows are refreshed per auto-refresh command. Figure 5a shows a bank composed of 32K rows. The 32K rows are divided into 8K groups by default. FigureTherefore,5a shows four a bank rows composed are refreshed of in 32K a bank rows. per The auto-refresh 32K rows command. are divided Figure into 5b shows 8K groups an example by default. of 2x granularity auto-refresh. It divides the total number of rows in a bank into 16K groups and only two rows are refreshed per refresh operation.

Row 32767 REF16383 Row 32767 REF8191 #4rows #2rows

Row 3 REF1 Row 3 #2rows REF0 Row 2 Row 2 #4rows Row 1 REF0 Row 1 Row 0 #2rows Row 0 Bank Bank (a) 1x auto refresh (b) 2x auto refresh

Figure 5. Illustrations of (a) 1x and (b) 2x auto-refresh.

3. Motivation and Example In a DRAM refresh, a refresh command includes 8192 refresh times in 64 ms of a refresh interval. Although RAAR reduces a large number of unnecessary refreshes, some banks that do not contain weak rows still have to be refreshed because of all-bank refresh mode. Also, although 2x granularity reduces refresh interval to one half of that in 1x granularity, its refresh cycle time exceeds one half of that in 1x granularity. For example of the 8 Gb DRAM, for 1x granularity its refresh interval is 7.8 us and refresh cycle time is 350 ns, and for 2x granularity its refresh interval is 3.9us and refresh cycle time is 260 ns. Two 2x granularity refresh requires 520 ns while one 1x granularity refresh only require 350 ns. Therefore, a retention-aware refresh scheme to corporate with 2x granularity refresh is necessary. For the RAAR technique proposed by research [4], only refresh bundles that contain weak rows need a refresh in each refresh interval. However, weak rows distribution between refresh bundles may differ dramatically. If the same refresh scheme is used all over the RAAR period, there still will be quite number of refresh time wasted due to extra normal cells refreshing or bad weak rows refreshing sequentially. For the example illustrated in Figure 6, we executed refresh commands with the same refresh scheme all over the refresh interval (refresh 8192 times in 64 ms). Figure 6a shows the results when

Micromachines 2019, 10, x 5 of 20

tRFCab Bank0 REF Bank1 REF

Bank7 REF (a) an all-bank refresh command is issued

tRFCpb

Bank0 REF Bank1 REF

Bank7 REF (b) 8 per-bank refresh commands are issued Figure 4. Comparisons of an all-bank refresh and per-bank refresh.

As described previously, the total number of rows in a bank is divided into 8K groups for refresh operations. In addition to dividing rows into 8K groups, JEDEC DDR4 standard [15] also support 2x granularityMicromachines 2019auto-refresh., 10, 590 The 2x granularity auto-refresh scheme divides the total number of rows5 of in 19 a bank into 16K groups. Therefore, only half of the rows are refreshed per auto-refresh command. Figure 5a shows a bank composed of 32K rows. The 32K rows are divided into 8K groups by default. Therefore, four rows are refreshed in a bank per auto-refresh command. Figure 55bb showsshows anan exampleexample of 2x granularity auto-refresh. It divides the total number number of rows in a bank bank into into 16K 16K groups groups and and only only two rows are refreshed per refresh operation.operation.

Row 32767 REF16383 Row 32767 REF8191 #4rows #2rows

Row 3 REF1 Row 3 #2rows REF0 Row 2 Row 2 #4rows Row 1 REF0 Row 1 Row 0 #2rows Row 0 Bank Bank (a) 1x auto refresh (b) 2x auto refresh

Figure 5. IllustrationsIllustrations of of ( (aa)) 1x 1x and and ( (bb)) 2x 2x auto-refresh. auto-refresh.

3. Motivation Motivation and and Example Example In a a DRAM DRAM refresh, refresh, a a refresh refresh command command includes includes 81 819292 refresh refresh times times in in 64 64 ms ms of of a a refresh refresh interval. interval. Although RAAR reduces a large number of unnecessa unnecessaryry refreshes, some banks that do not contain weak rows still have to be refreshed because of all-bank refresh mode. Also, although 2x granularity reduces refresh interval to one half of that that in in 1x 1x gr granularity,anularity, its refresh cycle time exceeds one half of that in 1x granularity. For example of the 8 Gb DRAM, for for 1x 1x granularity granularity its its refresh refresh interval is 7.8 us and refresh cycle time is 350 ns, and for 2x granularity its refresh interval is 3.9us and refresh cycle time and refresh cycle time is 350 ns, and for 2x granularity its refresh interval is 3.9us and refresh cycle timeis 260 is ns. 260 Two ns. 2x Two granularity 2x granularity refresh requiresrefresh requires 520 ns while 520 onens while 1x granularity one 1x granularity refresh only refresh require only 350 requirens. Therefore, 350 ns. a Therefore, retention-aware a retention-aware refresh scheme refresh to corporate scheme to with corporate 2x granularity with 2x refreshgranularity is necessary. refresh is necessary.For the RAAR technique proposed by research [4], only refresh bundles that contain weak rows needFor a refresh the RAAR in each technique refresh proposed interval. by However, research weak [4], only rows refresh distribution bundles between that contain refresh weak bundles rows needmay dia ffrefresher dramatically. in each refresh If the interval. same refresh However, scheme weak is rows used distribution all over the RAARbetween period, refresh there bundles still maywill bediffer quite dramatically. number of refreshIf the same time refresh wasted scheme due to is extra used normal all over cells the refreshingRAAR period, or bad there weak still rows will berefreshing quite number sequentially. of refresh time wasted due to extra normal cells refreshing or bad weak rows refreshingFor the sequentially. example illustrated in Figure6, we executed refresh commands with the same refresh schemeFor allthe over example the refresh illustrated interval in Figure (refresh 6, 8192we ex timesecuted in 64refresh ms). commands Figure6a shows with thethe resultssame refresh when schemechoosing all the over all-bank the refresh refresh interval scheme, (refresh we see 8192 that itti performsmes in 64 badly ms). Figure in the first6a shows refresh the because results of when three unnecessary banks to be refreshed simultaneously. However, this scheme performs well for the second refresh since all banks contain weak rows. On the other hand, when we choose the per-bank refresh scheme as shown in Figure6b, although it is appropriate for the first refresh, it needs to refresh all the banks sequentially and causes a lot of waste of refresh time and refresh energy when it starts on the second refresh. In research [5], in addition to all-bank refresh and per-bank refresh, two new fine-granularity refresh schemes, namely half-bank refresh, and quarter-bank refresh were proposed to further reduce unnecessary refresh. Although this integration refresh scheme combined both the advantage of all-bank refresh and per-bank refresh, provided fine-granularity refresh modes selection to resolve the problem of too much unnecessary refresh in all-bank refresh and high refresh overhead in per-bank refresh, it did not integrate the 2x granularity auto-refresh and hence the performance improvement was less than 10%. As described at the beginning of this section, replacing the 1x granularity auto-refresh with 2x granularity auto-refresh directly without co-operating with the retention-aware refresh scheme is useless. In this paper, we propose a methodology to integrate 2x granularity auto-refresh and the retention-aware refresh scheme which can have significant performance improvement as shown in experimental results. Micromachines 2019, 10, x 6 of 20 choosing the all-bank refresh scheme, we see that it performs badly in the first refresh because of three unnecessary banks to be refreshed simultaneously. However, this scheme performs well for the second refresh since all banks contain weak rows. On the other hand, when we choose the per-bank refresh scheme as shown in Figure 6b, although it is appropriate for the first refresh, it needs to refresh allMicromachines the banks2019 sequentially, 10, 590 and causes a lot of waste of refresh time and refresh energy when it starts6 of 19 on the second refresh.

(a)

(b)

FigureFigure 6. 6. (a(a) )All-bank All-bank refresh refresh to to refresh refresh weak weak row; row; ( (bb)) per-bank per-bank refresh refresh to to refresh refresh weak row.

4. MethodologyIn research [5], in addition to all-bank refresh and per-bank refresh, two new fine-granularity refreshIn schemes, this section, namely we describe half-bank our refresh, approach and for quar retention-awareter-bank refresh refresh were optimization, proposed to further including reduce fine unnecessarygranularity RAAR, refresh. integration Although refreshthis integration scheme, andrefresh the scheme corresponding combined memory both the controller advantage to support of all- bankthe proposed refresh and integration per-bank refresh refresh, scheme. provided fine-granularity refresh modes selection to resolve the problem of too much unnecessary refresh in all-bank refresh and high refresh overhead in per-bank refresh,4.1. 2x Granularity it did not integrate Retention-Aware the 2x granularity Auto-Refresh auto-refresh (RAAR) and hence the performance improvement was lessSince than there 10%. are As large described amounts at the of rows beginning that are of unnecessarythis section, replacing to be refreshed the 1x ingranularity every minimal auto- refreshretention with time 2x refresh granularity interval auto-refresh (tREFT) (64 ms directly as usual), without RAAR co-operating profiles the DRAMwith the rows’ retention-aware retention time refreshand tags scheme the weak is useless. rows, non-weak In this paper, rows we were propose not refreshed a methodology every tREFI. to integrate The key 2x idea granularity of RAAR auto- is to refreshonly refresh and weakthe rowsretention-aware in a high frequency refresh and sche skipme refreshing which non-weakcan have rows.significant Therefore, performance refreshing improvementfewer rows per as auto-refreshshown in experimental command results. is better for RAAR. RAAR adopts the default auto-refresh and 2x granularity auto-refresh to reduce unnecessary rows being refreshed. Furthermore, to skip 4. Methodology refresh non-weak rows, we accessed and increased the value of the internal refresh counter without refreshingIn this rows. section, we describe our approach for retention-aware refresh optimization, including fine granularityFigure7 illustrates RAAR, how integration RAAR works. refresh There scheme, are sixteen and the rows corresponding in a bank and memory only four controller of them are to supportweak rows the (row0,proposed 10, integration 13, 14). We refresh assumed scheme. a 1x auto-refresh command that refreshes four rows. At first refresh (1), it refreshes row0–row3. However, only row0 is a weak row. Therefore, we use two 2x 4.1.auto-refresh 2x Granularity commands Retention-Aware to refresh theseAuto-Refresh four rows. (RAAR) Actually, only the first 2x auto-refresh command refreshSince row0 there and are row1 large because amounts of weak of rows row0. that The are second unnecessary 2x auto-refresh to be refr commandeshed in only every modifies minimal the retentionrefresh counter time refresh from two interval to three (tREFT) and does (64 notms incuras usual), any refreshRAAR overhead.profiles the As DRAM per the rows’ second retention refresh time(2) shown and tags in Figure the weak7, there rows, is nonon-weak weak row rows in row4–row7. were not refreshed We only every increased tREFI. the The refresh key idea counter of RAAR by 1x auto-refresh without any refresh overhead. The condition of the third refresh (3) was similar to the first refresh (1), only one weak row exits. Two 2x auto-refresh commands were issued but actually only refresh row10 and row11. In the fourth refresh (4), both the first half and second half contain a weak row. So, we refreshed these four rows as normal by one 1x auto-refresh. In this example, RAAR only refreshed eight rows rather than refreshing all sixteen rows. Micromachines 2019, 10, x 7 of 20

is to only refresh weak rows in a high frequency and skip refreshing non-weak rows. Therefore, refreshing fewer rows per auto-refresh command is better for RAAR. RAAR adopts the default auto- refresh and 2x granularity auto-refresh to reduce unnecessary rows being refreshed. Furthermore, to skip refresh non-weak rows, we accessed and increased the value of the internal refresh counter without refreshing rows. Figure 7 illustrates how RAAR works. There are sixteen rows in a bank and only four of them are weak rows (row0, 10, 13, 14). We assumed a 1x auto-refresh command that refreshes four rows. At first refresh (1), it refreshes row0–row3. However, only row0 is a weak row. Therefore, we use two 2x auto-refresh commands to refresh these four rows. Actually, only the first 2x auto-refresh command refresh row0 and row1 because of weak row0. The second 2x auto-refresh command only modifies the refresh counter from two to three and does not incur any refresh overhead. As per the second refresh (2) shown in Figure 7, there is no weak row in row4–row7. We only increased the refresh counter by 1x auto-refresh without any refresh overhead. The condition of the third refresh (3) was similar to the first refresh (1), only one weak row exits. Two 2x auto-refresh commands were issued but actually only refresh row10 and row11. In the fourth refresh (4), both the first half and Micromachinessecond2019 half, 10contain, 590 a weak row. So, we refreshed these four rows as normal by one 1x auto-refresh. 7 of 19 In this example, RAAR only refreshed eight rows rather than refreshing all sixteen rows.

FigureFigure 7. 7.Retention-aware Retention-aware auto-refresh. auto-refresh.

4.2. Integration4.2. Integration Refresh Refresh Scheme Scheme FigureFigure8 shows 8 shows the programthe program flow flow of of our our integration integration refresh scheme. scheme. After After the thetesting testing and and measurement,measurement, we firstlywe firstly produce produce the the profile profile ofof retentionretention time time under under different different thermal thermal environment environment and store the information of weak rows into a weak row buffer. A 64 ms refresh counter was used to and store the information of weak rows into a weak row buffer. A 64 ms refresh counter was used to issue the refresh command for weak rows. For each 64 ms interval, we selected target row sets by a issue thespecified refresh refresh command mode. When for weak the target rows. refresh For each row sets 64 ms contained interval, weak we rows, selected we then target decided row the sets by a specifiedoptimal refresh refresh mode. mode When and refreshed the target the refresh row sets. row Otherwise, sets contained the memory weak controller rows, we would then provide decided the optimalservice refresh for memory mode and requests refreshed as normal. therow sets. Otherwise, the memory controller would provide service for memory requests as normal. Micromachines 2019, 10, x 8 of 20

FigureFigure 8. 8.Proposed Proposed program program flow. flow.

The decision of refresh mode for a row set is based on its weak rows distribution, the respective tRFC of each refresh scheme, and the 2x granularity refresh method in JEDEC DDR4 standard [15]. Table 2 shows the tRFC ratio of different refresh modes used in our integration refresh scheme. Because per-bank refresh is already a fine granularity refresh scheme, 2x granularity auto-refresh is not applied to it. We calculated the refresh overhead of each refresh scheme for the target row set and chose the most appropriate refresh mode. Figure 9 shows an example of integration refresh scheme to mitigate an unnecessary refresh. In a refresh of the first-row set, we chose one half-bank refresh, which was better than a two quarter-bank refresh or three per-bank refresh. In the refresh of the second-row set, one per-bank refresh was the best one. In refresh of the third-row set, one quarter- bank refresh was better than two per-bank refresh. And in refresh of the fourth-row set, one all-bank refresh was the best, which was better than two half-bank refresh, four quarter-bank refresh, or five per-bank refresh.

Table 2. Refresh cycle time (tRFC) parameters definition and 2x granularity refresh check.

Refresh Mode tRFC 2x Granularity All-bank refresh K Y Half-bank refresh K/1.32 Y Quarter-bank refresh K/1.74 Y Per-bank refresh K/2.3 N

Micromachines 2019, 10, 590 8 of 19

The decision of refresh mode for a row set is based on its weak rows distribution, the respective tRFC of each refresh scheme, and the 2x granularity refresh method in JEDEC DDR4 standard [15]. Table2 shows the tRFC ratio of di fferent refresh modes used in our integration refresh scheme. Because per-bank refresh is already a fine granularity refresh scheme, 2x granularity auto-refresh is not applied to it. We calculated the refresh overhead of each refresh scheme for the target row set and chose the most appropriate refresh mode. Figure9 shows an example of integration refresh scheme to mitigate an unnecessary refresh. In a refresh of the first-row set, we chose one half-bank refresh, which was better than a two quarter-bank refresh or three per-bank refresh. In the refresh of the second-row set, one per-bank refresh was the best one. In refresh of the third-row set, one quarter-bank refresh was better than two per-bank refresh. And in refresh of the fourth-row set, one all-bank refresh was the best, which was better than two half-bank refresh, four quarter-bank refresh, or five per-bank refresh.

Table 2. Refresh cycle time (tRFC) parameters definition and 2x granularity refresh check.

Refresh Mode tRFC 2x Granularity All-bank refresh K Y Half-bank refresh K/1.32 Y Quarter-bank refresh K/1.74 Y Per-bank refresh K/2.3 N Micromachines 2019, 10, x 9 of 20

Figure 9. IntegrationIntegration refresh refresh scheme scheme in each refresh. 4.3. Memory Controller 4.3. Memory Controller To accomplish the integration refresh scheme, a little modification on the memory controller is To accomplish the integration refresh scheme, a little modification on the memory controller is required, which is also the piece of the overhead of our proposal. Figure 10 shows the modification of required, which is also the piece of the overhead of our proposal. Figure 10 shows the modification the memory controller to fit our integration refresh scheme. In the original memory controller, there is of the memory controller to fit our integration refresh scheme. In the original memory controller, a refresh countdown register to issue a refresh command according to the refresh period. When the there is a refresh countdown register to issue a refresh command according to the refresh period. register counts down to zero, the memory controller issues a refresh command. However, the memory When the register counts down to zero, the memory controller issues a refresh command. However, controller cannot specify the rows to be refreshed, which is determined by the row counter inside the the memory controller cannot specify the rows to be refreshed, which is determined by the row DRAM. Therefore, we added two extra registers for the need of our integration refresh scheme—a counter inside the DRAM. Therefore, we added two extra registers for the need of our integration refresh address register to monitor the refresh rows and a weak row register to record weak rows. Since refresh scheme—a refresh address register to monitor the refresh rows and a weak row register to the profile of weak cells distribution in memory banks are obtained after the testing and measurement record weak rows. Since the profile of weak cells distribution in memory banks are obtained after the analysis of the memory wafer, we do not need to detect weak rows on the fly, and only address the testing and measurement analysis of the memory wafer, we do not need to detect weak rows on the weak rows that are recorded in the weak row register, no bitmap for all rows nor a counter for each fly, and only address the weak rows that are recorded in the weak row register, no bitmap for all row is necessary. In addition, since the profiles are produced after the wafer measurement, we either rows nor a counter for each row is necessary. In addition, since the profiles are produced after the do not need to change refresh scheme on the fly, as refresh commands for different refresh bundles are wafer measurement, we either do not need to change refresh scheme on the fly, as refresh commands prepared based on our integration refresh scheme in advance. for different refresh bundles are prepared based on our integration refresh scheme in advance.

Memory controller

Refresh Countdown Issue refresh when RC is 0 DRAM Register

Refresh Weak Address Row Register Register

Figure 10. Modified memory controller.

Figure 11 shows the organization of a weak row register. For the example of MT41K1G8 [1], the memory architecture has two channels and two ranks, each rank has eight banks, and each bank has 64K rows. When the memory architecture is changed, bit organization of weak row register is also

Micromachines 2019, 10, x 9 of 20

Figure 9. Integration refresh scheme in each refresh.

4.3. Memory Controller To accomplish the integration refresh scheme, a little modification on the memory controller is required, which is also the piece of the overhead of our proposal. Figure 10 shows the modification of the memory controller to fit our integration refresh scheme. In the original memory controller, there is a refresh countdown register to issue a refresh command according to the refresh period. When the register counts down to zero, the memory controller issues a refresh command. However, the memory controller cannot specify the rows to be refreshed, which is determined by the row counter inside the DRAM. Therefore, we added two extra registers for the need of our integration refresh scheme—a refresh address register to monitor the refresh rows and a weak row register to record weak rows. Since the profile of weak cells distribution in memory banks are obtained after the testing and measurement analysis of the memory wafer, we do not need to detect weak rows on the fly, and only address the weak rows that are recorded in the weak row register, no bitmap for all rows nor a counter for each row is necessary. In addition, since the profiles are produced after the Micromachineswafer measurement,2019, 10, 590 we either do not need to change refresh scheme on the fly, as refresh commands 9 of 19 for different refresh bundles are prepared based on our integration refresh scheme in advance.

Memory controller

Refresh Countdown Issue refresh when RC is 0 DRAM Register

Refresh Weak Address Row Register Register

FigureFigure 10. 10.Modified Modified memory memory controller. controller.

FigureFigure 11 shows 11 shows the the organization organization of a a weak weak row row register. register. For the For example the example of MT41K1G8 of MT41K1G8 [1], the [1], the memorymemory architecture architecture has has two two channels channels and and two two ranks, ranks, each eachrank has rank eight has ba eightnks, and banks, each andbank each has bank 64K rows. When the memory architecture is changed, bit organization of weak row register is also has 64K rows. When the memory architecture is changed, bit organization of weak row register is Micromachines 2019, 10, x 10 of 20 also modified accordingly. Basically, the weak row percentage is less than 0.5% in normal, and even in a worsemodified environment accordingly. this Basically, percentage the weak is lessrow thanpercen 1%tage for is almostless than all 0.5% operating in normal, conditions and even in [3 a]. This meansworse that environment less than 13440 this percentage bit (640 register is less than entries) 1% for space almost is all required operating for conditions 32 Gb DRAM[3]. This means in standard conditions,that less and than 67200 13440 bitbit (640 (3200 register register entries) entries) space space is required is required for 32 Gb for DRAM the extreme in standard case conditions, of a 5% weak row percentageand 67200 whichbit (3200 has register probability entries) near space zero. is re Inquired addition, for the since extreme the weak case rowof a profile5% weak is producedrow percentage which has probability near zero. In addition, since the weak row profile is produced after after wafer measurement and we do not need to change it on the fly, we can use Mask ROM or Nor wafer measurement and we do not need to change it on the fly, we can use Mask ROM or Nor Flash Flash instead of of logic logic registers registers to to record record these these data, data, which which greatly greatly reduce reduce the overhead. the overhead. Therefore, Therefore, the the overheadoverhead is about is about 1 KB 1 ofKB Maskof Mask ROM ROM or or Nor Nor FlashFlash for most most operating operating conditions, conditions, and andless than less 8 than KB 8 KB of Maskof Mask ROM ROM or Nor or Nor Flash Flash in thein the extreme extreme case. case.

channel rank bank row 1bit 1bit 3bit 16bit 0 0 010 0011 1111 1111 0101

0 0 100 0011 0000 1111 0101

1 1 111 1111 1111 1001 0000

FigureFigure 11. 11.Organization Organization of of weak weak row row register. register.

FigureFigure 12 shows 12 shows the the refresh refresh modes modes implementation. implementation. In In addition addition to the to therefresh refresh command, command, we we addedadded refresh refresh mode mode flags flags on on the the ADDR ADDR field field toto identifyidentify which which refresh refresh mode mode waswas performed performed in the in the refresh. Also, the command decoder inside DRAM needs the corresponding modification to identify refresh. Also, the command decoder inside DRAM needs the corresponding modification to identify refresh mode, and a register to record the refresh banks. Because we do not need to detect weak rows refreshon mode, the fly, and refresh a register commands to record for different the refresh refresh banks. bundles Because were prepared we do in not advance need to and detect sent to weak the rows on thememory fly, refresh controller commands by the CPU for dibasedfferent on our refresh integration bundles refresh were scheme, prepared so only in advancevery little andoverhead sent to the memoryon memory controller controller by the is CPU required. based on our integration refresh scheme, so only very little overhead on memory controller is required.

CLK

CMD REF REF REF

ADDR FLAGall-b FLAGper-b FLAGpar-b

Figure 12. Refresh commands with different flags.

5. Experimental Results We integrated Gem5 [16] and DRAMSim2 [17] as our simulation environment, Figure 13 shows our evaluation flow. Gem5 provided the execution trace and system performance in terms of instructions-per-clock (IPC), and DRAMSim2 provided a detailed active power, refresh power, and background power. We applied eight evaluation benchmarks from SPEC 2006, including four memory-bound benchmarks (mcf, lib, sjeng, cact) and four CPU-bound benchmarks (cal, xal, sop, bzip2). We used the default CPU and memory architecture settings of Gem5 and DRAMSim2, DRAM specification was based on the Micron MT41K1G8 datasheet [1], and the system configurations are shown in Table 3. Our experiments only classify memory cells into two categories: weak cells and normal cells, because more levels of weakness just increase the complexity of the refresh control but

Micromachines 2019, 10, x 10 of 20

modified accordingly. Basically, the weak row percentage is less than 0.5% in normal, and even in a worse environment this percentage is less than 1% for almost all operating conditions [3]. This means that less than 13440 bit (640 register entries) space is required for 32 Gb DRAM in standard conditions, and 67200 bit (3200 register entries) space is required for the extreme case of a 5% weak row percentage which has probability near zero. In addition, since the weak row profile is produced after wafer measurement and we do not need to change it on the fly, we can use Mask ROM or Nor Flash instead of logic registers to record these data, which greatly reduce the overhead. Therefore, the overhead is about 1 KB of Mask ROM or Nor Flash for most operating conditions, and less than 8 KB of Mask ROM or Nor Flash in the extreme case.

channel rank bank row 1bit 1bit 3bit 16bit 0 0 010 0011 1111 1111 0101

0 0 100 0011 0000 1111 0101

1 1 111 1111 1111 1001 0000

Figure 11. Organization of weak row register.

Figure 12 shows the refresh modes implementation. In addition to the refresh command, we added refresh mode flags on the ADDR field to identify which refresh mode was performed in the refresh. Also, the command decoder inside DRAM needs the corresponding modification to identify refresh mode, and a register to record the refresh banks. Because we do not need to detect weak rows on the fly, refresh commands for different refresh bundles were prepared in advance and sent to the Micromachinesmemory controller2019, 10, 590by the CPU based on our integration refresh scheme, so only very little overhead10 of 19 on memory controller is required.

CLK

CMD REF REF REF

ADDR FLAGall-b FLAGper-b FLAGpar-b

Figure 12. Refresh commands with different different flags.flags.

5. Experimental Results We integratedintegrated Gem5 [[16]16] and DRAMSim2 [[17]17] as our simulation environment, Figure 13 shows our evaluation flow.flow. Gem5 Gem5 provided provided the the executio executionn trace trace and and system performance in terms of instructions-per-clock (IPC), and DRAMSim2 provided provided a a detailed detailed active active power, power, refresh power, andand background power. We We applied applied eight eight evaluation benchmarks from SPEC 2006, including four memory-bound benchmarks (mcf, lib lib,, sjeng, cact) and four CPU-bound benchmarks (cal, xal, sop, bzip2). We used the default CPU and memory arch architectureitecture settings of Gem5 and DRAMSim2, DRAM specificationspecification was based on the Micron MT41K1G8 datasheetdatasheet [[1],1], and the systemsystem configurationsconfigurations are shownMicromachines in Table 2019,3 103.., Our xOur experimentsexperiments onlyonly classifyclassify memorymemory cellscells intointo twotwo categories: categories: weakweak cellscells11 andofand 20 normal cells, because more levels of weakness just increase the complexity of the refresh control but almost has no further improvement in reducing refr refreshesh overhead. To To compare the effect effect of different different refresh cycle times on didifferentfferent DRAM capacities, Table 3 3 lists lists thethe tRFCtRFC ofof all-bankall-bank auto-refreshauto-refresh forfor 16 Gb and and 32 32 Gb Gb DRAM DRAM capacities, capacities, while while tRFC tRFC of per-bank of per-bank auto-refresh, auto-refresh, half-bank half-bank auto-refresh, auto-refresh, and andquarter-bank quarter-bank auto-refresh auto-refresh are calculated are calculated as the as theratio ratio setting setting in Table in Table 2. 2We. We take take the the default default 1x 1xgranularity granularity all-bank all-bank auto-refresh auto-refresh as asthe the evaluation evaluation baseline, baseline, and and compare compare four four refresh schemes including 2x granularity all-bank auto-refresh, 1x granularity per-bank auto-refresh, 2x granularity half-bank auto-refresh (partial bank refresh), and the proposed 2x granularity integration refresh scheme. In aa normalnormal operatingoperating environment,environment, the the weak weak row row percentage percentage is aboutis about 1% 1% or lessor less than than 1% 1% [3]. Since[3]. Since 3D-stacked 3D-stacked architecture architecture is theis the future future trend, trend, we we monitor monitor this this operating operating environment environment inin which weak rowsrows increasedincreased duedue to to thermal thermal e ffeffectect on on cells’ cells’ retention retention time. time. For For this this reason, reason, we we evaluate evaluate our our 2x granularity2x granularity integration integration refresh refresh scheme scheme under under higher higher weak weak row row percentage percentage (1–5%) (1%–5%) with randomwith random weak rowsweak distribution.rows distribution.

Benchmark (SPEC2006)

Processor System Simulation (Gem5 Simulator) Result

Memory system DRAM Simulation (DRAMSim2) Result

Figure 13. Evaluation flow. flow.

Table 3. System configurations.

Processor 1 Core, 3.2 GHz L1 Cache L1-I cache 128KB, L1-D cache 128KB L2 Cache L2 cache 4 MB DRAM datasheet Micron MT41K1G8, 32 Gb [1] Memory frequency 800 MHz Number of channels, ranks 1 channel, 4 ranks per channel Number of banks, rows 8 banks per rank, 64K rows per bank 530 ns for 16 Gb, 890 ns for 32 Gb tRFCab, weak row percentage 1%, 2%,3%, 5% (Radom)

Figures 14–17 show the IPC improvement of the four 2x granularity refresh schemes (except per- bank refresh) over the default 1x granularity all-bank refresh baseline on the 16 Gb DRAM with tRFCab equal to 530 ns, and weak row percentage is 1%, 2%, 3%, and 5%, respectively. For the four memory-bound benchmarks, per-bank refresh scheme has worse results on the lib and sjeng benchmarks in comparison with the baseline, while our integration refresh scheme all get the best results except for the cat benchmark when weak row percentage is 5%. On the mcf benchmark, integration refresh scheme achieves near 20% of IPC improvement, and at least achieves over 5% of improvement on the sjeng benchmark. For the four CPU-bound benchmarks, our approach still has the best results except for the cact and cal benchmarks when the weak row percentage is 5%, but all get less than 5% of improvement.

Micromachines 2019, 10, 590 11 of 19

Table 3. System configurations.

Processor 1 Core, 3.2 GHz L1 Cache L1-I cache 128 KB, L1-D cache 128 KB L2 Cache L2 cache 4 MB DRAM datasheet Micron MT41K1G8, 32 Gb [1] Memory frequency 800 MHz Number of channels, ranks 1 channel, 4 ranks per channel Number of banks, rows 8 banks per rank, 64K rows per bank 530 ns for 16 Gb, 890 ns for 32 Gb tRFC , weak row percentage ab 1%, 2%, 3%, 5% (Radom)

Figures 14–17 show the IPC improvement of the four 2x granularity refresh schemes (except per-bank refresh) over the default 1x granularity all-bank refresh baseline on the 16 Gb DRAM with tRFCab equal to 530 ns, and weak row percentage is 1%, 2%, 3%, and 5%, respectively. For the four memory-bound benchmarks, per-bank refresh scheme has worse results on the lib and sjeng benchmarks in comparison with the baseline, while our integration refresh scheme all get the best results except for the cat benchmark when weak row percentage is 5%. On the mcf benchmark, integration refresh scheme achieves near 20% of IPC improvement, and at least achieves over 5% of improvement on the sjeng benchmark. For the four CPU-bound benchmarks, our approach still has the best results except for the cact and cal benchmarks when the weak row percentage is 5%, but all get

Micromachinesless than 5% 2019 of, improvement. 10, x 12 of 20

20%

15%

10%

5%

0% lib sjeng cact mcf cal xal sop bzip2 -5%

-10%

All-bank refresh Per-bank refresh Partial-bank refresh Integration refresh

FigureFigure 14.14.Instructions-per-clock Instructions-per-clock (IPC)(IPC) improvementimprovement normalizednormalized to 1x granularity baseline baseline (tRFC (tRFC == 530 ns, weak row = 1%). 530 ns, weak row = 1%).

Figures 18–21 show the IPC improvement of the four 2x granularity refresh schemes (except per-bank refresh) over the default 1x granularity all-bank refresh baseline on the 32 Gb DRAM with tRFCab equal20% to 890 ns, and weak row percentage is 1%, 2%, 3%, and 5%, respectively. For the four memory-bound benchmarks, our integration refresh scheme all get the best results. On the mcf 15% benchmark, integration refresh scheme achieves over 45% of IPC improvement, and at least achieves over 20% of10% improvement on the sjeng benchmark. While for the four CPU-bound benchmarks, our approach still has the best result especially on the high weak row percentage environment, and near 5% the per-bank refresh scheme on the low weak row percentage environment. Because of less memory access, the impact0% of refresh operations on IPC is degraded, all four refresh schemes achieve much less of IPC improvement onlib CPU-bound sjeng benchmarks, cact mcf but our approach calstill xal has near sop 10% bzip2 of improvement. -5%

-10%

All-bank refresh Per-bank refresh Partial-bank refresh Integration refresh

Figure 15. IPC improvement normalized to 1x granularity baseline (tRFC = 530 ns, weak row = 2%).

20%

15%

10%

5%

0% lib sjeng cact mcf cal xal sop bzip2 -5%

-10%

All-bank refresh Per-bank refresh Partial-bank refresh Integration refresh

Figure 16. IPC improvement normalized to 1x granularity baseline (tRFC = 530 ns, weak row = 3%).

Micromachines 2019, 10, x 12 of 20

20%

15%

10%

5%

0% lib sjeng cact mcf cal xal sop bzip2 -5%

-10%

All-bank refresh Per-bank refresh Partial-bank refresh Integration refresh

MicromachinesFigure 201914. Instructions-per-clock, 10, 590 (IPC) improvement normalized to 1x granularity baseline (tRFC =12 of 19 530 ns, weak row = 1%).

20%

15%

10%

5%

0% lib sjeng cact mcf cal xal sop bzip2 -5%

-10%

All-bank refresh Per-bank refresh Partial-bank refresh Integration refresh

FigureFigure 15.15. IPC improvement normalizednormalized toto 1x1x granularitygranularity baselinebaseline (tRFC(tRFC == 530 ns, weak row = 2%).2%).

20%

15%

10%

5%

0% lib sjeng cact mcf cal xal sop bzip2 -5%

-10%

All-bank refresh Per-bank refresh Partial-bank refresh Integration refresh

MicromachinesFigureFigure 16.201916. IPC, 10, improvementx normalized toto 1x1x granularitygranularity baselinebaseline (tRFC(tRFC == 530 ns, weak row = 3%).3%).13 of 20

20%

15%

10%

5%

0% lib sjeng cact mcf cal xal sop bzip2 -5%

-10%

All-bank refresh Per-bank refresh Partial-bank refresh Integration refresh

Figure 17.17. IPC improvement normalized toto 1x1x granularitygranularity baselinebaseline (tRFC(tRFC == 530 ns, weak row = 5%).5%).

Figures 18–21 show the IPC improvement of the four 2x granularity refresh schemes (except per- bank refresh) over the default 1x granularity all-bank refresh baseline on the 32 Gb DRAM with tRFCab equal to 890 ns, and weak row percentage is 1%, 2%, 3%, and 5%, respectively. For the four memory-bound benchmarks, our integration refresh scheme all get the best results. On the mcf benchmark, integration refresh scheme achieves over 45% of IPC improvement, and at least achieves over 20% of improvement on the sjeng benchmark. While for the four CPU-bound benchmarks, our approach still has the best result especially on the high weak row percentage environment, and near the per-bank refresh scheme on the low weak row percentage environment. Because of less memory access, the impact of refresh operations on IPC is degraded, all four refresh schemes achieve much less of IPC improvement on CPU-bound benchmarks, but our approach still has near 10% of improvement.

50% 45% 40% 35% 30% 25% 20% 15% 10% 5% 0% lib sjeng cact mcf cal xal sop bzip2 All-bank refresh Per-bank refresh Partial-bank refresh Integration refresh

Figure 18. IPC improvement normalized to 1x granularity baseline (tRFC = 890 ns, weak row = 1%).

Micromachines 2019, 10, x 13 of 20

20%

15%

10%

5%

0% lib sjeng cact mcf cal xal sop bzip2 -5%

-10%

All-bank refresh Per-bank refresh Partial-bank refresh Integration refresh

Figure 17. IPC improvement normalized to 1x granularity baseline (tRFC = 530 ns, weak row = 5%).

Figures 18–21 show the IPC improvement of the four 2x granularity refresh schemes (except per- bank refresh) over the default 1x granularity all-bank refresh baseline on the 32 Gb DRAM with tRFCab equal to 890 ns, and weak row percentage is 1%, 2%, 3%, and 5%, respectively. For the four memory-bound benchmarks, our integration refresh scheme all get the best results. On the mcf benchmark, integration refresh scheme achieves over 45% of IPC improvement, and at least achieves over 20% of improvement on the sjeng benchmark. While for the four CPU-bound benchmarks, our approach still has the best result especially on the high weak row percentage environment, and near the per-bank refresh scheme on the low weak row percentage environment. Because of less memory access, the impact of refresh operations on IPC is degraded, all four refresh schemes achieve much Micromachinesless of IPC2019 improvement, 10, 590 on CPU-bound benchmarks, but our approach still has near 10%13 of of 19 improvement.

50% 45% 40% 35% 30% 25% 20% 15% 10% 5% 0% lib sjeng cact mcf cal xal sop bzip2 All-bank refresh Per-bank refresh Partial-bank refresh Integration refresh

MicromachinesFigureFigure 18.201918. IPC, 10, improvementx normalizednormalized toto 1x1x granularitygranularity baselinebaseline (tRFC(tRFC == 890 ns, weak row = 1%).1%).14 of 20

50% 45% 40% 35% 30% 25% 20%

15% 10% 5% 0% lib sjeng cact mcf cal xal sop bzip2 All-bank refresh Per-bank refresh Partial-bank refresh Integration refresh

Figure 19.19. IPC improvement normalized toto 1x1x granularitygranularity baselinebaseline (tRFC(tRFC=890=890 ns, weak row = 2%).2%).

50% 45% 40% 35% 30% 25% 20% 15% 10% 5% 0% lib sjeng cact mcf cal xal sop bzip2 All-bank refresh Per-bank refresh Partial-bank refresh Integration refresh

Figure 20.20. IPC improvement normalized toto 1x1x granularitygranularity baselinebaseline (tRFC(tRFC == 890 ns, weak row = 3%).3%).

50% 45% 40% 35% 30% 25% 20% 15% 10% 5% 0% lib sjeng cact mcf cal xal sop bzip2 All-bank refresh Per-bank refresh Partial-bank refresh Integration refresh

Figure 21. IPC improvement normalized to 1x granularity baseline (tRFC = 890 ns, weak row = 5%).

From the experimental results on different refresh cycle times, we see that as DRAM size increased, the 2x granularity integration refresh scheme has an even more significant effect on the performance improvement, especially for memory-bound applications. Comparing these four 2x

Micromachines 2019, 10, x 14 of 20

50% 45% 40% 35% 30% 25% 20% 15% 10% 5% 0% lib sjeng cact mcf cal xal sop bzip2 All-bank refresh Per-bank refresh Partial-bank refresh Integration refresh

Figure 19. IPC improvement normalized to 1x granularity baseline (tRFC=890 ns, weak row = 2%).

50% 45% 40% 35% 30% 25% 20% 15% 10% 5% 0% lib sjeng cact mcf cal xal sop bzip2 All-bank refresh Per-bank refresh Partial-bank refresh Integration refresh Micromachines 2019, 10, 590 14 of 19 Figure 20. IPC improvement normalized to 1x granularity baseline (tRFC = 890 ns, weak row = 3%).

50% 45% 40% 35% 30% 25% 20% 15% 10% 5% 0% lib sjeng cact mcf cal xal sop bzip2 All-bank refresh Per-bank refresh Partial-bank refresh Integration refresh

Figure 21.FigureIPC improvement21. IPC improvement normalized normalized to to 1x 1x granularity granularity baseline baseline (tRFC (tRFC = 890= ns,890 weak ns, row weak = 5%). row = 5%).

From the experimental results on different refresh cycle times, we see that as DRAM size Fromincreased, the experimental the 2x granularity results integration on different refresh refresh scheme cycle hastimes, an even we more see significant that as DRAM effect on size the increased, the 2x granularityperformance integration improvement, refresh especially scheme for memory has an-bound even moreapplications. significant Comparing effect these on the four performance 2x improvement, especially for memory-bound applications. Comparing these four 2x granularity refresh schemes, our integration refresh scheme almost has no performance degradation on all the benchmarks except cact when the weak row percentage increased from 1% to 5%. However, all three other refresh Micromachines 2019, 10, x 15 of 20 schemes have performance degradation as the weak row percentage increased, especially for the 2x granularitygranularity all-bank refresh refresh schemes, scheme. our integration Also, the refres performanceh scheme almost of per-bank has no performance refresh scheme degradation is not stable in comparisonon all with the benchmarks other refresh except schemes, cact when it hasthe weak even row worse percentage results increased than the from baseline 1% to 5%. on However, two benchmarks all three other refresh schemes have performance degradation as the weak row percentage increased, in the caseespecially of smaller for the refresh 2x granularity cycle time. all-bank The refresh reason scheme. that per-bankAlso, the performa refreshnce performed of per-bank badly refresh on the lib and sjengscheme benchmarks is not stable is due in comparison to various with memory other re accessfresh schemes, patterns it has on even different worse benchmarks.results than the Per-bank refresh schemebaseline performs on two benchmarks better on in disjointthe case of memory smaller refresh location cycle accessed time. The reason patterns, that per-bank while the refresh lib and sjeng benchmarksperformed have morebadly closedon the lib memory and sjeng location benchmarks accessed is due patterns. to various This memory shows access that patterns our approach on has different benchmarks. Per-bank refresh scheme performs better on disjoint memory location accessed the strongest ability to against thermal variation and refresh cycle time variation, and has the best patterns, while the lib and sjeng benchmarks have more closed memory location accessed patterns. performanceThis shows improvement. that our approach has the strongest ability to against thermal variation and refresh cycle Figurestime 22 variation,–25 show and the has refreshthe best energyperformance reduction improvement. of the four 2x granularity refresh schemes (except per-bank refresh)Figures over 22–25 the show default the refresh 1x granularity energy reduction all-bank of the refresh four 2x baseline granularity on therefresh 16 Gbschemes DRAM with tRFC equal(except to 530 per-bank ns, and refresh) the weakover the row default percentage 1x granularity is 1%, all-bank 2%, 3%, refresh and baseline 5%, respectively. on the 16 Gb Different DRAM with tRFC equal to 530 ns, and the weak row percentage is 1%, 2%, 3%, and 5%, respectively. from theDifferent experiments from the on experiments performance on evaluation,performance evaluation, both memory-bound both memory-bound benchmarks benchmarks and and CPU-bound benchmarksCPU-bound gained benchmarks an obvious gained improvement an obvious improvement in refresh energy in refresh reduction, energy reduction, CPU-bound CPU-bound benchmarks even obtainedbenchmarks more reductioneven obtained than mo memory-boundre reduction than benchmarks. memory-bound Also, benchmarks. our 2x granularity Also, our 2x integration refresh schemegranularity all integration get the best refresh results scheme except all get forthe best the results cact and except cal for benchmarks the cact and cal when benchmarks the weak row when the weak row percentage is 5%, and per-bank refresh scheme still performed badly for the lib percentage is 5%, and per-bank refresh scheme still performed badly for the lib and sjeng benchmarks. and sjeng benchmarks.

60%

50%

40%

30%

20%

10%

0% All-bank refresh Per-bank refresh lib sjeng cact mcf cal xal sop bzip2 Partial-bank refresh Integration refresh

Figure 22.FigureEnergy 22. Energy reduction reduction normalized normalized to to 1x 1x granularity granularity baseline baseline (tRFC (tRFC = 530= ns,530 weak ns, row weak = 1%). row = 1%).

60%

50%

40%

30%

20%

10%

0% All-bank refresh Per-bank refresh lib sjeng cact mcf cal xal sop bzip2 Partial-bank refresh Integration refresh

Figure 23. Energy reduction normalized to 1x granularity baseline (tRFC = 530 ns, weak row = 2%).

Micromachines 2019, 10, x 15 of 20 granularity refresh schemes, our integration refresh scheme almost has no performance degradation on all the benchmarks except cact when the weak row percentage increased from 1% to 5%. However, all three other refresh schemes have performance degradation as the weak row percentage increased, especially for the 2x granularity all-bank refresh scheme. Also, the performance of per-bank refresh scheme is not stable in comparison with other refresh schemes, it has even worse results than the baseline on two benchmarks in the case of smaller refresh cycle time. The reason that per-bank refresh performed badly on the lib and sjeng benchmarks is due to various memory access patterns on different benchmarks. Per-bank refresh scheme performs better on disjoint memory location accessed patterns, while the lib and sjeng benchmarks have more closed memory location accessed patterns. This shows that our approach has the strongest ability to against thermal variation and refresh cycle time variation, and has the best performance improvement. Figures 22–25 show the refresh energy reduction of the four 2x granularity refresh schemes (except per-bank refresh) over the default 1x granularity all-bank refresh baseline on the 16 Gb DRAM with tRFC equal to 530 ns, and the weak row percentage is 1%, 2%, 3%, and 5%, respectively. Different from the experiments on performance evaluation, both memory-bound benchmarks and CPU-bound benchmarks gained an obvious improvement in refresh energy reduction, CPU-bound benchmarks even obtained more reduction than memory-bound benchmarks. Also, our 2x granularity integration refresh scheme all get the best results except for the cact and cal benchmarks when the weak row percentage is 5%, and per-bank refresh scheme still performed badly for the lib and sjeng benchmarks.

60%

50%

40%

30%

20%

10%

0% All-bank refresh Per-bank refresh lib sjeng cact mcf cal xal sop bzip2 Partial-bank refresh Integration refresh Micromachines 2019, 10, 590 15 of 19 Figure 22. Energy reduction normalized to 1x granularity baseline (tRFC = 530 ns, weak row = 1%).

60%

50%

40%

30%

20%

10%

0% All-bank refresh Per-bank refresh lib sjeng cact mcf cal xal sop bzip2 Partial-bank refresh Integration refresh

FigureFigureMicromachinesMicromachines 23.23. Energy 20192019,, 1010 , reduction,reduction xx normalizednormalized toto 1x1x granularitygranularity baselinebaseline (tRFC(tRFC == 530 ns, weak row1616 == of of2%). 2020

60%60% 50%50% 40%40% 30%30% 20%20% 10%10% 0%0% liblib sjeng sjeng cact cact mcf cal xal xal sop sop bzip2 bzip2 -10%-10%

All-bankAll-bank refreshrefresh Per-bankPer-bank refresh Partial-bank refresh IntegrationIntegration refreshrefresh

FigureFigureFigure 24. Energy 24.24. EnergyEnergy reduction reductionreduction normalized normalized to to 1x 1x granularity granularity baseline baseline (tRFC (tRFC == 530530= 530 ns,ns, weakweak ns, weak rowrow == row 3%).3%).= 3%).

60%60%

50%50%

40%40%

30%30%

20%20%

10%10%

0%0% liblib sjeng sjeng cact cact mcf cal xal xal sop sop bzip2 bzip2 All-bank refresh Per-bank refresh Partial-bank refresh Integration refresh All-bank refresh Per-bank refresh Partial-bank refresh Integration refresh

Figure 25. Energy reduction normalized to 1x granularity baseline (tRFC = 530 ns, weak row = 5%). FigureFigure 25. Energy 25. Energy reduction reduction normalized normalized to to 1x 1x granularity granularity baseline baseline (tRFC (tRFC = 530= 530 ns, weak ns, weak row = row 5%).= 5%). Figures 26–29 show the refresh energy reduction of the four 2x granularity refresh schemes over FiguresFigures 26–29 26–29 show show the the refresh refresh energy energy reduction reduction of of the the four four 2x 2x granularity granularity refresh refresh schemes schemes over over the default 1x granularity all-bank refresh baseline on the 32 Gb DRAM with tRFC equal to 890 ns, the default 1x granularity all-bank refresh baseline on the 32 Gb DRAM with tRFC equal to 890 ns, the defaultand weak 1x granularity row percentage all-bank is 1%, refresh 2%, 3%, baseline and 5%, on respectively. the 32 Gb Comparing DRAM with to tRFCthe results equal of tosmaller 890 ns, and and weak row percentage is 1%, 2%, 3%, and 5%, respectively. Comparing to the results of smaller weak rowrefresh percentage cycle time is above, 1%, 2%, our 3%, integration and 5%, refresh respectively. scheme further Comparing improves to theenergy results reduction of smaller for all refresh refresh cycle time above, our integration refresh scheme further improves energy reduction for all cycle timebenchmarks above, ouron all integration weak row percentage. refresh scheme further improves energy reduction for all benchmarks benchmarks on all weak row percentage. Comparing these four refresh schemes, our 2x granularity integration refresh scheme almost has on all weakComparing row percentage. these four refresh schemes, our 2x granularity integration refresh scheme almost has no degradation on refresh energy reduction for all the benchmarks except cact when the weak row no degradation on refresh energy reduction for all the benchmarks except cact when the weak row percentage increased from 1% to 5%. However, all three other refresh schemes have obvious percentage increased from 1% to 5%. However, all three other refresh schemes have obvious degradation as the weak row percentage increased, especially for the 2x granularity all-bank refresh degradation as the weak row percentage increased, especially for the 2x granularity all-bank refresh scheme, and as does the per-bank refresh scheme and 2x granularity half-bank refresh (partial-bank scheme, and as does the per-bank refresh scheme and 2x granularity half-bank refresh (partial-bank refresh) scheme when the weak row percentage increased to 5%. When DRAM size increased from refresh) scheme when the weak row percentage increased to 5%. When DRAM size increased from 16 Gb to 32 Gb, per-bank refresh scheme has the largest improvement especially for the lib and sjeng 16 Gb to 32 Gb, per-bank refresh scheme has the largest improvement especially for the lib and sjeng benchmarks because of the finer granularity, and hence performed better than 2x granularity partial- benchmarks because of the finer granularity, and hence performed better than 2x granularity partial- bank refresh scheme on all benchmarks. Therefore, we further prove that our approach has the bank refresh scheme on all benchmarks. Therefore, we further prove that our approach has the strongest ability to against thermal variation and refresh cycle time variation, and has the largest strongest ability to against thermal variation and refresh cycle time variation, and has the largest refresh energy reduction. refresh energy reduction.

Micromachines 2019, 10, 590 16 of 19 Micromachines 2019, 10, x 17 of 20 Micromachines 2019, 10, x 17 of 20

80% 80% 70% 70% 60% 60% 50% 50% 40% 40% 30% 30% 20% 20% 10% 10% 0% 0% lib sjeng cact mcf cal xal sop bzip2 All-banklib refresh sjengPer-bank cact refresh mcfPartial-bank cal refresh xalIntegration sop refresh bzip2 All-bank refresh Per-bank refresh Partial-bank refresh Integration refresh

Figure 26. Energy reduction normalized to 1x granularity baseline (tRFC = 890 ns, weak row = 1%). FigureFigure 26. Energy 26. Energy reduction reduction normalized normalized to to 1x 1x granularitygranularity baseline baseline (tRFC (tRFC = 890= 890ns, weak ns, weak row = row 1%).= 1%).

80% 80% 70% 70% 60% 60% 50% 50% 40% 40% 30% 30% 20% 20% 10% 10% 0% 0% lib sjeng cact mcf cal xal sop bzip2 All-banklib refresh sjengPer-bank cact refresh mcfPartial-bank cal refresh xalIntegration sop refresh bzip2 All-bank refresh Per-bank refresh Partial-bank refresh Integration refresh

Figure 27. Energy reduction normalized to 1x granularity baseline (tRFC = 890 ns, weak row = 2%). FigureFigure 27. Energy 27. Energy reduction reduction normalized normalized to to 1x 1x granularitygranularity baseline baseline (tRFC (tRFC = 890= 890ns, weak ns, weak row = row 2%).= 2%).

80% 80% 70% 70% 60% 60% 50% 50% 40% 40% 30% 30% 20% 20% 10% 10% 0% 0% lib sjeng cact mcf cal xal sop bzip2 All-banklib refresh sjengPer-bank cact refresh mcfPartial-bank cal refresh xalIntegration sop refresh bzip2 All-bank refresh Per-bank refresh Partial-bank refresh Integration refresh

Figure 28. Energy reduction normalized to 1x granularity baseline (tRFC = 890 ns, weak row = 3%). FigureFigure 28. Energy 28. Energy reduction reduction normalized normalized to to 1x 1x granularitygranularity baseline baseline (tRFC (tRFC = 890= 890ns, weak ns, weak row = row 3%).= 3%).

Comparing these four refresh schemes, our 2x granularity integration refresh scheme almost has no degradation on refresh energy reduction for all the benchmarks except cact when the weak row percentage increased from 1% to 5%. However, all three other refresh schemes have obvious degradation as the weak row percentage increased, especially for the 2x granularity all-bank refresh scheme, and as does the per-bank refresh scheme and 2x granularity half-bank refresh (partial-bank refresh) scheme when the weak row percentage increased to 5%. When DRAM size increased from Micromachines 2019, 10, 590 17 of 19

16 Gb to 32 Gb, per-bank refresh scheme has the largest improvement especially for the lib and sjeng benchmarks because of the finer granularity, and hence performed better than 2x granularity partial-bank refresh scheme on all benchmarks. Therefore, we further prove that our approach has the strongest ability to against thermal variation and refresh cycle time variation, and has the largest refresh energy reduction. Micromachines 2019, 10, x 18 of 20

80% 70% 60% 50% 40% 30% 20% 10% 0% lib sjeng cact mcf cal xal sop bzip2 All-bank refresh Per-bank refresh Partial-bank refresh Integration refresh

FigureFigure 29. Energy 29. Energy reduction reduction normalized normalized to to 1x 1x granularitygranularity baseline baseline (tRFC (tRFC = 890= 890ns, weak ns, weak row = row 5%).= 5%).

6. Conclusions6. Conclusions In thisIn paper,this paper, we proposewe propose an an integration integration schemescheme for for DRAM DRAM refresh refresh based based on retention-aware on retention-aware auto-refreshauto-refresh and and 2x granularity2x granularity auto-refresh auto-refresh techniquestechniques simultaneously. simultaneously. Since Since the profile the profile of weak of weak cells distribution in memory banks are obtained after testing and measurement analysis of memory cells distribution in memory banks are obtained after testing and measurement analysis of memory wafer, we do not need to detect weak rows on the fly. In addition, since the profile are produced after wafer,wafer we do measurement, not need to we detect either weak do not rows need on to the change fly. Inrefresh addition, scheme since on the the fly, profile refresh are commands produced after waferfor measurement, different refresh we eitherbundles do are not prepared need to changebased on refresh our integration scheme onrefresh the fly, scheme refresh in commandsadvance. for differentTherefore, refresh the bundles hardware are cost prepared and control based complexity on our integration on memory refresh controller scheme design in advance.to support Therefore,this the hardwareintegration cost refresh and controlscheme complexityis acceptable on and memory reasonable. controller Experimental design toresults support show this that integration our refreshretention-aware scheme is acceptable integration and refresh reasonable. scheme can Experimental properly improve results the system show performance that our retention-aware and have integrationa great refresh reduction scheme in DRAM can properlyrefresh energy. improve From the the system results on performance different refresh and havecycle time, a great we reduction see that as DRAM size increased, the 2x granularity integration refresh scheme has even more significant in DRAM refresh energy. From the results on different refresh cycle time, we see that as DRAM effect on the performance improvement, especially for memory-bound applications. Especially when size increased,the number the of 2xweak granularity cells percentage integration increased refresh due toscheme thermal haseffect even of 3D-stacked more significant architecture, effect our on the performanceapproach improvement, has the strongest especially ability to against for memory-bound thermal variation applications. and has the smallest Especially refresh when overhead. the number of weak cellsMost percentage important of increased all, our integr dueation to thermal refresh scheme effect of has 3D-stacked high extension architecture, flexibility. ourEven approach though has the strongestfor finer abilitygrain, for to example, against 4x thermal granularity, variation our methodology and has the can smallest extend to refresh suit different overhead. granularities Mosteasily, important and the extra of all,hardware our integration cost and control refresh complexity scheme hasis acceptable high extension and reasonable flexibility. as described Even though for finerabove. grain, Also, for as example, described 4x previously, granularity, although our methodology 2x granularity canreduces extend refresh to suitinterval diff erentto one granularities half of that in 1x granularity, its refresh cycle time exceeds one half of that in 1x granularity. Therefore, not easily, and the extra hardware cost and control complexity is acceptable and reasonable as described only finer grain individually can reduce refresh overhead, the integration with a refresh scheme like above.we Also, propose as described is necessary. previously, Since an increasing although DRAM 2x granularity size and 3D-stacked reduces refresharchitecture interval are the to future one half of that indirections, 1x granularity, increased its refresh refresh cycle cycle time time and exceedsweak row one percentage half of will that be in serious 1x granularity. issues on the Therefore, refresh not only finerproblem, grain our individually methodology can is indeed reduce adequate refresh and overhead, efficient the to reduce integration refresh with overhead. a refresh Moreover, scheme like we proposethere is isserious necessary. thermal Since problem anincreasing on the 3D-stacked DRAM architecture, size and 3D-stacked the retention architecture time of cells may are vary thefuture directions,as temperature increased changes refresh in cycle the CPU time execution and weak stage, row and percentage causes the will number be serious of weak issues cells and on weak the refresh problem,cells ourdistribution methodology to worsen. is indeed This phenomenon adequate and complicates efficient the to reducerefresh problem, refresh overhead.the overhead Moreover, to there isdetect serious weak thermal rows and problem change on refresh the 3D-stacked schemes on architecture, the fly is not the retentiona simple fix. time Therefore, of cells may more vary as measurement and analysis of cells weakness under different thermal environment is necessary. With temperature changes in the CPU execution stage, and causes the number of weak cells and weak cells more detailed weak cell profiles under thermal variation, the refresh scheme can take this issue into distributionaccount toand worsen. only needs This to phenomenondetect thermal variation, complicates avoids the to refresh detect weak problem, rows and the change overhead refresh todetect weakschemes rows and on changethe fly, which refresh is important schemes onfuture the work. fly is not a simple fix. Therefore, more measurement and analysis of cells weakness under different thermal environment is necessary. With more detailed Author Contributions: W.-K.C. designed the algorithm, supervised the work, and wrote the paper; P.-Y.S and X.-L.L. designed and performed the experiments, and analyzed the data.

Micromachines 2019, 10, 590 18 of 19 weak cell profiles under thermal variation, the refresh scheme can take this issue into account and only needs to detect thermal variation, avoids to detect weak rows and change refresh schemes on the fly, which is important future work.

Author Contributions: W.-K.C. designed the algorithm, supervised the work, and wrote the paper; P.-Y.S. and X.-L.L. designed and performed the experiments, and analyzed the data. Funding: This work was supported in part by the Ministry of Science and Technology, Taiwan, under grant number MOST 108-2218-E-033-003. Conflicts of Interest: The authors declare no conflict of interest.

References

1. Mircon Corp. 8 Gb: x4, x8 TwinDie DDR3L SDRAM. 2011. Available online: https://www.micron.com/ support#SupportDocumentationandDownloads (accessed on 8 September 2019). 2. JEDEC. DDR3 SDRAM Specification. 2012. Available online: https://www.jedec.org/document_search? search_api_views_fulltext=79-3F (accessed on 8 September 2019). 3. Liu, J.; Jaiyen, B.; Veras, R.; Mutlu, O. RAIDR: Retention-Aware Intelligent DRAM Refresh. In Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA), Portland, OR, USA, 9–13 June 2012; pp. 1–12. 4. Cheng, W.K.; Shen, P.Y. Retention-aware Refresh Techniques for DRAM Refresh Power Reduction. In Proceedings of the 20th Workshop on Synthesis and System Integration of Mixed Information Technologies (SASIMI), Kyoto, Japan, 24–25 October 2016; pp. 90–93. 5. Cheng, W.K.; Li, X.L.; Chen, J.K. Integration Scheme for Retention-aware DRAM Refresh. In Proceedings of the 2017 International Conference on Electron Devices and Solid-State Circuits (EDSSC), Hsinchu, Taiwan, 18–20 October 2017. 6. Cheng, W.K.; Li, X.L.; Chen, J.K. DRAM Refresh Improvement with Bank Reordering. In Proceedings of the 2018 7th International Symposium on Next-Generation Electronics (ISNE), Taipei, Taiwan, 7–9 May 2018. 7. Thakkar, I.G.; Pasricha, S. Massed Refresh: An Energy-Efficient Technique to Reduce Refresh Overhead in Hybrid Memory Cube Architectures. In Proceedings of the 2016 29th International Conference on VLSI Design and 2016 15th International Conference on Embedded Systems (VLSID), Kolkata, India, 4–8 January 2016. 8. Cheng, W.K.; Chen, J.K.; Huang, S.H. Integration of Retention-aware Refresh and BISR Techniques for DRAM Refresh Power Reduction. In Proceedings of the 2018 International SOC Design Conference (ISOCC), Daegu, Korea, 12–15 November 2018. 9. Chang, K.K.; Lee, D.; Chishti, Z.; Alameldeen, A.R.; Wilkerson, C.; Kim, Y.; Mutlu, O. Improving DRAM Performance by Parallelizing Refreshes with Accesses. In Proceedings of the 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), Orlando, FL, USA, 15–19 February 2014. 10. Gong, Y.-H.; Chun, S.W. Exploiting refresh effect of dram read operations: A practical approach to low-power refresh. IEEE Trans. Comput. 2016, 65, 1507–1517. [CrossRef] 11. Bhati, I.; Chishti, Z.; Lu, S.-L.; Jacob, B. Flexible Auto-Refresh: Enabling Scalable and Energy-Efficient DRAM Refresh Reductions. In Proceedings of the 2015/ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), Portland, OR, USA, 13–17 June 2015; pp. 235–245. 12. Bhati, I.; Chishti, Z.; Jacob, B. Coordinated Refresh: Energy Efficient Techniques for DRAM Refresh Scheduling. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), Beijing, China, 4–6 September 2013; pp. 205–210. 13. Guo, Y.; Huang, P.; Young, B.; Lu, T.; He, X.; Liu, Q.G. Alleviating DRAM Refresh Overhead via Inter-rank Piggyback Caching. In Proceedings of the 2015 IEEE 23rd International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Atlanta, GA, USA, 5–7 October 2015; pp. 23–32. 14. JEDEC. LPDDR3 SDRAM Specification. 2015. Available online: https://www.jedec.org/document_search? search_api_views_fulltext=209-3C (accessed on 8 September 2019). 15. JEDEC. DDR4 SDRAM Specification; No.79-4B. 2017. Available online: https://www.jedec.org/document_ search?search_api_views_fulltext=79-4B (accessed on 8 September 2019). Micromachines 2019, 10, 590 19 of 19

16. Binkert, N.; Beckmann, B.; Black, G.; Reinhardt, S.K.; Saidi, A.; Basu, A.; Hestness, J.; Hower, D.R.; Krishna, T.; Sardashti, S.; et al. The gem5 Simulator. ACM SIGARCH Comput. Archit. News 2011, 39, 1–7. [CrossRef] 17. Rosenfeld, P.; Cooper-Balis, E.; Jacob, B. DRAMSim2: A Cycle Accurate Memory System Simulator. IEEE Comput. Archit. Letters 2011, 10, 16–19. [CrossRef]

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).