
Review: DRAM Controller: Functions

 Ensure correct operation of DRAM (refresh and timing)
 Service DRAM requests while obeying the timing constraints of DRAM chips
  Constraints: resource conflicts (bank, bus, channel), minimum write-to-read delays

 Translate requests to DRAM command sequences

 Buffer and schedule requests to improve performance
  Reordering, row-buffer, bank, rank, bus management

 Manage power consumption and thermals in DRAM
  Turn on/off DRAM chips, manage power modes

DRAM Power Management

 DRAM chips have power modes
  Idea: When not accessing a chip, power it down
 Power states
  Active (highest power)
  All banks idle
  Power-down
  Self-refresh (lowest power)
 Tradeoff: State transitions incur latency during which the chip cannot be accessed

Review: Why are DRAM Controllers Difficult to Design?

 Need to obey DRAM timing constraints for correctness (see the timing-constraint sketch below)
  There are many (50+) timing constraints in DRAM
  tWTR: Minimum number of cycles to wait before issuing a read command after a write command is issued
  tRC: Minimum number of cycles between the issuing of two consecutive activate commands to the same bank
  …
 Need to keep track of many resources to prevent conflicts
  Channels, banks, ranks, data bus, address bus, row buffers
 Need to handle DRAM refresh
 Need to optimize for performance (in the presence of constraints)
  Reordering is not simple
  Predicting the future?

Review: Many DRAM Timing Constraints

 From Lee et al., “DRAM-Aware Last-Level Writeback: Reducing Write-Caused Interference in Memory Systems,” HPS Technical Report, April 2010.

Review: More on DRAM Operation

 Kim et al., “A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM,” ISCA 2012.
 Lee et al., “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,” HPCA 2013.
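To make the timing-constraint bookkeeping above concrete, here is a minimal C sketch of how a scheduler could record when the last commands were issued to a bank and check a small subset of constraints (tWTR, tRC, tRCD) before issuing a new command. The structure names, constraint values, and cycle units are assumptions for illustration, not taken from the slides or any real controller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative timing parameters in controller cycles (assumed values). */
enum { T_WTR = 6, T_RC = 34, T_RCD = 9 };

typedef enum { CMD_ACTIVATE, CMD_READ, CMD_WRITE, CMD_PRECHARGE } cmd_t;

typedef struct {
    uint64_t last_activate;   /* cycle of the last ACTIVATE to this bank   */
    uint64_t last_write_data; /* cycle the last write burst finished       */
    bool     row_open;
    int      open_row;
} bank_state_t;

/* Return true if 'cmd' may be issued to 'bank' at cycle 'now' without
 * violating the (small subset of) timing constraints modeled here. */
static bool can_issue(const bank_state_t *bank, cmd_t cmd, uint64_t now)
{
    switch (cmd) {
    case CMD_ACTIVATE:
        /* tRC: minimum gap between two ACTIVATEs to the same bank. */
        return now - bank->last_activate >= T_RC;
    case CMD_READ:
        /* tRCD: row must have been open long enough;
         * tWTR: respect the write-to-read turnaround. */
        return bank->row_open &&
               now - bank->last_activate   >= T_RCD &&
               now - bank->last_write_data >= T_WTR;
    case CMD_WRITE:
        return bank->row_open && now - bank->last_activate >= T_RCD;
    case CMD_PRECHARGE:
        return true; /* tRAS/tWR checks omitted in this sketch */
    }
    return false;
}
```

A real controller tracks dozens of such constraints per bank, rank, and channel, and also arbitrates for the shared data and address buses; the sketch only shows the shape of the bookkeeping.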

Self-Optimizing DRAM Controllers

 Engin Ipek, Onur Mutlu, José F. Martínez, and Rich Caruana,
 "Self Optimizing Memory Controllers: A Reinforcement Learning Approach"
 Proceedings of the 35th International Symposium on Computer Architecture (ISCA), pages 39-50, Beijing, China, June 2008.

 Problem: DRAM controllers are difficult to design. It is difficult for human designers to design a policy that can adapt itself very well to different workloads and different system conditions.

 Idea: Design a memory controller that adapts its scheduling policy decisions to workload behavior and system conditions using machine learning.
  Goal: Learn to choose actions to maximize r_0 + γ·r_1 + γ²·r_2 + … (0 ≤ γ < 1)

 Observation: Reinforcement learning maps nicely to memory control.

 Design: Memory controller is a reinforcement learning agent that dynamically and continuously learns and employs the best scheduling policy.

Ipek+, “Self Optimizing Memory Controllers: A Reinforcement Learning Approach,” ISCA 2008.

Self-Optimizing DRAM Controllers

 Dynamically adapt the memory scheduling policy via interaction with the system at runtime
  Associate system states and actions (commands) with long-term reward values
  Schedule the command with the highest estimated long-term value in each state

 Continuously update state-action values based on feedback from system


States, Actions, Rewards

❖ State attributes
 • Number of reads, writes, and load misses in transaction queue
 • Number of pending writes and ROB heads waiting for referenced row
 • Request’s relative ROB order

❖ Actions
 • Activate
 • Write
 • Read - load miss
 • Read - store miss
 • Precharge - pending
 • Precharge - preemptive
 • NOP

❖ Reward function
 • +1 for scheduling Read and Write commands
 • 0 at all other times
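The slide above lists the states, actions, and reward used by Ipek et al.; the following C sketch shows, under simplified assumptions, how a controller could keep a table of state-action values, pick the legal command with the highest estimated long-term value, and update the table with a SARSA-style discounted update. The state encoding, table size, and learning constants are illustrative; the actual design uses CMAC-based function approximation rather than a flat table.

```c
/* Commands (actions) considered by the scheduler, per the slide above. */
typedef enum {
    ACT_ACTIVATE, ACT_WRITE, ACT_READ_LOAD, ACT_READ_STORE,
    ACT_PRE_PENDING, ACT_PRE_PREEMPTIVE, ACT_NOP, NUM_ACTIONS
} action_t;

#define NUM_STATES 4096           /* assumed: quantized state attributes hashed to an index */
static float Q[NUM_STATES][NUM_ACTIONS];

static const float ALPHA = 0.1f;  /* learning rate (assumed)                         */
static const float GAMMA = 0.95f; /* discount factor: the gamma in r0 + g*r1 + g^2*r2 */

/* Pick the legal action with the highest estimated long-term value. */
static action_t choose_action(int state, const int legal[NUM_ACTIONS])
{
    action_t best = ACT_NOP;
    float best_q = -1e30f;
    for (int a = 0; a < NUM_ACTIONS; a++) {
        if (legal[a] && Q[state][a] > best_q) {
            best_q = Q[state][a];
            best = (action_t)a;
        }
    }
    return best;
}

/* SARSA-style update after observing reward r (+1 for scheduling a READ
 * or WRITE, 0 otherwise) and the next state/action pair. */
static void update(int s, action_t a, float r, int s_next, action_t a_next)
{
    Q[s][a] += ALPHA * (r + GAMMA * Q[s_next][a_next] - Q[s][a]);
}
```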

Performance Results
 [Figure: performance results of the RL-based memory controller]

Self-Optimizing DRAM Controllers

 Advantages
  + Adapts the scheduling policy dynamically to changing workload behavior and to maximize a long-term target
  + Reduces the designer’s burden in finding a good scheduling policy. Designer specifies:
   1) What system variables might be useful
   2) What target to optimize, but not how to optimize it

 Disadvantages
  -- Black box: designer much less likely to implement what she cannot easily reason about
  -- How to specify different reward functions that can achieve different objectives? (e.g., fairness, QoS)


Major Trends Affecting Main Memory

 Need for main memory capacity, bandwidth, QoS increasing
  Multi-core: increasing number of cores
  Data-intensive applications: increasing demand/hunger for data
  Consolidation: cloud computing, GPUs, mobile

 Main memory energy/power is a key system design concern
  ~40-50% of energy spent in the off-chip memory hierarchy [Lefurgy, IEEE Computer 2003]
  DRAM consumes power even when not used (periodic refresh)

 DRAM technology scaling is ending
  ITRS projects DRAM will not scale easily below X nm
  Scaling has provided many benefits: higher capacity (density), lower cost, lower energy

The DRAM Scaling Problem

 DRAM stores charge in a capacitor (charge-based memory)
  Capacitor must be large enough for reliable sensing
  Access transistor should be large enough for low leakage and high retention time
 Scaling beyond 40-35nm (2013) is challenging [ITRS, 2009]
 DRAM capacity, cost, and energy/power hard to scale

Solution 1: Tolerate DRAM

 Overcome DRAM shortcomings with
  System-DRAM co-design
  Novel DRAM architectures, interfaces, functions
  Better waste management (efficient utilization)
 Key issues to tackle
  Reduce refresh energy
  Improve bandwidth and latency
  Reduce waste
  Enable reliability at low cost
 Liu, Jaiyen, Veras, Mutlu, “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.
 Kim, Seshadri, Lee+, “A Case for Exploiting Subarray-Level Parallelism in DRAM,” ISCA 2012.
 Lee+, “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,” HPCA 2013.
 Liu+, “An Experimental Study of Data Retention Behavior in Modern DRAM Devices,” ISCA 2013.
 Seshadri+, “RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data,” 2013.

Tolerating DRAM: System-DRAM Co-Design

New DRAM Architectures

 RAIDR: Reducing Refresh Impact
 TL-DRAM: Reducing DRAM Latency
 SALP: Reducing Bank Conflict Impact
 RowClone: Fast Bulk Data Copy and Initialization

RAIDR: Reducing DRAM Refresh Impact

DRAM Refresh

 DRAM capacitor charge leaks over time
 The memory controller needs to refresh each row periodically to restore charge
  Activate + precharge each row every N ms
  Typical N = 64 ms
 Downsides of refresh
  -- Energy consumption: Each refresh consumes energy
  -- Performance degradation: DRAM rank/bank unavailable while refreshed
  -- QoS/predictability impact: (Long) pause times during refresh
  -- Refresh rate limits DRAM density scaling
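For a rough feel of the refresh cost described above, the short C program below computes how often auto-refresh commands must be issued and what fraction of time a rank is unavailable. The parameter values are illustrative DDR3-like numbers assumed here, not figures from the slides.

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative DDR3-like parameters (assumed values). */
    const double retention_ms = 64.0;   /* every row refreshed within 64 ms      */
    const double refresh_cmds = 8192.0; /* auto-refresh commands per 64 ms window */
    const double tRFC_ns      = 260.0;  /* time one refresh command blocks a rank */

    double tREFI_us  = retention_ms * 1000.0 / refresh_cmds;   /* ~7.8 us */
    double busy_frac = (tRFC_ns / 1000.0) / tREFI_us;          /* fraction of time unavailable */

    printf("refresh command every %.1f us\n", tREFI_us);
    printf("rank unavailable %.1f%% of the time\n", busy_frac * 100.0);
    /* Both the unavailability fraction and the refresh energy grow with
       device capacity, which is why refresh limits DRAM density scaling. */
    return 0;
}
```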

Refresh Today: Auto Refresh

 A batch of rows is periodically refreshed via the auto-refresh command
 [Figure: DRAM controller, DRAM bus, banks 0-3 (rows and columns), and row buffer]

Refresh Overhead: Performance

 [Figure: refresh-induced performance loss, 8% today and projected 46% for future high-density devices]

Refresh Overhead: Energy

 [Figure: refresh-induced energy overhead, 15% today and projected 47% for future high-density devices]

Problem with Conventional Refresh

 Today: Every row is refreshed at the same rate
 Observation: Most rows can be refreshed much less often without losing data [Kim+, EDL’09]
 Problem: No support in DRAM for different refresh rates per row

Retention Time of DRAM Rows

 Observation: Only very few rows need to be refreshed at the worst-case rate
 Can we exploit this to reduce refresh operations at low cost?

Reducing DRAM Refresh Operations

 Idea: Identify the retention time of different rows and refresh each row at the frequency it needs to be refreshed
 (Cost-conscious) Idea: Bin the rows according to their minimum retention times and refresh rows in each bin at the refresh rate specified for the bin
  e.g., a bin for 64-128ms, another for 128-256ms, …
  Observation: Only very few rows need to be refreshed very frequently [64-128ms] → have only a few bins → low HW overhead to achieve large reductions in refresh operations
 Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.

RAIDR: Mechanism

1. Profiling: Profile the retention time of all DRAM rows
  can be done at DRAM design time or dynamically

2. Binning: Store rows into bins by retention time
  use Bloom Filters for efficient and scalable storage
  1.25KB storage in controller for 32GB DRAM memory

3. Refreshing: Memory controller refreshes rows in different bins at different rates
  probe Bloom Filters to determine refresh rate of a row

1. Profiling

 [Figure: profiling the retention time of DRAM rows]

2. Binning

 How to efficiently and scalably store rows into retention time bins?
 Use Hardware Bloom Filters [Bloom, CACM 1970]

Bloom Filter Operation Example

 [Figure: step-by-step example of inserting row addresses into a Bloom filter and testing membership]

Benefits of Bloom Filters as Bins

 False positives: a row may be declared present in the Bloom filter even if it was never inserted

 Not a problem: Refresh some rows more frequently than needed

 No false negatives: rows are never refreshed less frequently than needed (no correctness problems)

 Scalable: a Bloom filter never overflows (unlike a fixed-size table)

 Efficient: No need to store info on a per-row basis; simple hardware  1.25 KB for 2 filters for 32 GB DRAM system
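To illustrate how Bloom filters can serve as retention-time bins (a sketch under assumed filter sizes and hash functions, not RAIDR's exact hardware), the C code below inserts row addresses into a small bit array with k hashes and probes the filters to pick a refresh interval. As noted above, a false positive only causes a row to be refreshed more often than needed, never less.

```c
#include <stdbool.h>
#include <stdint.h>

#define FILTER_BITS 2048   /* a 256 B filter, matching the 64-128 ms bin size assumed here */
#define NUM_HASHES  10

typedef struct { uint8_t bits[FILTER_BITS / 8]; } bloom_t;

/* Simple multiplicative hashes for illustration; real hardware would use
 * cheap H3-class hash functions. */
static uint32_t hash(uint32_t row, int i)
{
    uint32_t x = row * 2654435761u + (uint32_t)i * 40503u;
    x ^= x >> 15;
    return x % FILTER_BITS;
}

static void bloom_insert(bloom_t *f, uint32_t row)
{
    for (int i = 0; i < NUM_HASHES; i++) {
        uint32_t b = hash(row, i);
        f->bits[b / 8] |= (uint8_t)(1u << (b % 8));
    }
}

static bool bloom_maybe_contains(const bloom_t *f, uint32_t row)
{
    for (int i = 0; i < NUM_HASHES; i++) {
        uint32_t b = hash(row, i);
        if (!(f->bits[b / 8] & (1u << (b % 8))))
            return false;          /* definitely not in this bin */
    }
    return true;                   /* possibly in this bin (may be a false positive) */
}

/* Pick a refresh interval for a row: most rows use the default 256 ms bin;
 * weak rows were inserted into the 64 ms or 128 ms bins during profiling. */
static int refresh_interval_ms(const bloom_t *bin64, const bloom_t *bin128, uint32_t row)
{
    if (bloom_maybe_contains(bin64, row))  return 64;
    if (bloom_maybe_contains(bin128, row)) return 128;
    return 256;
}
```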


3. Refreshing (RAIDR Refresh Controller)

 [Figure: RAIDR refresh controller operation]
 Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.

Tolerating Temperature Changes

 [Figure: adjusting refresh rates to tolerate changes in operating temperature]

RAIDR: Baseline Design

 Refresh control is in DRAM in today’s auto-refresh systems
 RAIDR can be implemented in either the controller or DRAM

RAIDR in Memory Controller: Option 1

 Overhead of RAIDR in DRAM controller:
  1.25 KB Bloom Filters, 3 counters, additional commands issued for per-row refresh (all accounted for in evaluations)

RAIDR in DRAM Chip: Option 2

 Overhead of RAIDR in DRAM chip:
  Per-chip overhead: 20B Bloom Filters, 1 counter (4 Gbit chip)
  Total overhead: 1.25KB Bloom Filters, 64 counters (32 GB DRAM)

RAIDR Results

 Baseline: 32 GB DDR3 DRAM system (8 cores, 512KB cache/core)
  64ms refresh interval for all rows
 RAIDR:
  64–128ms retention range: 256 B Bloom filter, 10 hash functions
  128–256ms retention range: 1 KB Bloom filter, 6 hash functions
  Default refresh interval: 256 ms
 Results on SPEC CPU2006, TPC-C, TPC-H benchmarks
  74.6% refresh reduction
  ~16%/20% DRAM dynamic/idle power reduction
  ~9% performance improvement

RAIDR Refresh Reduction

 [Figure: refresh operations in a 32 GB DDR3 DRAM system, baseline auto-refresh vs. RAIDR]


RAIDR: Performance

 RAIDR performance benefits increase with workload’s memory intensity

RAIDR: DRAM Energy Efficiency

 RAIDR energy benefits increase with memory idleness

DRAM Device Capacity Scaling: Performance

 RAIDR performance benefits increase with DRAM chip capacity

DRAM Device Capacity Scaling: Energy

 RAIDR energy benefits increase with DRAM chip capacity

More Readings Related to RAIDR

 Jamie Liu, Ben Jaiyen, Yoongu Kim, Chris Wilkerson, and Onur Mutlu,
 "An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms"
 Proceedings of the 40th International Symposium on Computer Architecture (ISCA), Tel-Aviv, Israel, June 2013.

New DRAM Architectures

 RAIDR: Reducing Refresh Impact
 TL-DRAM: Reducing DRAM Latency
 SALP: Reducing Bank Conflict Impact
 RowClone: Fast Bulk Data Copy and Initialization

Tiered-Latency DRAM: Reducing DRAM Latency

 Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, and Onur Mutlu,
 "Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture"
 Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013.

Historical DRAM Latency-Capacity Trend

 [Figure: DRAM capacity (Gb) and latency (tRC, ns) from 2000 to 2011: capacity grew 16X while latency improved only about 20%]
 DRAM latency continues to be a critical bottleneck

What Causes the Long Latency?

 [Figure: DRAM chip organization: banks, subarrays, cell array, row decoder, sense amplifiers, column mux, and I/O; the access path goes through a subarray and the I/O channel]
 DRAM Latency = Subarray Latency + I/O Latency
  Subarray latency is dominant

Why is the Subarray So Slow?

 [Figure: subarray structure (row decoder, 512 cells per bitline, one sense amplifier per bitline) and cell structure (wordline, access transistor, capacitor, bitline)]
 • Long bitline
  – Amortizes sense amplifier cost → Small area
  – Large bitline capacitance → High latency & power
 • Trade-Off: Area vs. Latency

Trade-Off: Area (Die Size) vs. Latency

 [Figure: long bitline with a large sense amplifier shared by many cells (smaller area, slower) vs. short bitline (larger area, faster)]

Trade-Off: Area (Die Size) vs. Latency

 [Figure: normalized DRAM die area vs. latency (ns) for 32-512 cells/bitline; commodity DRAM (long bitline, 512 cells) has small area but high latency, while short-bitline DRAM is fast but costly]

Approximating the Best of Both Worlds

 Long Bitline: Small Area, High Latency
 Short Bitline: Large Area, Low Latency
 Our Proposal: Small Area, Low Latency
  Need: add isolation transistors so that a short (near) bitline segment can be accessed fast

Tiered-Latency DRAM

 • Divide a bitline into two segments with an isolation transistor
 [Figure: far segment and near segment separated by an isolation transistor, with the sense amplifier at the near end; the far segment keeps the small area of a long bitline, the near segment provides low latency]

Near Segment Access

 • Turn off the isolation transistor
  – Reduced bitline length → reduced bitline capacitance → Low latency & low power

Far Segment Access

 • Turn on the isolation transistor
  – Long bitline length → large bitline capacitance
  – Additional resistance of isolation transistor
  → High latency & high power

Latency, Power, and Area Evaluation

 • Commodity DRAM: 512 cells/bitline
 • TL-DRAM: 512 cells/bitline
  – Near segment: 32 cells
  – Far segment: 480 cells
 • Latency Evaluation
  – SPICE simulation using circuit-level DRAM model
 • Power and Area Evaluation
  – DRAM area/power simulator from Rambus
  – DDR3 energy calculator from Micron

Commodity DRAM vs. TL-DRAM

 • DRAM Latency (tRC): near segment -56%, far segment +23% relative to commodity DRAM (52.5ns)
 • DRAM Power: near segment -51%, far segment +49% relative to commodity DRAM
 • DRAM Area Overhead ~3%: mainly due to the isolation transistors

Latency vs. Near Segment Length

 [Figure: latency (ns) vs. near segment length (1-512 cells) for near and far segments; Far Segment Length = 512 - Near Segment Length]
 Longer near segment length leads to higher near segment latency
 Far segment latency is higher than commodity DRAM latency

Trade-Off: Area (Die-Area) vs. Latency

 [Figure: normalized area vs. latency with TL-DRAM's near and far segments added to the area-latency trade-off curve]

Leveraging Tiered-Latency DRAM

 • TL-DRAM is a substrate that can be leveraged by the hardware and/or software
 • Many potential uses
  1. Use near segment as hardware-managed inclusive cache to far segment
  2. Use near segment as hardware-managed exclusive cache to far segment
  3. Profile-based page mapping by operating system
  4. Simply replace DRAM with TL-DRAM

Near Segment as Hardware-Managed Cache

 [Figure: TL-DRAM subarray with the far segment used as main memory and the near segment used as a cache]
 • Challenge 1: How to efficiently migrate a row between segments?
 • Challenge 2: How to efficiently manage the cache?

Inter-Segment Migration

 • Goal: Migrate source row into destination row
 • Naïve way: Memory controller reads the source row byte by byte and writes to the destination row byte by byte
  → High latency
 • Our way:
  – Source and destination cells share bitlines
  – Transfer data from source to destination across shared bitlines concurrently
   Step 1: Activate source row
   Step 2: Activate destination row to connect cell and bitline
 • Migration is overlapped with source row access
  – Additional ~4ns over row access latency
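A minimal sketch of the caching idea above, with data structures and helper names invented here for illustration (this is not the paper's implementation): the controller checks a small tag store on each access, serves hits from the low-latency near segment, and on a miss performs a normal far-segment access plus the two-activation migration described above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NEAR_ROWS 32  /* assumed near-segment size (rows per subarray) */

typedef struct {
    bool     valid[NEAR_ROWS];
    uint32_t far_row[NEAR_ROWS];  /* which far-segment row each near row caches */
} near_cache_t;

/* Hypothetical backend hooks; stubbed out here so the sketch compiles. */
static void activate_near(int subarray, int near_row)
{ printf("ACT near  sa=%d row=%d\n", subarray, near_row); }
static void activate_far(int subarray, uint32_t far_row)
{ printf("ACT far   sa=%d row=%u\n", subarray, far_row); }
static void migrate_far_to_near(int subarray, uint32_t far_row, int near_row)
{ printf("MIGRATE   sa=%d far=%u -> near=%d\n", subarray, far_row, near_row); }

static int lookup(const near_cache_t *c, uint32_t far_row)
{
    for (int i = 0; i < NEAR_ROWS; i++)
        if (c->valid[i] && c->far_row[i] == far_row)
            return i;
    return -1;
}

/* Serve an access to 'far_row'; returns true on a near-segment (fast) hit. */
static bool access_row(near_cache_t *c, int subarray, uint32_t far_row)
{
    int slot = lookup(c, far_row);
    if (slot >= 0) {
        activate_near(subarray, slot);            /* low-latency access */
        return true;
    }
    activate_far(subarray, far_row);              /* normal (slow) access */
    slot = (int)(far_row % NEAR_ROWS);            /* trivial replacement policy (assumed) */
    migrate_far_to_near(subarray, far_row, slot); /* overlapped with the access */
    c->valid[slot]   = true;
    c->far_row[slot] = far_row;
    return false;
}

int main(void)
{
    near_cache_t cache = {0};
    access_row(&cache, 0, 1000);  /* miss: far access + migration */
    access_row(&cache, 0, 1000);  /* hit: near-segment access     */
    return 0;
}
```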

Evaluation Methodology

 • System simulator
  – CPU: Instruction-trace-based simulator
  – Memory: Cycle-accurate DDR3 DRAM simulator
 • Workloads
  – 32 benchmarks from TPC, STREAM, SPEC CPU2006
 • Performance Metrics
  – Single-core: Instructions-Per-Cycle
  – Multi-core: Weighted speedup

Configurations

 • System configuration
  – CPU: 5.3GHz
  – LLC: 512kB private per core
  – Memory: DDR3-1066
   • 1-2 channels, 1 rank/channel
   • 8 banks, 32 subarrays/bank, 512 cells/bitline
   • Row-interleaved mapping & closed-row policy
 • TL-DRAM configuration
  – Total bitline length: 512 cells/bitline
  – Near segment length: 1-256 cells
  – Hardware-managed inclusive cache: near segment

Performance & Power Consumption

 [Figure: normalized performance (+12.4%, +11.5%, +10.7%) and power (-23%, -24%, -26%) for 1-, 2-, and 4-core (channel) configurations]
 Using the near segment as a cache improves performance and reduces power consumption

Single-Core: Varying Near Segment Length

 [Figure: IPC improvement vs. near segment length (1-256 cells); maximum IPC improvement at an intermediate length, since a longer near segment gives larger cache capacity but higher cache access latency]
 By adjusting the near segment length, we can trade off cache capacity for cache latency

Other Mechanisms & Results

 • More mechanisms for leveraging TL-DRAM
  – Hardware-managed exclusive caching mechanism
  – Profile-based page mapping to near segment
  – TL-DRAM improves performance and reduces power consumption with other mechanisms
 • More than two tiers
  – Latency evaluation for three-tier TL-DRAM
 • Detailed circuit evaluation for DRAM latency and power consumption
  – Examination of tRC and tRCD
 • Implementation details and storage cost analysis in memory controller

Summary of TL-DRAM

 • Problem: DRAM latency is a critical performance bottleneck
 • Our Goal: Reduce DRAM latency with low area cost
 • Observation: Long bitlines in DRAM are the dominant source of DRAM latency
 • Key Idea: Divide long bitlines into two shorter segments
  – Fast and slow segments
 • Tiered-Latency DRAM: Enables latency heterogeneity in DRAM
  – Can leverage this in many ways to improve performance and reduce power consumption
 • Results: When the fast segment is used as a cache to the slow segment
  → Significant performance improvement (>12%) and power reduction (>23%) at low area cost (3%)

New DRAM Architectures

 RAIDR: Reducing Refresh Impact
 TL-DRAM: Reducing DRAM Latency
 SALP: Reducing Bank Conflict Impact
 RowClone: Fast Bulk Data Copy and Initialization

Subarray-Level Parallelism: Reducing Bank Conflict Impact

 Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, and Onur Mutlu,
 "A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM"
 Proceedings of the 39th International Symposium on Computer Architecture (ISCA), Portland, OR, June 2012.

The Memory Bank Conflict Problem

 Two requests to the same bank are serviced serially
 Problem: Costly in terms of performance and power
 Goal: We would like to reduce bank conflicts without increasing the number of banks (at low cost)
 Idea: Exploit the internal sub-array structure of a DRAM bank to parallelize bank conflicts
  By reducing global sharing of hardware between sub-arrays
 Kim, Seshadri, Lee, Liu, Mutlu, “A Case for Exploiting Subarray-Level Parallelism in DRAM,” ISCA 2012.

The Problem with Memory Bank Conflicts

 • Two Banks: requests served in parallel
 • One Bank: requests serialized
 [Figure: command timelines illustrating 1. Serialization, 2. Write Penalty, 3. Row-Buffer Thrashing under bank conflicts]

Goal

 • Goal: Mitigate the detrimental effects of bank conflicts in a cost-effective manner
 • Naïve solution: Add more banks → Very expensive
 • Cost-effective solution: Approximate the benefits of more banks without adding more banks

Key Observation #1

 A DRAM bank is divided into subarrays
 [Figure: logical bank (32k rows, one row-buffer) vs. physical bank (64 subarrays, each with a local row-buffer)]
 A single row-buffer cannot drive all rows → many local row-buffers, one at each subarray

Key Observation #2

 Each subarray is mostly independent…
  except occasionally sharing global structures (global row-buffer, global row decoder)

Key Idea: Reduce Sharing of Globals

 1. Parallel access to subarrays
 2. Utilize multiple local row-buffers

Overview of Our Mechanism

 [Figure: requests to the same bank but different subarrays]
 1. Parallelize: serve requests to the same bank but different subarrays concurrently
 2. Utilize multiple local row-buffers
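A sketch of the controller-side scheduling idea above, with the request and queue structures invented here for illustration: track which subarray each request maps to, and when the needed subarray of a bank is busy, pick a ready request to a different, idle subarray of the same bank so the two accesses overlap instead of serializing.

```c
#include <stdbool.h>
#include <stddef.h>

#define SUBARRAYS_PER_BANK 8   /* assumed */

typedef struct {
    int  bank;
    int  subarray;
    int  row;
    bool activated;            /* ACTIVATE already issued for this request */
} request_t;

typedef struct {
    bool busy[SUBARRAYS_PER_BANK];   /* subarray currently has a raised wordline */
} bank_status_t;

/* Pick a request whose subarray is idle, preferring ones in the same bank
 * as 'blocked' so the bank conflict is overlapped rather than serialized. */
static request_t *pick_overlappable(request_t *queue, int n,
                                    const bank_status_t *banks,
                                    const request_t *blocked)
{
    for (int i = 0; i < n; i++) {
        request_t *r = &queue[i];
        if (r->activated)
            continue;
        if (r->bank == blocked->bank &&
            r->subarray != blocked->subarray &&
            !banks[r->bank].busy[r->subarray])
            return r;   /* same bank, different (idle) subarray: parallelize */
    }
    return NULL;
}
```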

Challenges: Global Structures

 1. Global Address Latch
 2. Global Bitlines

Challenge #1. Global Address Latch

 [Figure: a single global address latch drives the wordlines of all subarrays]
 Problem: Only one raised wordline at a time

Solution #1. Subarray Address Latch

 [Figure: per-subarray address latches]
 Global latch → local latches

Challenge #2. Global Bitlines

 [Figure: local row-buffers connected to the global row-buffer through shared global bitlines]
 Problem: Collision during access

Solution #2. Designated-Bit Latch

 [Figure: a designated-bit (D) latch and switch per local row-buffer]
 Selectively connect local row-buffers to the global row-buffer

MASA (Multitude of Activated Subarrays)

 Challenge #1 (only one raised wordline) → Solution: Subarray Address Latch
 Challenge #2 (collision on global bitlines) → Solution: Designated-Bit Latch

MASA: Advantages

 Baseline (Subarray-Oblivious): 1. Serialization, 2. Write Penalty, 3. Thrashing
 MASA: overlapped service of requests to different subarrays saves time
 [Figure: command timelines for baseline vs. MASA]

MASA: Overhead

 DRAM Die Size: 0.15% increase
  Subarray Address Latches
  Designated-Bit Latches & Wire
 DRAM Static Energy: Small increase
  0.56mW for each activated subarray
  But saves dynamic energy
 Controller: Small additional storage
  Keep track of subarray status (< 256B)
  Keep track of new timing constraints

Cheaper Mechanisms

 [Table: SALP-1, SALP-2, and MASA; which of the three problems (1. Serialization, 2. Write Penalty, 3. Thrashing) each mitigates and which latches each requires]

System Configuration

 System Configuration
  CPU: 5.3GHz, 128 ROB, 8 MSHR
  LLC: 512kB per-core slice
 Memory Configuration
  DDR3-1066
  (default) 1 channel, 1 rank, 8 banks, 8 subarrays-per-bank
  (sensitivity) 1-8 chans, 1-8 ranks, 8-64 banks, 1-128 subarrays
 Mapping & Row-Policy
  (default) Line-interleaved & Closed-row
  (sensitivity) Row-interleaved & Open-row
 DRAM Controller Configuration
  64-/64-entry read/write queues per-channel
  FR-FCFS, batch scheduling for writes

SALP: Single-core Results

 [Figure: IPC improvement of MASA (17%) approaching "Ideal" more banks (20%)]
 MASA achieves most of the benefit of having more banks ("Ideal")

SALP: Single-Core Results

 [Figure: IPC improvement of SALP-1 (7%), SALP-2 (13%), MASA (17%), and "Ideal" more banks (20%)]
 SALP-1, SALP-2, MASA improve performance at low cost

Subarray-Level Parallelism: Results

 [Figure: MASA vs. baseline, normalized DRAM dynamic energy (-19%), row-buffer hit rate (+13%), DRAM die area increase (<0.15%)]
 MASA increases energy efficiency

New DRAM Architectures

 RAIDR: Reducing Refresh Impact
 TL-DRAM: Reducing DRAM Latency
 SALP: Reducing Bank Conflict Impact
 RowClone: Fast Bulk Data Copy and Initialization

RowClone: Fast Bulk Data Copy and Initialization

Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry, "RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data" CMU Computer Science Technical Report, CMU-CS-13-108, Carnegie Mellon University, April 2013.


Today’s Memory: Bulk Data Copy

 [Figure: data copied through the CPU, L1/L2/L3 caches, and memory controller]
 1) High latency
 2) High bandwidth utilization
 3) Cache pollution
 4) Unwanted data movement

Future: RowClone (In-Memory Copy)

 [Figure: data copied inside memory, bypassing the CPU and caches]
 1) Low latency
 2) Low bandwidth utilization
 3) No cache pollution
 4) No unwanted data movement

 Seshadri et al., “RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data,” CMU Tech Report 2013.

4 Kbits Initialization) 4 Kbits

1. Activate row 1. Activate row A

3. Activate row B 2. DRAM array 2. Transfer Transfer DRAM array row row

4. Transfer row

Row Buffer (4 Kbits) Row Buffer (4 Kbits)

3. Transfer Data pins (8 bits) byte onto bus Data pins (8 bits)

Memory Bus Memory Bus

RowClone: Key Idea RowClone: Intra-subarray Copy

 DRAM banks contain

1. Mutiple rows of DRAM cells – row = 8KB 2. A row buffer shared by the DRAM rows 0 1 0 0 1 1 0 0 0 1 1 0 src

 Large scale copy ?10 ?1 ?0 ?10 ?01 ?1 ?10 ?10 ?0 ?01 ?1 ?10 dst 1. Copy data from source row to row buffer 2. Copy data from row buffer to destination row Sense Amplifiers 0 1 0 0 1 1 0 0 0 1 1 0 Can be accomplished by two consecutive ACTIVATEs (row buffer) (if source and destination rows are in the same Deactivate subarray) Activate (src) Activate (dst) (our proposal)

RowClone: Inter-bank Copy

 [Figure: src row read out of one bank and written to the dst row in another bank over the internal I/O bus]
 Transfer: Read src, Write dst (our proposal)

RowClone: Inter-subarray Copy

 [Figure: copy through a temporary row (temp) in another bank]
 1. Transfer (src to temp)
 2. Transfer (temp to dst)
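To summarize the three copy modes above as controller command sequences, here is an illustrative C sketch. The command names and helper functions are assumptions made for this sketch, not an interface defined by the RowClone work; they are stubbed so the code compiles and prints the sequence it would issue.

```c
#include <stdio.h>

/* Hypothetical controller primitives, stubbed so the sketch compiles. */
static void cmd_precharge(int bank)               { printf("PRE  bank=%d\n", bank); }
static void cmd_activate(int bank, int row)       { printf("ACT  bank=%d row=%d\n", bank, row); }
static void cmd_internal_read(int bank, int row)  { printf("RD   bank=%d row=%d -> internal bus\n", bank, row); }
static void cmd_internal_write(int bank, int row) { printf("WR   internal bus -> bank=%d row=%d\n", bank, row); }

/* Intra-subarray copy: two back-to-back ACTIVATEs move the source row
 * through the shared row buffer (sense amplifiers) into the destination row. */
static void rowclone_intra_subarray(int bank, int src_row, int dst_row)
{
    cmd_precharge(bank);           /* deactivate the bank        */
    cmd_activate(bank, src_row);   /* src row -> row buffer      */
    cmd_activate(bank, dst_row);   /* row buffer -> dst row      */
}

/* Inter-bank copy: move the row over the chip-internal I/O bus,
 * without crossing the memory channel. */
static void rowclone_inter_bank(int src_bank, int src_row, int dst_bank, int dst_row)
{
    cmd_internal_read(src_bank, src_row);
    cmd_internal_write(dst_bank, dst_row);
}

/* Inter-subarray copy: bounce through a temporary row in another bank. */
static void rowclone_inter_subarray(int bank, int src_row, int dst_row,
                                    int temp_bank, int temp_row)
{
    rowclone_inter_bank(bank, src_row, temp_bank, temp_row);
    rowclone_inter_bank(temp_bank, temp_row, bank, dst_row);
}

int main(void)
{
    rowclone_intra_subarray(0, 17, 42);
    rowclone_inter_subarray(0, 17, 42, 1, 0);
    return 0;
}
```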

Fast Row Initialization

 [Figure: a row fixed at zero used as the copy source for bulk initialization]
 Fix a row at Zero (0.5% loss in capacity)

RowClone: Latency and Energy Savings

 [Figure: normalized latency and energy of baseline, intra-subarray, inter-bank, and inter-subarray copy; intra-subarray copy reduces latency by 11.5x and energy by 74x]
 Seshadri et al., “RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data,” CMU Tech Report 2013.

RowClone: Overall Performance

 [Figure: system performance improvement with RowClone]

Summary

 Major problems with DRAM scaling and design: high refresh rate, high latency, low parallelism, bulk data movement

 Four new DRAM designs
  RAIDR: Reduces refresh impact
  TL-DRAM: Reduces DRAM latency at low cost

 SALP: Improves DRAM parallelism

 RowClone: Reduces energy and performance impact of bulk data copy

 All four designs

 Improve both performance and energy consumption

 Are low cost (low DRAM area overhead)
  Enable new degrees of freedom to software & controllers

 Rethinking DRAM interface and design essential for scaling

 Co-design DRAM with the rest of the system

Computer Architecture: Main Memory (Part III)
Prof. Onur Mutlu
Carnegie Mellon University