DATA PLACEMENT FOR EFFICIENT MAIN MEMORY ACCESS

by

Kshitij Sudan

A dissertation submitted to the faculty of The University of Utah in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in

Computer Science

School of Computing

The University of Utah

May 2013

Copyright © Kshitij Sudan 2013

All Rights Reserved

The University of Utah Graduate School

STATEMENT OF DISSERTATION APPROVAL

The dissertation of Kshitij Sudan has been approved by the following supervisory committee members:

Rajeev Balasubramonian, Chair        10/05/2012  Date Approved

John B. Carter, Member               10/24/2012  Date Approved

Alan L. Davis, Member                10/25/2012  Date Approved

Ashutosh Dhodapkar, Member                       Date Approved

Mary W. Hall, Member                 10/26/2012  Date Approved

and by Alan L. Davis, Chair of the School of Computing, and by Donna M. White, Interim Dean of The Graduate School.

ABSTRACT

The main memory system is a critical component of modern computer systems. Dynamic Random Access Memory (DRAM) based memory designs dominate the industry due to mature device technology and low cost. These designs, however, face several challenges moving forward. These challenges arise due to legacy DRAM device design choices, advances in Central Processing Unit (CPU) design, and the demand for higher memory throughput and capacity from applications. Due to the cost-sensitive nature of the DRAM industry, changes to the device architecture face significant challenges for adoption. There is thus a need to improve memory system designs, ideally without changing the DRAM device architectures. This dissertation addresses the challenges faced by DRAM memory systems by leveraging data management. Historically, data management/placement and its interaction with the memory’s hardware characteristics have been abstracted away at the system software level. In this dissertation, we describe mechanisms that leverage data placement at the operating system level to improve memory access latency, power/energy efficiency, and capacity. An important advantage of using these schemes is that they require no changes to the DRAM devices and only minor changes to the memory controller hardware. The majority of the changes are limited to the operating system. This thesis also explores data management mechanisms for future memory systems built using new 3D stacked DRAM devices and point-to-point interconnects. Using the schemes described here, we show improvements in various DRAM metrics. We improve DRAM row-buffer hit rates by co-locating parts of different Operating System (OS) pages in the same row-buffer. This improves performance by 9% and reduces energy consumption by 15%. We also improve page placement to increase opportunities for power-down. This enables a three-fold increase in memory capacity for a given memory power budget. We also show that page placement is an important ingredient in building efficient networks of memories with 3D-stacked memory devices. We report a performance improvement of 49% and an energy reduction of 42% with a design that optimizes page placement and network topology.

To my parents.

CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS

CHAPTERS

1. INTRODUCTION
   1.1 Emerging Memory System Challenges
   1.2 Data Placement to Improve Memory Accesses
   1.3 Thesis Statement
   1.4 Thesis Organization

2. DRAM BACKGROUND
   2.1 DRAM Basics
   2.2 DRAM Memory Channel Designs
   2.3 3D Stacked Memory Devices

3. MICRO-PAGES
   3.1 Introduction
   3.2 Motivation
       3.2.1 Baseline Memory Organization
       3.2.2 Baseline DRAM Addressing
       3.2.3 Motivational Results
   3.3 Proposed Mechanisms
       3.3.1 Reducing OS Page Size (ROPS)
       3.3.2 Hardware Assisted Migration (HAM)
   3.4 Results
       3.4.1 Methodology
       3.4.2 Evaluation
   3.5 Related Work
   3.6 Summary

4. TIERED MEMORY
   4.1 Introduction
       4.1.1 Memory Power Wall
   4.2 An Iso-Powered Tiered Memory Design
   4.3 Memory Access Characteristics of Workloads
   4.4 Implementation
       4.4.1 Implementing Tiered Memory in a DRAM Memory System
   4.5 Results
       4.5.1 Evaluation Metric
       4.5.2 Experimental Infrastructure and Methodology
       4.5.3 Evaluation
   4.6 Discussion
       4.6.1 Contention for Shared Resources Among VMs
       4.6.2 Iso-power Validation
       4.6.3 Supporting Additional Memory
   4.7 Related Work
   4.8 Summary

5. INTELLIGENT NETWORK OF MEMORIES
   5.1 Introduction
   5.2 Intelligent Network of Memories (iNoM)
       5.2.1 Channel Topology
       5.2.2 OS Support for Data Allocation
       5.2.3 Supporting Optimizations
   5.3 Methodology and Results
       5.3.1 Methodology
       5.3.2 Results
   5.4 Related Work
   5.5 Summary

6. CONCLUSIONS AND FUTURE WORK
   6.1 Conclusions
   6.2 Future Work
       6.2.1 High-Performance Memory Systems
       6.2.2 Low-Power Memory Systems
       6.2.3 Balanced Memory Systems

REFERENCES

LIST OF FIGURES

2.1 Typical DRAM Main Memory Organization and Data Layout
3.1 Row-buffer Hit-Rates for 1- and 4-Core Configurations
3.2 Baseline DRAM Address Mapping
3.3 sphinx3
3.4 gemsFDTD
3.5 Re-Mapped DRAM Addressing
3.6 Accesses to Micro-Pages in Reserved DRAM Capacity
3.7 Performance Improvement and Change in TLB Hit-Rates (w.r.t. Baseline) for ROPS with 5 Million Cycle Epoch Length
3.8 Energy Savings (E ∗ D^2) for ROPS with 5 Million Cycle Epoch Length
3.9 Profile, HAM, and ROPS - Performance Improvement for 5 Million Cycle Epoch Length
3.10 Profile, HAM, and ROPS - Energy Savings (E ∗ D^2) for 5 Million Cycle Epoch Length
3.11 Profile, HAM, and ROPS - Performance Improvement for 10 Million Cycle Epoch Length
3.12 Profile, HAM, and ROPS - Energy Savings (E ∗ D^2) Compared to Baseline for 10 Million Cycle Epoch
3.13 Profile, HAM, and ROPS - Performance Improvements for 10M Cycle Epoch Length, and 1 Billion Execution Cycles
4.1 Cumulative Access Distribution of DRAM Memory Requests for Different Workloads
4.2 Page Access Distribution at 4 KB Granularity Measured on Our Simulator
4.3 Iso-Powered Tiered Memory Capacity Relative to an 8-rank Conventional System
4.4 Cold Tier Rank Mode Changes with Request Arrival
4.5 Behavior of libquantum on a Tiered Memory System
4.6 Aggregate Performance for Baseline, and 3X-capacity Configurations
4.7 Peak Aggregate Performance for 2X- and 3X-capacity Memory Configurations
4.8 Memory Behavior of Benchmark Applications
4.9 Impact of Limiting Number of Migrations per Epoch
5.1 The Intel Scalable Memory Interface and Buffer
5.2 Daisy Chain, Mesh, and Binary Tree Topologies
5.3 Fat Tree, and Tapered Tree Topologies
5.4 Application Execution Time Normalized w.r.t. Daisy Chain Topology with Random Allocation Policy
5.5 Memory Access Delay
5.6 Request Distribution
5.7 Memory Access with Relaxed Thresholds for Data Migration
5.8 Energy Consumption
5.9 Impact of Low-Power Modes on Energy per Bit and Execution Time

LIST OF TABLES

3.1 Simulator Parameters
3.2 L2 Cache Hit-Rates and Benchmark Input Sets
4.1 Simulator Parameters
5.1 Topology Configurations
5.2 Simulation Parameters

ACKNOWLEDGMENTS

First and foremost, I would like to thank Rajeev for being a great advisor. I have learned a lot from him, not just about computer architecture, but more importantly, about being a good researcher. I always admire his quick-thinking abilities and the ease with which he takes my jargon-heavy text (which also misses all the articles!), and transforms it into an easily readable and accessible text. I would also like to thank my committee members John Carter, Al Davis, Ashutosh Dhodapkar, and Mary Hall. I have had the pleasure of working extensively with all my committee members. John was my mentor when I started graduate school, and then again while I interned at IBM-Austin. I want to thank him for all the opportunities and for helping me understand the “big-picture” view of systems research. Al has given me wonderful feedback on my projects and I’ve always learned something new in every conversation with him. Ashutosh mentored me when I interned at SeaMicro and Violin Memory. He is primarily the reason I got interested in ways to optimize datacenter platforms, and he taught me a lot about systems engineering. Mary was instrumental in helping me understand the challenges of hardware-software co-optimizations. I would like to thank Karthick Rajamani for being an awesome mentor while I interned at IBM-Austin. I am also thankful to Wei Huang and Freeman Rawson for their patience and all the help while I was at IBM. I want to thank Sadagopan Srinivasan and Ravi Iyer for being great collaborators. Dave Nellans not only was a lab-rat with me at Utah, but he also mentored me during my internship at Fusion-io. I had a great time working with him and thank him for always listening to me patiently. Saisantosh Balakrishnan was not only a collaborator during my internships at SeaMicro and Violin, but also a good friend. He taught me a great deal about good coffee, and thanks to him, I now run on quality caffeine! My colleagues at the Utah-Arch group were simply superb, and five years with them went by in a blink. I want to thank Niti, Karthik, Vivek, Manu, Aniruddha, Niladrish, Seth, Manju, Aasheesh, and Ali for all the great discussions in the lab, and their friendship. I owe a big thanks to my friends - Rahul, Abhishek, Guarav, and Shekhar - for patiently listening to my grad school gripes. I also want to thank Chelsea for her great company and great times over the past five years. Finally, I would like to thank my parents and my brother. I would not have been able to achieve anything without them being by my side all along. Without your inspiration and encouragement, I would not have been the person I am today.

CHAPTER 1

INTRODUCTION

1.1 Emerging Memory System Challenges

Main memory access is a primary bottleneck in modern computing systems. Memory access latency directly impacts system performance and therefore is a critical design parameter. While access latency has been a dominant design parameter for multiple decades, memory capacity and power/energy consumption have lately emerged as first-order design goals as well. Memory systems are typically built using Dynamic Random Access Memory (DRAM) technology. The Joint Electron Devices Engineering Council (JEDEC) [63] standards for main memory describe a system-level design using commodity DRAM devices. The DRAM devices themselves continue to improve with each manufacturing process generation, following Moore’s Law. However, system-level design choices limit the ability to reduce memory system latency, increase capacity, and reduce energy consumption at the same rate. The reasons for this slower rate of improvement are described next.

• Memory Access Latency - DRAM access latency depends on the DRAM device architecture, the memory channel frequency, and the memory access pattern. DRAM device access latency has not changed significantly over the past three generations of JEDEC standards because these latencies are limited by delays for global wires, which have not reduced. The data rate (and hence the total system bandwidth), however, has increased significantly to feed the increased memory traffic from multicore Central Processing Units (CPUs). Multicore CPUs, however, have significantly changed the memory access pattern. Each core on the CPU socket generates an independent stream of memory traffic, and these independent streams end up being intertwined at the memory controller. This leads to a loss of locality in the aggregate DRAM access pattern, resulting in more row-buffer conflicts and thus higher access latency. As a consequence, a large number of commercial servers end up using a close-page row-buffer policy because an open-page policy no longer yields a significant latency reduction.

• Capacity and Bandwidth - The growth in memory capacity and bandwidth is primarily limited by power consumption and poor signal integrity due to high channel frequency. The contribution of the memory system to overall system power has increased steadily. It is well known that the memory system alone can consume up to a third of the total system power in modern servers [76]. To increase capacity further would mean increasing the memory power budget, which might not be feasible due to limits on the power delivery and cooling networks for the servers. The memory channel frequency also has a direct impact on capacity. Channel frequency has increased with every generation of JEDEC standards to increase the system bandwidth. This, however, comes at the cost of reduced capacity per channel due to the poor signal integrity associated with high-frequency, multidrop bus-based channels. At data rates of 400 MHz, a memory channel could support up to four Dual In-Line Memory Modules (DIMMs) per channel; however, at 1066 MHz, only 2 DIMMs per Channel (DPC) can be achieved. This severely limits the capacity of the memory system. Note that with every process technology generation, the DRAM die capacity increases, but with reduced DPC, increasing demand from applications, and more cores per CPU socket, the memory system becomes more capacity and bandwidth constrained.

• Energy Consumption - Energy consumed per memory access quantifies the memory access efficiency. Energy is consumed by the DRAM core access, the peripheral circuitry on the DRAM device, and finally the energy needed to drive the channel. With every generation of JEDEC standards, the memory access efficiency has been decreasing. The access granularity (the granularity at which the CPU accesses the memory) has remained constant at 64 bytes. However, with increasing DRAM capacity, the DRAM devices have increased the number of DRAM columns and rows. Each access therefore reads a large chunk of data from the DRAM cells, a very small fraction of which is used while servicing a single memory request. The resulting over-fetch of data increases the energy per request. Typically, for every 64 byte request from the CPU, the DRAM row-buffers fetch up to 8 KB of data. This implies each row-buffer fetch has a utilization of less than 1% for a single memory request with the closed-page row-buffer policy. This over-fetch not only adds to access latency, but also wastes energy.
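Concretely, using the figures above (a 64 byte CPU request against an 8 KB row-buffer fetch), the utilization works out to:

    utilization per request = 64 B / 8 KB = 64 / 8192 ≈ 0.78%, i.e., less than 1%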

1.2 Data Placement to Improve Memory Accesses

This dissertation looks at how data placement can help improve memory access latency, power/energy consumption, bandwidth, and capacity. Data placement has often been ignored as a tool that can be used to make memory accesses more efficient. This is perhaps an artifact of the fact that data placement is typically closely tied to, or implemented at, the system software level. For example, the Operating System (OS) is responsible for allocating frames in the main memory to pages being read from the disk. This is implemented as part of the virtual memory subsystem of an OS, where the priority is to optimize memory capacity utilization. The hardware aspects of the memory access are often abstracted away at this level and as a result, OS designers do not necessarily optimize for efficient hardware access. We next discuss three data placement mechanisms that aim at improving memory access latency, power/energy, bandwidth, and capacity.

• Improving Access Latency and Energy - With growing CPU core counts per socket, DRAM-based memory subsystems are starting to show the negative impact of legacy hardware design choices. Current designs were based on assumptions such as relatively stable memory access patterns, i.e., a single core exhibits temporal and spatial locality in its DRAM access pattern. As a result, large DRAM row-buffers along with an open-page policy were suitable for single-core CPUs. This design choice is now emerging as a bottleneck because it takes a considerable amount of energy and time to read these large row-buffers. With multicore CPUs, the open-page policy is not able to exploit locality in the data access pattern of a single thread due to interfering requests from other cores. In some cases, data have been mapped to DRAM such that they are striped at cache line granularity to improve available bandwidth. This mapping leads to poor row-buffer utilization due to conflicting requests. Intelligent data placement helps improve row-buffer reuse, leading to reduced access time and reduced energy consumption. This is achieved by coalescing frequently used data from the most accessed OS pages into the same DRAM row-buffer. This coalescing increases row-buffer hit rates even with the increasingly interfering memory request streams of multicore CPUs.

• Increasing Capacity in a Fixed Memory Power Budget - The fraction of total system power that is consumed by the main memory is increasing with every generation because of higher signaling rates and more capacity per system. However, high-end systems, especially those used for workload consolidation, require even more memory than is currently available at nominal power budgets. At the same time, the various low-power DRAM modes that are supported by JEDEC-based DRAM memories are hard to exploit. These power modes are applied at a coarse rank-level granularity. Since there is a latency overhead for transitioning in or out of these modes, it is difficult to leverage them for significant power savings because of current data placement choices and the resulting access patterns. Frequently accessed data are often striped across multiple ranks, thus making it very hard to leverage deep low-power modes.

• Data Placement for Future Memory Systems - With each technology generation, memory channels are being clocked at higher rates to improve aggregate memory system throughput. This trend, however, is not sustainable due to signal integrity issues and limits on power consumption. To overcome these challenges, serial, point-to-point links are seen as the next step in implementing the CPU-memory interconnect. Serial, point-to-point channels can be used to construct a network of memories on a motherboard. Such a network will offer nonuniform latency and energy when accessing different pages. Smart data placement policies can help improve access latency by bringing frequently accessed data to memory devices that are reachable within a few hops.

1.3 Thesis Statement

This dissertation tests the hypothesis that intelligent data placement policies can improve memory access time and energy; it presents three different ideas and settings where careful data placement improves memory behavior by optimizing row-buffer hit rates, rank idle periods, or network hop counts.

1.4 Thesis Organization

This dissertation is organized into six chapters. Chapter 2 provides background by describing DRAM-based memory systems in detail. Each subsequent chapter describes mechanisms to improve memory access latency, power/energy consumption, and capacity by leveraging data management techniques. Chapter 3 describes techniques to consolidate frequently used data in the same DRAM row-buffers, leading to reduced latency and increased energy efficiency for each access. Chapter 4 describes schemes to consolidate active data in a few DRAM ranks, thereby providing an opportunity to increase capacity in a fixed power budget by powering down rarely used ranks. Chapter 5 describes a memory system design that builds a network-of-memories to increase capacity. The access latency in such systems depends on the location of data (whether it is a few hops away on the network, or many hops away). Data placement is leveraged in this design to bring frequently accessed data closer to the CPU, leading to faster accesses and opportunities to power down distant memory nodes. Chapter 6 summarizes the conclusions and describes future work.

CHAPTER 2

DRAM BACKGROUND

In this chapter, we describe the organization and operation of modern DRAM-based main memory systems built following the JEDEC memory standard. We also describe the Input/Output (I/O) link technology used for typical DRAM channel designs.

2.1 DRAM Basics

Modern Double Data Rate (DDR) systems [59] are organized as modules (DIMMs), composed of multiple devices, each of which is an individually packaged integrated circuit. The DIMMs are connected to the memory controller via a data bus and other command and control networks (Figure 2.1). DRAM accesses are initiated by the CPU requesting a cache line worth of data in the event of a last-level cache miss. The request shows up at the memory controller, which converts the request into carefully orchestrated commands for DRAM access. Modern memory controllers are also responsible for scheduling memory requests according to policies designed to reduce access times for a request. Using the physical address of the request, the memory controller typically first selects the channel, and then a rank. Within a rank, DRAM devices work in unison to return as many bits of data as the width of the channel connecting the memory controller and the DIMM. Each DRAM device is arranged as multiple banks of mats - a grid-like array of cells. Each cell comprises a transistor-capacitor pair to store a bit of data. These cells are logically addressable with a row and column address pair. Accesses within a device begin with first selecting a bank and then a row. This reads an entire row of bits (whose address is specified using the Row Access Strobe (RAS) command on the command bus) into per-bank sense amplifiers and latches that serve as the row-buffer. Then, a Column Access Strobe (CAS) command and the column address select a column from this buffer. The selected bits from each device are then aggregated at the bank level and sent to the memory controller over the data bus. A critical fact to note here is that once a RAS command is issued, the row-buffer holds a large number of bits that have been read from the DRAM cells.

Figure 2.1: Typical DRAM Main Memory Organization and Data Layout.

For the DDR2 memory system that we model for our experiments, the number of bits in a per-bank row-buffer is 64 Kbits, which is typical of modern memory configurations. To access a 64 byte cache line, a row-buffer is activated and 64 Kbits are read from the DRAM cells. Therefore, less than 1% of the data in a bank’s row-buffer is actually used to service one access request. On every access, a RAS activates word-lines in the relevant mats, and the contents of the cells in the selected row are driven onto the bit-lines to the sense amplifiers, where they are latched. The set of latches is the row-buffer. These actions are collectively referred to as the activation of the row-buffer. This read operation is a destructive process and data in the row-buffer must be written back after the access is complete. This write-back can be on the critical path of an access to a new row. The activation of bit-lines across several DRAM devices is the biggest contributor to DRAM power. To reduce the delays and energy involved in row-buffer activation, the memory controller adopts one of the following row-buffer management policies:

• Open-page policy: The bit-lines for the row are precharged only after a request to a different row is encountered.

• Close-page policy: The bit-lines for the row are precharged immediately after servicing a cache line request.

The open-page policy is based on an optimistic assumption that some accesses in the near future will be to this open page - this amortizes the mat read energy and latency across multiple accesses. State-of-the-art DRAM address mapping policies [127, 80] try to map data such that there are as few row-buffer conflicts as possible. Page-interleaved schemes map contiguous physical addresses to the same row in the same bank. This allows the open-page policy to leverage the spatial and temporal locality of accesses. Cache-line interleaved mapping is used with multichannel DRAM systems, with consecutive cache lines mapped to different rows/banks/channels to allow maximum overlap in servicing requests. On the other hand, the rationale behind the close-page policy is driven by the assumption that no subsequent accesses will be to the same row. It is focused on hiding the latency of bit-line precharge when subsequent requests are to different rows, and is best suited for memory access streams that show little locality - like those in systems with high core counts.
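To make the two policies concrete, the following minimal sketch (Python; the bank abstraction and function names are ours and purely illustrative, not part of any real controller) shows the command sequence a controller model would issue under each policy.

    class Bank:
        """Tracks which DRAM row (if any) is latched in this bank's row-buffer."""
        def __init__(self):
            self.open_row = None   # None means the bank is precharged (row-buffer closed)

    def access(bank, row, policy):
        """Service one request to `row`; return the DRAM commands issued.
        policy is 'open' (open-page) or 'close' (close-page)."""
        cmds = []
        if bank.open_row is None:
            cmds += ["ACT(row=%d)" % row, "CAS"]            # activate, then column access
        elif bank.open_row == row:
            cmds += ["CAS"]                                  # row-buffer hit: column access only
        else:
            cmds += ["PRE", "ACT(row=%d)" % row, "CAS"]      # row-buffer conflict: precharge first
        if policy == "close":
            cmds += ["PRE"]                                  # close-page: precharge right after the access
            bank.open_row = None
        else:
            bank.open_row = row                              # open-page: leave the row latched
        return cmds

    if __name__ == "__main__":
        b = Bank()
        for r in (7, 7, 12):
            print(access(b, r, policy="open"))   # second access to row 7 is a row-buffer hit

Under the open-page policy the second access to row 7 above needs only a CAS; under the close-page policy every access pays the activation cost but never a conflict precharge on the critical path.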

2.2 DRAM Memory Channel Designs

A typical memory system is composed of multiple components. The Memory Controllers (MCs), which are implemented on the CPU die, control all memory accesses. MCs are connected via CPU socket pins to the memory bus, which on the other end is connected to the memory modules. The predominant choice for the bus connecting the CPU to the memory modules is a multidrop bus [29]. Although this design is simple and economical, it has many disadvantages. Issues like signal integrity, electrical loading, skew, jitter, etc. limit its usefulness for high-performance, high-capacity memory systems [93, 59]. An alternative to the multidrop bus is the Rambus memory design [31, 98]. The Rambus memory channel is a packet-based channel designed for a high bandwidth interface between the CPU and the DRAM devices. The Rambus designs not only target high bandwidth, but also ensure that high bandwidth is sustainable with a variable number of DRAM devices connected to the same bus. This allows building systems with upgradeable capacity, while achieving high data transfer rates. Rambus’ designs, however, involve changes to the DRAM device architecture [59], and are thus limited to niche markets due to higher cost. In the last decade, the Fully Buffered Dual In-Line Memory Module (FB-DIMM) [46] architecture was invented to counter the shortcomings of JEDEC memory systems, while being minimally invasive on DRAM device design. The FB-DIMM design features a serial, point-to-point link-based channel. This allows adding FB-DIMM modules on the same channel in a daisy-chain fashion to increase capacity without loss of signal integrity. While it was a step in the right direction, several shortcomings of the design prevented it from replacing the JEDEC designs. Most notable were the power consumption and increased access latency for memory requests. Increased latency occurred due to two factors: (1) serialization/de-serialization delay on the channel, and (2) when DIMMs were added to a channel, the memory controller scheduled all requests for the worst-case channel latency. This model was adopted to keep the controller complexity low. The use of a serial channel for the CPU-memory interface is again starting to gain traction, primarily due to its ability to support higher capacity and bandwidth within a limited CPU pin count. Intel’s Scalable Memory Interface (SMI) design uses a buffer chip placed on the server motherboard that acts as an interface between the CPU’s serial channel and JEDEC-based commodity DIMMs [58]. IBM’s Power7 CPU also has a custom CPU-to-memory interface that uses a serial link from the CPU to the buffer chip on the motherboard [121, 104, 20]. These buffer chips provide features like high channel bandwidth, buffering, etc. while allowing the use of commodity DRAM parts [29]. These custom designs echo aspects of FB-DIMM memory, but alleviate some of the FB-DIMM problems by limiting the number of DIMMs connected to a single processor link. Several systems use serial, point-to-point (P2P) high-speed links with differential signaling to achieve fast data rates [58, 104, 98]. Multiple loads on a bus increase signal reflections, adding noise on the bus and thus reducing the signal-to-noise ratio. This noise is difficult to eliminate and it ultimately limits the memory capacity per channel because it restricts the number of DIMMs that can be loaded onto a channel.
P2P links, however, do not suffer from these issues because there is only one reflection of the signal on the channel, which can be terminated relatively easily without slowing down the bus. The use of differential signaling for higher data rates and low power, however, incurs a pin count penalty. Differential signaling uses two pins per bit, compared to one pin per bit for buses. In spite of this pin count penalty, these P2P links provide a higher pin bandwidth than what DRAM chip pins designed following the Double Data Rate-3 (DDR3) standard are capable of. Each link is kept narrow – this enables more links and more DIMMs at a given processor pin budget. In spite of the narrow link, the transfer time for a cache line is not very high because of the higher signaling rate. The link typically has separate upstream and downstream sublinks, with possibly different widths. Serial links also add additional delay and energy due to the serialization/de-serialization required to format large data packets before/after link transmission. Despite these shortcomings, P2P serial links are becoming the preferred memory channel technology, as evident from state-of-the-art memory system designs [29].

2.3 3D Stacked Memory Devices

A number of 3D stacked memory products have been announced in recent years [95, 112, 42]. Micron’s Hybrid Memory Cube (HMC) has attracted much attention and multiple companies are collaborating to develop HMC-related solutions [8]. The first-generation HMC prototype has four DRAM chips stacked upon a logic layer. The DRAM chips are organized into 128 independent banks, offering very high parallelism [61]. The logic layer and its SerDes circuits account for a large fraction of HMC power [102]. These SerDes circuits drive 32-bit links (16 bits for transmit and 16 bits for receive) and each HMC offers four links [95]. A link can connect multiple HMCs to form a network of memories. Although the internal details of the HMC logic layer are not publicly available, it is safe to assume that the logic layer can accommodate the circuits required to implement typical routing functionality for this network. The HMC package integrates an entire DIMM and its buffer chip into a single package. While each HMC is itself rated at a nontrivial 11 W [102], this approach is more efficient and scalable than the SMI approach because there is no protocol conversion and no DDR3 I/O circuitry within the HMC. The HMC pipeline is assumed to be as follows. When a request arrives, it must first be de-serialized. The incoming packet is examined and placed in a DRAM Access Queue if it is intended for this HMC. If the packet must be forwarded, it is buffered and made to go through a regular 3-stage router pipeline [96] that includes routing computation, switch allocation, and switch traversal. The packet must go through serialization logic before being placed on the output link. Requests in the DRAM Access Queue are issued as the banks become available, preserving a First-In-First-Out (FIFO) order as far as possible. In other words, we assume that the HMCs do not perform any intelligent scheduling and the processor retains control of mechanisms that enforce scheduling priority among threads. Since we target server systems with many threads, we assume a close-page policy for the memory banks. Once data are available (for a read), they are placed in a Completed Request Queue and injected into the router in FIFO order.
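Since the internal details of the HMC logic layer are not public, the following sketch (Python; the packet fields, function names, and four-link routing rule are our own assumptions) simply restates the pipeline assumed above: de-serialize, then either enqueue locally for FIFO service or forward through the router toward an output link.

    from collections import deque

    def deserialize(packet):
        # Stand-in for the SerDes logic: in this sketch a packet is just a dict.
        return dict(packet)

    def route(dest, num_links=4):
        # Illustrative routing computation: pick one of the (up to) four HMC links.
        return dest % num_links

    def handle_incoming(packet, my_id, dram_queue, router_queue):
        # De-serialize, then either queue locally (serviced in FIFO order as banks
        # free up) or push through the 3-stage router pipeline toward an output link.
        request = deserialize(packet)
        if request["dest"] == my_id:
            dram_queue.append(request)
        else:
            out_link = route(request["dest"])         # routing computation
            router_queue.append((out_link, request))  # switch allocation/traversal, then serialization

    dram_queue, router_queue = deque(), deque()
    handle_incoming({"dest": 2, "addr": 0x1000, "op": "read"}, my_id=0,
                    dram_queue=dram_queue, router_queue=router_queue)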

CHAPTER 3

MICRO-PAGES

This chapter describes data placement schemes that aim at improving both memory access delay and energy consumption. These schemes are targeted at traditional DDR3-based memory systems and leverage the data access pattern at a cache-line granularity to improve DRAM row-buffer utilization.

3.1 Introduction

Destructive interference among independent memory access streams from multicore CPUs, coupled with large over-fetch at the DRAM row-buffers, leads to low row-buffer hit rates and inefficient memory accesses. In this chapter, we describe schemes that operate within the parameters of existing DRAM device architectures and the JEDEC protocols while improving row-buffer hit rates and increasing memory access efficiency. Our approach stems from the observation that accesses to heavily referenced OS pages are clustered around a few cache blocks. This presents an opportunity to co-locate these clusters from different OS pages in a single row-buffer. This leads to a dense packing of heavily referenced blocks in a few DRAM row-buffers. The performance and power improvements due to our schemes come from improved row-buffer utilization, which leads to reduced energy consumption and access latencies for DRAM memory systems. This chapter describes a scheme to co-locate heavily accessed clusters of cache blocks by controlling the address mapping of OS pages to DRAM devices. Two implementations of this scheme are described here, which modify the address mapping by employing software and hardware techniques, respectively. The software technique relies on reducing the OS page size, while the hardware method employs a new level of indirection for physical addresses. These schemes can easily be implemented in the OS or the memory controller. Thus, the relative inflexibility of device architectures and signaling standards does not preclude these innovations. Furthermore, the schemes also fit nicely with prior work on memory controller scheduling policies, thus allowing additive improvements.

3.2 Motivation

Traditional uniprocessor systems perform well with an open-page policy since the memory request stream being generated follows temporal and spatial locality. This locality makes subsequent requests more likely to be served by the open page. However, with multicore systems, the randomness of requests has made the open-page policy somewhat ineffective. This results in low row-buffer utilization. This drop in row-buffer utilization (hit-rate) is quantified in Figure 3.1 for a few applications. Results are presented for 4 multithreaded applications from the PARSEC [22] suite and one multiprogrammed workload mix of two applications each from the SPEC CPU2006 [51] and BioBench [12] suites (lbm, mcf, and fasta-dna, mummer, respectively). Multithreaded applications were first run with a single thread on a 1-core system and then with four threads on a 4-core CMP. Each application was run for 2 billion instructions, or to the end of the parallel section, whichever occurred first. Other simulation parameters are described in detail in Section 3.4. It was observed that the average utilization of row-buffers dropped in 4-core CMP systems when compared to a 1-core system. For multithreaded applications, the average row-buffer utilization dropped from nearly 30% in the 1-core setup to 18% for the 4-core setup. For the 4 single-threaded benchmarks in the experiment (lbm, mcf, fasta-dna, and mummer), the average utilization was 59% when each benchmark was run in isolation. It dropped to 7% for the multiprogrammed workload mix created from these same applications. This points towards the urgent need to address this problem since application performance is sensitive to DRAM access latency, and DRAM power in modern server systems can account for nearly 30% of total system power [17]. Before describing the data placement mechanisms, the baseline system configuration is described to facilitate the description of the mechanisms.

3.2.1 Baseline Memory Organization

For experimentally evaluating the schemes, a 32-bit system with 4 GB of total main memory is modeled. The 4 GB DRAM memory capacity is spread across 8 non-ECC, unbuffered DIMMs, as depicted in Figure 2.1. Micron MT47H64M8 [84] parts are used as the DRAM devices to build this memory system - this is a 512 Mbit, x8 part. Further details about this DRAM device and system set-up are in Section 3.4. Each row-buffer for this setup is 8 KB in size, and since there are 4 banks/device, there are 4 row-buffers per DIMM. The memory system is modeled following the DDR2-800 memory system specifications, with DRAM devices operating at 200 MHz, and a 64-bit wide data bus operating at 400 MHz.

Figure 3.1: Row-buffer Hit-Rates for 1- and 4-Core Configurations

For a 64-byte cache line, it takes 8 clock edges to transfer the line from the DIMMs to the memory controller.
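To see where the 8 clock edges come from (data is transferred on both edges of the 400 MHz bus clock; the figure below is raw burst time only and excludes activation, CAS latency, and queuing delays):

    64 B per cache line / 8 B per 64-bit transfer = 8 transfers, one per clock edge
    8 edges = 4 bus cycles at 400 MHz = 10 ns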

3.2.2 Baseline DRAM Addressing

Using the typical OS page size of 4 KB, the baseline page-interleaved data layout is as shown in Figure 3.2. Data from a page is spread across memory such that it resides in the same DIMM, same bank, and same row. This mapping is similar to that adopted by the Intel 845G Memory Controller Hub for this configuration [59, 118, 5], and affords simplicity in explaining our proposed schemes. We now describe how the bits of a physical address are interpreted by the DRAM system to map data across the storage cells. For a 32-bit physical address (Figure 3.2), the low order 3 bits are used as the byte address, bits 3 through 12 provide the column address, bits 13 and 14 denote the bank, bits 15 through 28 provide the row ID, and the most significant 3 bits indicate the DIMM ID. An OS page therefore spans all the devices on a DIMM, along the same row and bank in all the devices. For a 4 KB page size, a row-buffer in a DIMM holds two entire OS pages, and each device on the DIMM holds 512 bytes of that page’s data.
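The bit-field interpretation above can be summarized with a small decoding sketch (Python; the field boundaries follow the description above and Figure 3.2, while the function name and example addresses are ours):

    def decode_baseline_address(paddr):
        """Split a 32-bit physical address into the DRAM coordinates described above."""
        byte_in_column = paddr & 0x7              # bits 0-2:  byte address
        column         = (paddr >> 3)  & 0x3FF    # bits 3-12: column address
        bank           = (paddr >> 13) & 0x3      # bits 13-14: bank within the DIMM
        row            = (paddr >> 15) & 0x3FFF   # bits 15-28: row ID
        dimm           = (paddr >> 29) & 0x7      # bits 29-31: DIMM ID
        return {"dimm": dimm, "row": row, "bank": bank,
                "column": column, "byte": byte_in_column}

    # Two 4 KB-aligned addresses that differ only in bit 12 decode to the same
    # DIMM, bank, and row - consistent with one 8 KB row-buffer holding two OS pages.
    print(decode_baseline_address(0x12340000))
    print(decode_baseline_address(0x12340000 + 4096))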

3.2.3 Motivational Results

An interesting observation for DRAM memory accesses at OS page granularity is that for most applications, accesses in a given time interval are clustered around a few contiguous cache-line sized blocks in the most referenced OS pages. Figures 3.3 and 3.4 show this pattern for two SPEC CPU2006 applications. The X-axis shows the sixty-four 64 byte cache blocks in a 4 KB OS page, and the Z-axis shows the percent of total accesses to each block within that page. The Y-axis plots the most frequently accessed pages in sorted order. The data presented here include pages that account for approximately 25% of total DRAM requests served over a 2 billion instruction period. These experiments were run with 32 KB, 2-way split L1 I- and D-caches, and a 128 KB, 8-way L2 cache for a single-core setup.

Figure 3.2: Baseline DRAM Address Mapping

Figure 3.3: sphinx3

Figure 3.4: gemsFDTD

To make sure that the results were not unduly influenced by configuration parameters and not limited to certain applications, we varied all the parameters and simulated different benchmark applications from the PARSEC, SPEC, NPB [16], and BioBench benchmark suites. The execution interval was also varied over various phases of application execution, and cache sizes and associativity were increased to make sure that conflict misses were reduced. Figures 3.3 and 3.4 show 3D graphs for only two representative experiments because the rest of the workloads exhibited similar patterns. The accesses were always clustered around a few blocks in a page, and very few pages accounted for most accesses in a given interval (less than 1% of OS pages account for almost 1/4th of the total accesses in the 2 billion instruction interval; the exact figures for the shown benchmarks are: sphinx3 - 0.1%, gemsFDTD - 0.2%). These observations lead to the central theme of this work - co-locating clusters of contiguous blocks with similar access counts, from different OS pages, in a row-buffer to improve its utilization. In principle, one can imagine taking individual cache blocks and co-locating them in a row-buffer. However, such a granularity would be too fine to manage in a DRAM memory system and the overheads would be huge. For example, in a system with 64 B wide cache lines and 4 GB of DRAM memory, the scheme would need to track nearly 67 million blocks! To overcome these overheads, we instead focus on dealing with clusters of contiguous cache blocks. In this chapter, these clusters are referred to as “micro-pages”.

3.3 Proposed Mechanisms

A common aspect of all the schemes described in this chapter is the identification and subsequent co-location of frequently accessed micro-pages in row-buffers to increase row-buffer hit rates. This chapter describes two mechanisms to this effect - decreasing the OS page size and performing page migration for co-location, and hardware assisted migration of segments of a conventional OS page (hereafter referred to as a micro-page). As explained in Section 3.1, row-buffer hit rates can be increased by populating a row-buffer with those chunks of a page which are frequently accessed in the same window of execution. These chunks from different OS pages are referred to as “hot” micro-pages if they are frequently accessed during the same execution epoch. For all the schemes, a common mechanism to identify hot micro-pages at run-time is proposed. This mechanism relies on new counters implemented at the memory controller that keep track of accesses to micro-pages within different OS pages. For a 4 KB OS page size, a 1 KB micro-page size, and assuming the application memory footprint in a 50 million cycle epoch to be 512 KB (in Section 3.4 the results show that the footprint of hot micro-pages is actually lower than 512 KB for most applications), the total number of such counters required is 512. This is a small overhead in hardware, and since the update of these counters is not on the critical path while scheduling requests at the memory controller, it does not introduce any latency overhead. However, an associative look-up of these counters (for matching micro-page number) can be energy inefficient. To mitigate this, techniques like hash-based updating of counters can be adopted. Since very few micro-pages are touched in an epoch, the probability of hash collision is small. If the memory footprint of the application exceeds 512 KB, then some errors in the estimation of hot micro-pages can be expected. At the end of an epoch, an OS daemon inspects the counters and the history from the preceding epoch to rank micro-pages as “hot” based on their access counts. The daemon then selects the micro-pages that are suitable candidates for migration. Subsequently, the specifics of the proposed schemes take over the migrations and associated actions required to co-locate the hot micro-pages. By using epoch-based statistics collection, and evaluating the hotness of a page based on the preceding epoch’s history, this scheme implicitly uses both temporal and spatial locality of requests to identify hot micro-pages. The counter values and the preceding epoch’s history are preserved for each process across a context switch. This is a trivial modification to the OS’ context switching module that saves and restores application context. It is critical to have such a dynamic scheme so it can easily adapt to varying memory access patterns. Huang et al. [54] describe how memory access patterns are not only continuously changing within an application, but also across context switches, thus necessitating a dynamically adaptive scheme. By implementing this in software, we gain flexibility not afforded by hardware implementations.
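A minimal sketch of this bookkeeping (Python; the counter count follows the 512 KB / 1 KB calculation above, but the hash function, ranking rule, and epoch interface are illustrative assumptions rather than the exact hardware and daemon design):

    NUM_COUNTERS     = 512     # 512 KB assumed hot footprint / 1 KB micro-pages
    MICRO_PAGE_SHIFT = 10      # 1 KB micro-pages

    counters = [0] * NUM_COUNTERS   # per-epoch access counts kept at the memory controller
    last_hot = set()                # "hot" micro-pages from the preceding epoch

    def record_access(paddr):
        """Hash-indexed counter update, off the critical path of request scheduling."""
        micro_page = paddr >> MICRO_PAGE_SHIFT
        counters[hash(micro_page) % NUM_COUNTERS] += 1

    def end_of_epoch(touched_micro_pages, top_n=64):
        """OS-daemon view: rank micro-pages by this epoch's counts, breaking ties
        with membership in the previous epoch's hot set (one plausible combination)."""
        global last_hot
        ranked = sorted(touched_micro_pages,
                        key=lambda mp: (counters[hash(mp) % NUM_COUNTERS], mp in last_hot),
                        reverse=True)
        hot = set(ranked[:top_n])
        last_hot = hot
        for i in range(NUM_COUNTERS):   # reset counts for the next epoch
            counters[i] = 0
        return hot

    for addr in (0x1000, 0x1040, 0x2000, 0x1000):
        record_access(addr)
    print(end_of_epoch({0x1000 >> 10, 0x2000 >> 10}, top_n=1))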

3.3.1 Reducing OS Page Size (ROPS)

This section first describes the basic idea of co-locating hot micro-pages by reducing the OS page size. It then discusses the need to create superpages from micro-pages to reduce bookkeeping overhead and mitigate the reduced Translation Lookaside Buffer (TLB) reach due to the smaller page size. Basic Idea: In this scheme, the objective is to reduce the OS page size such that frequently accessed contiguous blocks are clustered together in the new reduced-size page (a micro-page). The hot micro-pages are then migrated using DRAM copy to co-locate frequently accessed micro-pages in the same row-buffer. The innovation here is to make the OS’ Virtual Address (VA) to Physical Address (PA) mapping cognizant of row-buffer utilization. This is achieved by modifying the original mapping assigned by the OS’ underlying physical memory allocation algorithm such that hot pages are mapped to physical addresses that are co-located in the same row-buffer. Every micro-page migration is accompanied by an associated TLB shoot-down and a change in its page table entry. Baseline Operation: To understand the scheme in detail, it is useful to first describe the sequence of events from the first time some data are accessed by an application. The application initially makes a request to the OS to allocate some amount of space in the physical memory. This request can be explicit via calls to malloc(), or implicit via compiler-indicated reservation on the stack. The OS in turn assigns a virtual to physical page mapping for this request by creating a new page table entry. Subsequently, when the data are first accessed by the application, a TLB miss fetches the appropriate entry from the page table. The data are then either fetched from the disk into the appropriate physical memory location via a Direct Memory Access (DMA) copy, or the OS could copy another page (copy-on-write), or allocate a new empty page. An important point to note from the perspective of our schemes is that in each of these three cases: 1) data fetch from the disk, 2) copy-on-write, and 3) allocation of an empty frame, the page table is accessed. The impact of this is discussed shortly. Overheads: For the schemes described in this chapter, a minor change to the above sequence of events is suggested. The idea is to reduce the page size across the entire system to 1 KB, and instead of allocating one page table entry the first time the request is made to allocate physical memory, the OS creates four page table entries (each equal to the page size of 1 KB). The scheme therefore leverages a “reservation-based” [111, 91] allocation approach for 1 KB base-pages, i.e., on first-touch, contiguous 1 KB virtual pages are allocated to contiguous physical pages. This ensures that the overhead required to move from a 1 KB base-page to a 4 KB superpage is only within the OS and does not require a DRAM page copy. These page table entries will therefore have contiguous virtual and physical addresses and will ensure easy creation of superpages later on. The negative effect of this change is that the page table size will increase substantially, at most by 4X compared to a page table for a 4 KB page size. However, since we will be creating superpages for most of the 1 KB micro-pages later (this will be discussed in more detail shortly), the page table size will shrink back to a size similar to that for a baseline 4 KB OS page size.
Note that in this scheme, 4 KB pages would only result from superpage creation, and the modifications required to the page table have been explored in the literature on superpage creation. The smaller page size also impacts TLB coverage and the TLB miss rate. The impact of this change, however, is minimal if most 1 KB pages are eventually coalesced into 4 KB pages. The choice of a 1 KB micro-page size was dictated by a trade-off analysis between TLB coverage and a manageable micro-page granularity. This resulted from an experiment performed to determine the benefit obtained from a smaller page size versus the performance degradation due to the drop in TLB coverage. For almost all applications, a 1 KB micro-page size offered a good trade-off. Reserved Space: Besides the above change to the OS’ memory allocator, the technique also reserves frames in the main memory that are never assigned to any application. These reserved frames belong to the first 16 rows in each bank of each DIMM and are used to co-locate hot micro-pages. The total capacity reserved for our setup is 4 MB, which is less than 0.5% of the total DRAM capacity. Results are presented later that show most of the hot micro-pages can be accommodated in these reserved frames. Actions per Epoch: After the above-described one-time actions of this scheme, the scheme identifies “hot” micro-pages every epoch using the OS daemon as described earlier. This daemon examines counters to determine the hot micro-pages that must be migrated. If deemed necessary, hot micro-pages are subsequently migrated by forcing a DRAM copy for each migrated micro-page. The OS’ page table entries are also changed to reflect this. To mitigate the impact of reduced TLB reach due to the smaller page size, superpages are created. Every contiguous group of four 1 KB pages that does not contain a migrated micro-page is promoted to a 4 KB superpage. Thus, the sequence of steps taken to co-locate the micro-pages in this scheme is as follows (a sketch of this per-epoch flow appears after the list):

• Look up the hardware counters in the memory controller and designate micro-pages as “hot” by combining this information with statistics for the previous epoch - performed by an OS daemon described earlier.

• To co-locate hot micro-pages, force DRAM copies causing migration of micro-pages. Then, update the page table entries to appropriate physical addresses for the migrated micro-pages.

• Finally, create as many superpages as possible.
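The following compact sketch ties the steps above together (Python; the page-table, TLB shoot-down, DRAM-copy, and superpage-promotion operations are stubbed out because they are OS- and hardware-specific, and the data structures are our own simplifications):

    RESERVED_SLOTS = 4096   # slots in the reserved rows (about 4 MB of 1 KB micro-pages)

    def rops_epoch(hot_micro_pages, page_table, placement, free_reserved_slots):
        """One ROPS epoch: migrate hot micro-pages into reserved rows, then coalesce superpages."""
        for mp in hot_micro_pages:
            if placement.get(mp) is not None or not free_reserved_slots:
                continue                                # already co-located, or reserved capacity exhausted
            slot = free_reserved_slots.pop()
            dram_copy(src=page_table[mp], dst=slot)     # stub: DRAM-to-DRAM copy of one 1 KB micro-page
            page_table[mp] = slot                       # remap the 1 KB page to the reserved row
            tlb_shootdown(mp)                           # stub: invalidate stale translations
            placement[mp] = slot
        # Promote every aligned group of four 1 KB pages with no migrated member to a 4 KB superpage.
        for base in {mp & ~0x3 for mp in page_table}:
            group = [base, base + 1, base + 2, base + 3]
            if all(g in page_table and placement.get(g) is None for g in group):
                promote_to_superpage(group)             # stub: merge the four page-table entries

    def dram_copy(src, dst): pass
    def tlb_shootdown(mp): pass
    def promote_to_superpage(group): pass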

Superpage Creation: The advantages that might be gained from enabling the OS to allocate physical addresses at a finer granularity may be offset by the penalty incurred due to reduced TLB reach. To mitigate these effects of higher TLB misses, we incorporate the creation of superpages [91, 101]. Superpage creation [91, 101, 109, 45] has been extensively studied in the past and we omit the details of those mechanisms here. Note that in our scheme, superpages can be created only with “cold” micro-pages since migrated hot micro-pages leave a “hole” in the 4 KB contiguous virtual and physical address spaces. Other restrictions for superpage creation are easily dealt with as discussed next. TLB Status Bits: We allocate contiguous physical addresses for four 1 KB micro-pages that were contiguous in the virtual address space. This allows us to create superpages from a set of four micro-pages which do not contain a hot micro-page that has been migrated. When a superpage is created, a few factors need to be taken into account, specifically the various status bits associated with each entry in the TLB (a small eligibility check is sketched after the list below).

• The included base-pages must have identical protection and access bits since the superpage entry in the TLB has only one field for these bits. This is not a problem because large regions in the virtual memory space typically have the same state of protection and access bits. This happens because programs usually have large seg- ments of densely populated virtual address spaces with similar access and protection bits.

• Base-pages also share the dirty-bit when mapped to a superpage. The dirty-bit for a superpage can be the logical OR of the dirty-bits of the individual base-pages. A minor inefficiency is introduced because an entire 4 KB page must be written back even when only one micro-page is dirty.

• The processor must support a wide range of superpage sizes (already common in modern processors), including having a larger field for the physical page number.
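The status-bit constraints above can be captured in a small eligibility check (Python; the entry fields are illustrative stand-ins for real TLB/page-table bits):

    def can_promote(base_pages):
        """Four contiguous 1 KB base-pages can share one superpage TLB entry only if
        their protection/access bits match; the superpage dirty bit is the OR of theirs."""
        settings = {(p["prot"], p["access"]) for p in base_pages}
        if len(settings) != 1:
            return None                          # mismatched protection/access bits: cannot promote
        return {"prot": base_pages[0]["prot"],
                "access": base_pages[0]["access"],
                "dirty": any(p["dirty"] for p in base_pages)}   # logical OR of dirty bits

    print(can_promote([{"prot": "rw", "access": 1, "dirty": d} for d in (0, 0, 1, 0)]))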

Schemes proposed by Swanson et al. [109] and Fang et al. [45] have also shown that it is possible to create superpages that are noncontiguous in physical address space and unaligned. We mention these schemes here but leave their incorporation as future work. The major overheads experienced by the ROPS scheme arise from two sources:

• DRAM copy of migrated pages and associated TLB shoot-down, and page table modifications.

• Reduction in TLB coverage.

In Section 3.4, we show that the DRAM migration overhead is not large because, on average, only a few new micro-pages are identified as hot and moved every epoch. As a result, the major overhead of DRAM copy is relatively small. However, the next scheme proposed below eliminates the above-mentioned overheads and is shown to perform better than ROPS, although the performance difference is not large.

3.3.2 Hardware Assisted Migration (HAM)

This scheme introduces a new layer of translation between the physical addresses assigned by the OS (and stored in the page table while allocating a new frame in main memory) and those used by the memory controller to access the DRAM devices. This translation keeps track of the new physical addresses of the hot micro-pages that are being migrated and allows migration without changing the OS page size or any page table entries. Since the OS page size is not changed, there is no drop in TLB coverage. Indirection: In this scheme, we propose looking up a table at the memory controller to determine the new address if the data have been migrated. We organize the table so it consumes acceptable energy and area, and these design choices are described subsequently. The address translation required by this scheme is usually not on the critical path of accesses, and the translated addresses are generated as shown in Figure 3.5. Memory requests usually wait in the memory controller queues for a long time before being serviced. The above translation can begin when the request is queued and the delay for translation can be easily hidden behind the long wait time. The notion of introducing a new level of indirection has been widely used in the past, for example, within memory controllers to aggregate distributed locations in the memory [24]. More recently, it has been used to control data placement in large last-level caches [14, 27, 50]. Similar to the ROPS scheme, the OS daemon is responsible for selecting hot micro-pages fit for co-location. Thereafter, this scheme performs a DRAM copy of “hot” micro-pages. However, instead of modifying the page table entries as in ROPS, this scheme involves populating a “Mapping Table” (MT) in the memory controller with mappings from the old to the new physical addresses for each migrated page. Request Handling: When a request (with a physical address assigned by the OS) arrives at the memory controller, it searches for the address in the MT (using certain bits of the address as described later). On a hit, the new address of the micro-page is used by the memory controller to issue the appropriate commands to retrieve the data from their new location - otherwise, the original address is used. The MT look-up happens the moment a request is added to the memory controller queue and does not extend the critical path in the common case because queuing delays at the memory controller are substantial.

Figure 3.5: Re-Mapped DRAM Addressing

Micro-page Migration: Every epoch, micro-pages are rated and selected for migration. The first 16 rows in each bank are reserved to hold the hot micro-pages. The OS’ VA to PA mapping scheme is modified to make sure that no page gets mapped to these reserved rows. The capacity lost due to this reservation (4 MB) is less than 0.5% of the total DRAM capacity, and in Section 3.4, we show that this capacity is sufficient to hold almost all hot micro-pages in a given epoch. At the end of each epoch, hot micro-pages are moved to one of the slots in the reserved rows. If a given micro-page is deemed hot, and is already placed in the reserved rows (from the previous epoch), we do not move it. Otherwise, a slot in the reserved rows is made empty by first moving the now “cold” micro-page back to its original address. The new hot micro-page is then brought to this empty slot. The original address of the cold page being replaced is derived from the contents of the corresponding entry of the MT. After the migration is complete, the MT is updated accordingly. This migration also implies that the portion of the original row from which a hot micro-page is migrated now has a “hole” in it. This does not pose any correctness issue in terms of data look-up for a migrated micro-page at its former location since an access to a migrated address will be redirected to its new location by the memory controller. Mapping Table (MT): The mapping table contains the mapping from the original OS-assigned physical address to the new address for each migrated micro-page. The size and organization of the mapping table can significantly affect the energy spent in the address translation process. We discuss the possible organizations for the table in this subsection.

The scheme reserves 16 rows of each bank in a DIMM to hold the hot micro-pages. The reservation of these rows implies that the total number of slots where a hot micro-page might be relocated is 4096 (8 DIMMs ∗ 4 banks/DIMM ∗ 16 rows/bank ∗ 8 micro-pages per row). The MT is designed as a banked fully-associative structure. It has 4096 entries, one for each slot reserved for a hot micro-page. Each entry stores the original physical address for the hot micro-page resident in that slot. The entry is populated when the micro-page is migrated. Since it takes 22 bits to identify each 1 KB micro-page in the 32-bit architecture, the MT requires a total storage capacity of 11 KB. On every memory request, the MT must be looked up to determine if the request must be redirected to a reserved slot. The original physical address must be compared against the addresses stored in all 4096 entries. The entry number that flags a hit is then used to construct the new address for the migrated micro-page. A couple of optimizations can be attempted to reduce the energy overhead of the associative search. First, one can restrict a hot micro-page to only be migrated to a reserved row in the same DIMM. Thus, only 1/8th of the MT must be searched for each look-up. Such a banked organization is assumed in our quantitative results. A second optimization can set a bit in the migrated page’s TLB entry. The MT is looked up only if this bit is set in the TLB entry corresponding to that request. This optimization is left as future work. In terms of the access latency of the associative search, note that the look-up occurs when a memory request is first added to the memory controller’s request queue. Since memory requests typically spend a considerable fraction of their access time in the queue, the associative lookup of the MT is not on the critical path. When a micro-page must be copied back from a reserved row to its original address, the MT is looked up with the ID (0 to 4095) of the reserved location, and the micro-page is copied into the original location saved in the corresponding MT entry.
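A sketch of the MT lookup and redirection described above (Python; a real implementation would be a banked CAM inside the memory controller, and the address-reconstruction helper is our own simplification under the baseline mapping of Section 3.2.2):

    SLOTS_PER_DIMM   = 512    # 4096 reserved slots / 8 DIMMs (banked, same-DIMM restriction)
    MICRO_PAGE_SHIFT = 10

    # mt[dimm][i] holds the original micro-page number of the hot micro-page in reserved slot i.
    mt = [[None] * SLOTS_PER_DIMM for _ in range(8)]

    def reserved_slot_address(dimm, slot):
        """Illustrative: slot addresses fall in the first 16 rows of every bank on `dimm`."""
        return (dimm << 29) | (slot << MICRO_PAGE_SHIFT)

    def translate(paddr):
        """Return the address the controller should actually access."""
        micro_page = paddr >> MICRO_PAGE_SHIFT
        offset     = paddr & ((1 << MICRO_PAGE_SHIFT) - 1)
        dimm       = (paddr >> 29) & 0x7          # search only this DIMM's bank of the MT
        for slot, original in enumerate(mt[dimm]):
            if original == micro_page:
                return reserved_slot_address(dimm, slot) | offset   # redirected to a reserved row
        return paddr                                                # not migrated: OS-assigned address

    mt[0][3] = 0x12345678 >> 10        # pretend this micro-page was migrated to slot 3 on DIMM 0
    print(hex(translate(0x12345678)))  # redirected into the reserved rows
    print(hex(translate(0x0ABCDE00)))  # not migrated: unchanged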

3.4 Results

3.4.1 Methodology

Our detailed memory system simulator is built upon the Virtutech Simics [83, 4] platform, and important parameters of the simulated system are shown in Table 3.1. Out-of-order timing is simulated using Simics' sample-micro-arch module, and the DRAM memory subsystem is modeled in detail using a modified version of Simics' trans-staller module. It closely follows the model described by Gries in [84]. The memory controller keeps track of each DIMM and the open rows in each bank. It schedules requests based on open-page and close-page policies.

Table 3.1: Simulator Parameters.

CMP Parameters
  ISA                                    UltraSPARC III ISA
  CMP Size and Core Frequency            4-core, 2 GHz
  Re-Order-Buffer                        64 entry
  Fetch, Dispatch, Execute, and Retire   Maximum 4 per cycle
  L1 I-cache                             32 KB/2-way, private, 1-cycle
  L1 D-cache                             32 KB/2-way, private, 1-cycle
  L2 Cache                               128 KB/8-way, shared, 10-cycle
  L1 and L2 Cache Line Size              64 Bytes
  Coherence Protocol                     Snooping MESI

DRAM Parameters
  DRAM Device Parameters        Micron MT47H64M8 DDR2-800,
                                timing parameters [84]: tCL = tRCD = tRP = 20 ns (4-4-4 @ 200 MHz),
                                4 banks/device, 16384 rows/bank, 512 columns/row, 8-bit output/device
  DIMM Configuration            8 non-ECC unbuffered DIMMs, 1 rank/DIMM, 64-bit channel, 8 devices/DIMM
  DIMM-Level Row-Buffer Size    8 KB
  Active Row-Buffers per DIMM   4
  Total DRAM Capacity           512 Mbit/device × 8 devices/DIMM × 8 DIMMs = 4 GB

To keep the memory controller model simple, optimizations that do not directly affect the schemes described in this chapter, like finite queue length, the critical-word-first optimization, and support for prioritizing reads, are not modeled. Other major components of Gries' model adopted are: the bus model, the DIMM and device models, and most importantly, simultaneous pipelined processing of multiple requests. The last component allows hiding activation and precharge latency using the pipelined interface of DRAM devices. The CPU is modeled to allow nonblocking load/store execution to support overlapped processing. DRAM address mapping parameters for the simulator were adopted from the DRAMSim framework [118]. A single-channel basic SDRAM mapping, as found in memory systems similar to the Intel 845G chipset's DDR Synchronous DRAM (SDRAM) mapping [5], was implemented. Some platform-specific implementation suggestions were taken from the VASA framework [117]. The DRAM energy consumption model is built as a set of counters that keep track of each of the commands issued to the DRAM. Each precharge, activation, CAS, write-back to DRAM cells, etc. is recorded, and the total energy consumed is reported using energy parameters derived from the Micron MT47H64M8 DDR2-800 datasheet [84]. This simulator does not model the energy consumption of the data and command buses as the schemes described here do not affect them. The energy consumption of the mapping table (MT) is derived from CACTI [87, 114] and is accounted for in the simulations.

These schemes are evaluated with full system simulation of a wide array of benchmarks. Multithreaded workloads from the PARSEC [22], OpenMP version of NAS Parallel Benchmark (NPB) [16], and SPECjbb [6] suites were used. The STREAM [3] benchmark was also used as one of the multithreaded workloads. Single-threaded applications from the SPEC CPU 2006 [51] and BioBench [12] suites were used for the multiprogrammed workload mixes. While selecting individual applications from these suites, the applications were first characterized for their total DRAM accesses, and then the two applications with the highest DRAM accesses were selected from each suite. For both multithreaded and single-threaded benchmarks, the application was simulated for 250 million cycles of execution. For multithreaded applications, the simulations start at the beginning of the parallel-region/region-of-interest of the application, and for single-threaded applications, the simulations start from 2 billion instructions after the start of the application. Total system throughput for single-threaded benchmarks is reported as weighted speedup [105], calculated as Σ_{i=1}^{n} (IPC_shared^i / IPC_alone^i), where IPC_shared^i is the IPC of program i in an n-core CMP.

The applications from the PARSEC suite are configured to run with the simlarge input set, applications from the NPB suite are configured with the Class A input set, and STREAM is configured with an array size of 120 million entries. Applications from the SPEC CPU2006 suite were executed with the ref inputs and BioBench applications with the default input set.
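As a quick illustration of the weighted-speedup metric defined above, the snippet below computes it from per-program IPC values; the IPC numbers are made up for the example.

```python
# Weighted speedup for an n-core CMP: sum over programs of IPC_shared / IPC_alone.
def weighted_speedup(ipc_shared, ipc_alone):
    assert len(ipc_shared) == len(ipc_alone)
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

# Hypothetical IPCs for a 4-core multiprogrammed mix (values are illustrative only).
print(weighted_speedup([0.8, 0.5, 1.1, 0.6], [1.0, 0.7, 1.3, 0.9]))
```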
Due to simulation speed constraints, we only simulated each application for 250 million cycles. This resulted in extremely small working set sizes for all these applications. With a 128 KB L1 and a 2 MB L2, we observed very high cache hit-rates for all these applications. We therefore had to scale down the L1 and L2 cache sizes to see any significant number of DRAM accesses. While deciding upon the scaled-down cache sizes, we chose a 32 KB L1 and a 128 KB L2 since these sizes gave approximately the same L1 and L2 hit-rates as when the applications were run to completion with larger cache sizes. With the scaled-down cache sizes, the observed cache hit-rates for all the applications are shown in Table 3.2. All experiments were performed with both First-Come First-Serve (FCFS) and First-Ready First-Come First-Serve (FR-FCFS) memory controller scheduling policies. Only results for the FR-FCFS policy are shown since it is the most commonly adopted scheduling policy.

Table 3.2: L2 Cache Hit-Rates and Benchmark Input Sets.

  Benchmark       L2 Hit Rate   Input Set
  blackscholes    89.3%         simlarge
  bodytrack       59.0%         simlarge
  canneal         23.8%         simlarge
  facesim         73.9%         simlarge
  ferret          79.1%         simlarge
  freqmine        83.6%         simlarge
  streamcluster   65.4%         simlarge
  swaptions       93.9%         simlarge
  vips            58.45%        simlarge
  IS              85.84%        class A
  MG              52.4%         class A
  mix             36.3%         ref (SPEC), default (BioBench)
  STREAM          48.6%         120 million entry array
  SPECjbb         69.9%         default

The results for three types of platforms are shown here:

• Baseline - This is the case where OS pages are laid out as shown in Figure 2.1. The OS is responsible for mapping pages to memory frames and since we simulate Linux OS, it relies on the buddy system [69] to handle physical page allocations. (For some applications, the experiments use the Solaris OS.)

• Epoch-Based Schemes - These experiments are designed to model the proposed schemes. Every epoch an OS subroutine is triggered that reads the access counters at the memory controller and decides if migrating a micro-page is necessary.

• Profiled Placement - This is a two-pass experiment designed to quantify an approximate upper bound on the performance of our proposals. It does so by "looking into the future" to determine the best layout. During the first pass, it creates a trace of OS page accesses. For the second pass of the simulation, these traces are preprocessed and the best possible placement of micro-pages for every epoch is determined. This placement is decided by looking at the associated cost of migrating a page for the respective scheme, and the benefit it would entail. An important fact to note here is that since multithreaded simulations are nondeterministic, these experiments can be slightly unreliable indicators of best performance. In fact, for some of the experiments, this scheme shows performance degradation. However, it is still an important metric to determine the approximate upper bound on performance.

For all the experimental evaluations, we use a constant overhead of 70,000 cycles per epoch to execute the OS daemon and for DRAM data migrations. For a 5 million cycle epoch, this amounts to 1.4% overhead per epoch in terms of cycles. From initial simulations, it was observed that approximately 900 micro-pages were being moved every epoch for all the simulated applications. From the parameters of the DRAM configuration, this evaluates to nearly 60,000 DRAM cycles for migrations. For all the above-mentioned schemes, the results are presented for different epoch lengths to show the sensitivity of the schemes to this parameter. The simulations use a 4 KB OS pages size with 1 KB micro-page size for all the experiments. For the ROPS scheme, an additional 10,000 cycle penalty for all 1 KB → 4 KB superpage creations in an epoch is used. This overhead models the identification of candidate pages and the update of OS book-keeping structures. TLB shoot-down and the ensuing page walk overheads are added on top of this overhead. Superpage creation beyond 4 KB is not modeled as this behavior is expected to be the same for the baseline and proposed models. The expected increase in page table size because of the use of smaller 1 KB pages is not modeled in detail in our simulator. After application warm-up and superpage promotion, the maximum number of additional page table entries required is 12,288 on a 4 GB main memory system, a 1.1% overhead in page table size. This small increase in page table size is only expected to impact cache behavior when TLB misses are unusually frequent – which is not the case, thus it does not affect our results unduly. The decision to model 4 KB page sizes for this study was based on the fact that 4 KB pages are most common in hardware platforms like x86 and x86 64. An increase in baseline page size (to 8 KB or larger) would increase the improvement seen by our proposals, as the variation in subpage block accesses increase. UltraSPARC and other enterprise hardware often employ a minimum page size of 8 KB to reduce the page table size when addressing large amounts of memory. Thus, the performance improvements reported in this study are likely to be a conservative estimate for some architectures.
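The fixed per-epoch costs quoted above can be checked with a few lines of arithmetic; the constants come from the text, and the sweep over epoch lengths simply shows how the relative overhead shrinks as epochs grow.

```python
# Per-epoch overhead as a fraction of the epoch, for the epoch lengths studied.
DAEMON_AND_MIGRATION_CYCLES = 70_000    # OS daemon + roughly 900 micro-page copies
ROPS_SUPERPAGE_PENALTY = 10_000         # extra per-epoch cost charged to ROPS

for epoch in (1e6, 5e6, 10e6, 50e6, 100e6):
    ham = DAEMON_AND_MIGRATION_CYCLES / epoch
    rops = (DAEMON_AND_MIGRATION_CYCLES + ROPS_SUPERPAGE_PENALTY) / epoch
    print(f"epoch {epoch/1e6:>5.0f}M cycles: HAM overhead {ham:.2%}, ROPS overhead {rops:.2%}")
```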

3.4.2 Evaluation

In this section, it is first shown in Figure 3.6 that reserving 0.5% of our DRAM capacity (4 MB, or 4096 1 KB slots) for co-locating hot micro-pages is sufficient. The applications were simulated with a 5 million cycle epoch length. For each application, the epoch that touched the highest number of 4 KB pages was selected, and the percent of total accesses that went to micro-pages in the reserved DRAM capacity is plotted. The total number of 4 KB pages touched is also plotted in the same figure (right-hand Y-axis). For all but 3 applications (canneal, facesim, and MG), on average 94% of the total accesses to DRAM in that epoch are to micro-pages in the reserved 4 MB capacity.

Figure 3.6: Accesses to Micro-Pages in Reserved DRAM Capacity.

This figure also shows that application footprints per epoch are relatively small, and our decision to use only 512 counters at the DRAM is also valid. We obtained similar results with simulation intervals several billions of cycles long. The next set of results presents the impact of the reduced OS page size (ROPS) scheme, described in Section 3.3.1, in Figure 3.7. The results are shown for 14 applications - eight from the PARSEC suite, two from the NPB suite, one multiprogrammed mix from the SPEC CPU2006 and BioBench suites, STREAM, and SPECjbb2005. Figure 3.7 shows the performance of the proposed scheme compared to the baseline. The graph also shows the change in TLB hit-rates for the 1 KB page size compared to the 4 KB page size, for a 128-entry TLB (secondary Y-axis, right-hand side). Note that for most of the applications, the change in TLB hit-rates is very small compared to baseline 4 KB pages. This demonstrates the efficacy of superpage creation in keeping TLB misses low. Only for three applications (canneal, the multiprogrammed workload mix, and SPECjbb) does the change in TLB hit-rates go over 0.01% compared to the baseline TLB hit-rate. Despite the sensitivity of application performance to TLB hit-rate, with our proposed scheme both canneal and SPECjbb show notable improvement.

Figure 3.7: Performance Improvement and Change in TLB Hit-Rates (w.r.t. Baseline) for ROPS with 5 Million Cycle Epoch Length

This is because application performance tends to be even more sensitive to DRAM latency. Therefore, for the Reduced OS Page Size (ROPS) proposal, a balance must be struck between reduced DRAM latency and reduced TLB hit-rate. Out of these 14 applications, 5 show performance degradation. This is attributed to the overheads involved in DRAM copies, the increased TLB miss-rate, and the daemon overhead, while little performance improvement is gained from co-locating micro-pages. The reason for the low performance improvement from co-location is the nature of these applications. If applications execute tight loops with regular access patterns within OS pages, then it is hard to improve row-buffer hit-rates because there are few, if any, hot micro-pages. We talk more about this issue when we discuss results for the HAM scheme below. Figure 3.8 presents energy-delay-squared (E ∗ D2) for the ROPS proposal with an epoch length of 5 million cycles, normalized to the baseline. E refers to the DRAM energy (excluding memory channel energy) per CPU load/store. D refers to the inverse of throughput. The second bar (secondary Y-axis) plots the performance for convenient comparison. All applications, except ferret, have E ∗ D2 lower than (or equal to) the baseline with the proposed scheme. A higher number of row-buffer hits leads to higher overall efficiency: lower access time and lower energy. The average savings in energy across all the applications is nearly 12% for ROPS.
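For reference, the normalized E ∗ D2 metric used in Figures 3.8 and 3.10 can be computed as sketched below; the inputs in the example are placeholder values, not measured results.

```python
# E*D^2 relative to baseline, where E is DRAM energy per CPU load/store and
# D is the inverse of throughput (so lower is better).
def normalized_ed2(energy_per_access, throughput,
                   base_energy_per_access, base_throughput):
    e_d2 = energy_per_access * (1.0 / throughput) ** 2
    base_e_d2 = base_energy_per_access * (1.0 / base_throughput) ** 2
    return e_d2 / base_e_d2

# Illustrative values: 10% less energy per access and 5% higher throughput.
print(normalized_ed2(0.9, 1.05, 1.0, 1.0))   # ~0.82, i.e., an 18% E*D^2 saving
```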

Figure 3.8: Energy Savings (E ∗ D2) for ROPS with 5 Million Cycle Epoch Length

Figure 3.9: Profile, HAM, and ROPS - Performance Improvement for 5 Million Cycle Epoch Length

Figure 3.9 shows the results for the Hardware Assisted Migration (HAM) scheme, described in Section 3.3.2, along with ROPS and the profiled scheme, for a 5 million cycle epoch length. Only two benchmarks suffer performance degradation. This performance penalty occurs because a quickly changing DRAM access pattern does not allow HAM or ROPS to effectively capture the micro-pages that are worth migrating. This leads to high migration overheads without any performance improvement, leading to overall performance degradation. For the Profile scheme, the occasional minor degradation is caused by the nondeterminism of multithreaded workloads, as mentioned earlier. Compute applications, like those from the NPB suite, some from the PARSEC suite, and the multiprogrammed workload (SPEC-CPU and BioBench suites), usually show low or negative performance change with our proposals. The reason for these poor improvements is the tight loops in these applications that exhibit a very regular data access pattern. This implies there are fewer hot micro-pages within an OS page. Thus, compared to applications like vips - that do not stride through OS pages - performance improvement is not as high for these compute applications. The prime example of this phenomenon is the STREAM application. This benchmark is designed to measure the main memory bandwidth of a machine, and it does so by loop-based reads and writes of large arrays. As can be seen from the graph, it exhibits only a modest performance gain. Applications like blackscholes, bodytrack, canneal, swaptions, vips, and SPECjbb, which show substantial performance improvement, on the other hand, all work on large data sets while they access some OS pages (some micro-pages, to be precise) heavily. These are possibly code pages that are accessed as the execution proceeds. The average performance improvement for these applications alone is approximately 9%. Figure 3.10 plots the E ∗ D2 metric for all three schemes. As before, lower is better. Note that while accounting for energy consumption under the HAM scheme, we take into account the energy consumption in DRAM and the energy consumed by DRAM data migrations and MT look-ups. All but one application perform better than the baseline and save energy for HAM. As expected, Profile saves the maximum amount of energy while HAM and ROPS closely track it. ROPS almost always saves less energy than HAM since TLB misses are costly, both in terms of DRAM access energy and application performance. On average, HAM saves about 15% on E ∗ D2 compared to the baseline, with very little standard deviation between the results (last bar in the graph). Epoch length is an important parameter for the proposed schemes. To evaluate the sensitivity of results to the epoch length, we experimented with many different epoch durations (1 M, 5 M, 10 M, 50 M, and 100 M cycles). The results from experiments with an epoch length of 10 M cycles are summarized in Figures 3.11 and 3.12. A comparison of Figures 3.9, 3.10, 3.11, and 3.12 shows that some applications (such as SPECjbb) benefit more from the 5 M epoch length, while others (blackscholes) benefit more from the 10 M epoch length.

Figure 3.10: Profile, HAM, and ROPS - Energy Savings (E ∗ D2) for 5 Million Cycle Epoch Length

The epoch length is envisioned to be a tunable parameter in software. Since performance and energy consumption are also sensitive to the memory controller's request scheduling policy, we also experimented with the First-Come First-Serve (FCFS) scheduling policy. As with the variable epoch length experiments, the results show consistent performance improvements (which were predictably lower than with the FR-FCFS policy). Finally, to show that the simulation interval of 250 million cycles was representative of longer execution windows, the schemes were also evaluated with a 1 billion cycle simulation window. Figure 3.13 presents the results for this sensitivity experiment. Note that only the performance improvement over the baseline is shown, as the E ∗ D2 results are similar. As can be observed, the performance improvement is almost identical to the 250 million cycle simulations. Results Summary: For both ROPS and HAM, consistent improvements in energy and performance were observed. The higher benefits with HAM come from its ability to move data without the overhead of TLB shoot-downs and TLB misses. For HAM, since updating the MT is not expensive, the primary overhead is the cost of performing the actual DRAM copy. The energy overhead of MT look-ups is kept low by the design choices discussed earlier and is marginal compared to the page table updates, TLB shoot-downs, and TLB misses associated with ROPS.

Figure 3.11: Profile, HAM, and ROPS - Performance Improvement for 10 Million Cycle Epoch Length

Figure 3.12: Profile, HAM, and ROPS - Energy Savings (E ∗ D2) Compared to Baseline for 10 Million Cycle Epoch.

Figure 3.13: Profile, HAM, and ROPS - Performance Improvements for 10M Cycle Epoch Length, and 1 Billion Execution Cycles.

Due to these lower overheads, the HAM scheme performs slightly better than ROPS (1.5% in terms of performance and 2% in energy for best performing benchmarks). The two schemes introduce different implementation overheads. While HAM requires hardware additions, it exhibits slightly better behavior. ROPS, on the other hand, is easier to implement in commodity systems today and offers flexibility because of its reliance on software.

3.5 Related Work Cuppu et al. [33] first showed that performance is sensitive to DRAM data mapping policy and Zhang et al. [127] proposed schemes to reduce row-buffer conflicts using a permutation-based mapping scheme. Delaluz et al. [36] leveraged both software and hard- ware techniques to reduce energy while Huang et al. [54] studied it from OS’ virtual memory subsystem perspective. Recent work [128, 129] focuses on building higher performance and lower energy DRAM memory systems with commodity DRAM devices. Schemes to control data placement in large caches by modifying physical addresses have also been studied recently [14, 27, 50, 126]. The schemes presented here build on this large body of work and leverage the observations that DRAM accesses to most heavily accessed OS pages are clustered around few cache-line sized blocks. Page allocation and migration have been employed in a variety of contexts. Several 35 bodies of work have evaluated page coloring and its impact on cache conflict misses [21, 37, 67, 86, 103]. Page coloring and migration have been employed to improve proximity of computation and data in a NUMA multiprocessor [26, 30, 72, 73, 74, 115] and in a NUCA caches [14, 28, 97]. These bodies of work have typically attempted to manage capacity constraints (especially in caches) and communication distances in large NUCA caches. Most of the NUMA work predates the papers [33, 32, 100] that shed insight on the bottlenecks arising from memory controller constraints. Here, the presented schemes not only apply the well-known concept of page allocation to a different domain, they also extend the policies to be cognizant of the several new constraints imposed by DRAM memory systems, particularly row-buffer re-use. A significant body of work has been dedicated to studying the effects of DRAM memory on overall system performance [32, 80] and memory controller policies [44, 100]. Recent work on memory controller policies studied effects of scheduling policies on power and performance characteristics [128, 89, 92, 132] for CMPs and SMT processors. Since the memory controller is a shared resource, all threads experience a slowdown when running in tandem with other threads, relative to the case where the threads execute in isolation. Mutlu and Moscibroda [89] observe that the prioritization of requests to open rows can lead to long average queuing delays for threads that tend to not access open rows. This leads to unfairness with some threads experiencing memory stall times that are ten times greater than that of the higher priority threads. That work introduces a Stall-Time Fair Memory (STFM) scheduler that estimates this disparity and over-rules the prioritization of open row access if the disparity exceeds a threshold. While this policy explicitly targets fairness (measured as the ratio of slowdowns for the most and least affected threads), minor throughput improvements are also observed as a side-effect. Such advances in scheduling policies can easily be integrated with our policies to give additive improvements. The schemes presented here capture locality at the DRAM row-buffer; however, they are analogous to proposals like victim caches [65]. Victim caches are populated by recent evictions, while our construction of an efficient “DRAM region” is based on the detection of hot-spots in the access stream that escapes whatever preceding cache level. 
The schemes presented in this chapter take advantage of the fact that co-location of hot-spots leads to better row-buffer utilization, while the corresponding artifact does not exist in traditional victim caches. The optimization is facilitated by introducing another level of indirection. Victim caches, on the other hand, provide increased associativity for a few sets, based on application needs. Therefore, it is easy to see that micro-pages and victim caches are not directly comparable in terms of their design or utility.

3.6 Summary In this chapter, we attempt to address the issues of increasing energy consumption and access latency being faced by modern DRAM memory systems. Two schemes were proposed that control data placement for improved energy and performance characteristics. These schemes are agnostic to device and signaling standards and therefore, their implementation is not constrained by standards. Both schemes rely on DRAM data migration to maximize hits within a row-buffer. The hardware-based proposal incurs less run-time overhead, compared to the software-only scheme. On the other hand, the software-only scheme can be easily implemented without major architectural changes and can be more flexible. Both schemes provide overall performance improvements of 7-9% and energy improvements of 13-15% for our best performing benchmarks. CHAPTER 4

TIERED MEMORY

This chapter describes how data placement mechanisms can be used to improve main memory capacity while keeping the memory power budget constant. These mechanisms place frequently accessed data in “active” DRAM ranks while placing infrequently accessed data in DRAM ranks that are predominantly maintained in “low-power” mode.

4.1 Introduction Main memory capacity and bandwidth are fast becoming dominant factors in server performance. Due largely to power and thermal concerns, the prevailing trend in processor architecture is to increase core counts rather than processor frequency. Multicore processors place tremendous pressure on memory system designers to increase main memory capacity and bandwidth at a rate proportional to the increase in core counts. The pressure to grow memory capacity and bandwidth is particularly acute for mid- to high-end servers, which often are used to run memory-intensive database and analytics applications or to consolidate large numbers of memory-hungry virtual machine (VM) instances. Server virtualization is increasingly popular because it can greatly improve server utilization by sharing physical server resources between VM instances. In practice, virtualization shares compute resources effectively, but main memory is difficult to share effectively. As a result, the amount of main memory needed per processor is growing at or above the rate of increase in core counts. Unfortunately, DRAM device density is improving at a slower rate than per-processor core counts, and per-bit DRAM power is improving even slower. The net effect of these trends is that servers need to include more DRAM devices and allocate an increasing proportion of the system power budget to memory to keep pace with the server usage trends and processor core growth. 38

4.1.1 Memory Power Wall Since DRAM energy efficiency improvements have not kept pace with the increasing demand for memory capacity, servers have had to significantly increase their memory sub- system power budgets [25]. For example, Ware et al. [120] report that in the shift from IBM POWER6 processor-based servers to POWER7-based servers, the processor power budget for a representative high-end server shrank from 53% to 41% of total system power. At the same time, the memory power budget increased from 28% to 46%. Most servers today are forced to restrict the speed of their highest capacity memory modules to accommodate them within the server’s power budget [49]. This trend is not sustainable — overall system power budgets have remained constant or decreased and many data centers already operate close to facility power and/or server line-cord power limits. These power constraints are now the primary factor limiting memory capacity and performance, which in turn limits server performance. We call this emerging problem the memory power wall. To overcome this memory power wall, system designers must develop methods to increase memory capacity and bandwidth for a fixed power budget. We propose to attack the problem by exploiting DRAM low-power, idle modes. DDR3 DRAM supports active power-down, fast precharge power-down, slow precharge power-down, and self-refresh [7] modes. Each mode progressively reduces DRAM device power by gating additional components within the device, but increases the latency to return to the active state and start servicing requests, introducing a tradeoff between idle mode power reduction and potential increase in access latency. DRAM low-power modes are not aggressively exploited due to two factors: (i) the coarse granularity at which memory can be put in a low-power mode and (ii) application memory access patterns. Most DRAM-based memory systems follow the JEDEC standard [62]. Memory is organized as modules (DIMMs) composed of multiple DRAM devices. When a processor requests a cache line of data from main memory, the physical address is used to select a channel, then a DIMM, and finally a rank within the DIMM to service the request. All devices in a rank work in unison to service each request, so the smallest granularity at which memory can be put into a low-power mode is a rank. For typical systems built using 4GB DIMMs, a rank is 2 GB in size, a substantial fraction of main memory even for a large server. The large granularity at which memory can be placed into low-power modes is a serious problem because of the increasingly random distribution of memory accesses across ranks. To increase memory bandwidth, most servers interleave consecutive cache lines in the phys- 39 ical address space across ranks. Further, without a coordinated effort by memory managers at all levels in the system (library and operating system), application data are effectively randomly allocated across DRAM ranks. Virtualization only exacerbates this problem [60] because it involves yet another memory management entity, the hypervisor. Consequently, frequently accessed data tend to be spread across ranks, so no rank experiences sufficient idleness to warrant being placed in a low-power mode. As a result, few commercial systems exploit DRAM low-power modes and those that do typically only use the deep low-power modes on nearly idle servers.

4.2 An Iso-Powered Tiered Memory Design To address the memory wall problem, this chapter proposes an iso-powered tiered mem- ory architecture that exploits DRAM low-power modes by creating two tiers of DRAM (hot and cold) with different power and performance characteristics on the same DRAM channel. The tiered memory configuration has the same total memory power budget (iso-power) but allows for larger total capacity. The hot tier is comprised of DRAM in the active or precharge standby (full power) mode when idle, while the cold tier is DRAM in self-refresh (low power) mode when idle. The cold tier retains memory contents and can service memory requests, but memory references to ranks in the cold tier will experience higher access latencies. To optimize performance, we dynamically migrate data between tiers based on their access frequency. We can also dynamically adjust the amount of DRAM in the hot and cold tiers based on aggregate workload memory requirements. If one is able to keep most of main memory in a low-power mode, servers with far more memory than traditional designs for the same power budget can be built. An idle rank of DRAM in self-refresh mode consumes roughly one-sixth of the power of an idle rank in active-standby mode, so an iso-powered tiered memory design can support up to six ranks in self-refresh mode or one rank in active-standby mode for the same power budget. The actual ratio of cold ranks per hot rank is dependent on the distribution of memory requests to the hot and cold tiers, and is derived in Section 4.4.1.1. For example, the same power budget can support eight hot ranks and no cold ranks (16GB), four hot ranks and twelve cold ranks (32GB), or two hot ranks and twenty-two cold ranks (48GB). Which organization is optimal is workload-dependent. For workloads with small aggregate working sets and/or high sensitivity to memory access latency, the baseline configuration (16GB of DRAM in the active mode) may perform best. For workloads with large aggregate working sets, an organization that maximizes total memory capacity (e.g., 4GB in active mode and 44GB 40 in self-refresh mode) may perform best. VM consolidation workloads typically are more sensitive to memory capacity than memory latency, and thus are likely to benefit greatly from the larger capacity tiered configurations. Our iso-powered tiered memory architecture exploits this ability to trade off memory access latency for memory capacity under a fixed power budget. A key enabler of our approach is that for many applications, a small fraction of the application’s memory footprint accounts for most of its main memory accesses. Thus, if hot data can be identified and migrated to the hot tier, and cold data to the cold tier, we can achieve the power benefits of tiering without negatively impacting performance. To ensure hot and cold data are placed in the appropriate tier, we modify the operating system or hypervisor to track access frequency. We divide execution into 20M-100M long instruction epochs and track how often each DRAM page is accessed during each epoch. At the end of each epoch, system software migrates data to the appropriate tier based on its access frequency. To limit the latency and energy overhead of entering and exiting low-power modes, we do not allow memory requests to wake up cold ranks arbitrarily. 
Instead, we delay servicing requests to cold ranks when they are in self-refresh mode, which allows multiple requests to a cold rank to be queued and handled in a burst when it is made (temporarily) active. To avoid starvation, we set an upper bound on how long a request can be delayed. The performance impact of queuing requests depends on both the arrival rate of requests to ranks in self-refresh mode and the latency-sensitivity of the particular workload. Our primary focus is on mid- to high-end servers used as virtualization platforms. In this environment, memory capacity is generally more important than memory latency, since DRAM capacity tends to be the primary determinant of how many VMs can be run on a single physical server. Our experiments, discussed in Section 4.5, indicate that occasionally queuing memory requests decreases individual application performance by less than 5%, while being able to support 3X as much total DRAM, increasing aggregate server throughput by 2.2-2.9X under a fixed power budget. The main contributions of the proposed schemes are the following:

• We propose a simple, adaptive two-tier DRAM organization that can increase memory capacity on demand within a fixed memory power budget.

• We demonstrate how variation in page access frequency can be exploited by the tiered memory architecture.

• We analyze the proposed iso-powered tiered memory architecture to quantify its impact on server memory capacity and performance.

4.3 Memory Access Characteristics of Workloads Our iso-powered tiered memory architecture identifies hot/cold pages in an application’s address space, migrates them into the appropriate tiers, and exploits this differentiated access characteristics to increase total memory capacity. This is only feasible if applications exhibit strong access locality, wherein a relatively small fraction of memory is responsible for a large fraction of dynamic accesses. To determine if that is so, we first examined a set of applications from the PARSEC suite using their native inputs. We instrumented the OS to track references to memory blocks every 10 msecs at a 128 MB granularity - the first access to a block in a 10 ms interval would increment the reference count for that block. Figure 4.1(a) presents a cumulative distribution function (CDF) of the block reference counts showing the fraction of workload footprint needed to cover a given fraction of block reference counts for each workload. While these data are collected at a fairly coarse spatial (128 MB) and temporal granularity (10 msec), it is evident that there is substantial memory access locality. In all cases, 10-20% of each application’s memory accounts for nearly all accesses, which is sufficient locality for iso-powered tiered memory to be effective. To determine whether tracking reference frequency at a small granularity would demon- strate similar locality, we reran our experiments in an event-driven system simulator (de- scribed in detail in Section 4.5.2) and tracked accesses at a 4 KB OS page granularity. Figure 4.1(b) plots the resulting memory access CDFs for seven applications. The finer granularity of simulator-measured reference locality data shows a similar spatial locality in the memory region references as we found with the coarse grained measurements on existing hardware. Figure 4.2 shows the distribution of accesses to individual 4 KB pages sorted from the most frequently accessed page to the least, for two of the most memory-intensive PARSEC and SPEC CPU2006 benchmarks, canneal and libquantum. Both applications exhibit traditional “stair step” locality, with sharp knees in their access CDF graphs that indicate the size of individual components of each application’s working set. The cumulative working set sizes where steep transitions occur are good candidate hot tier sizes if our hot/cold data migration mechanism can effectively migrate hot/cold data to the appropriate tier. 42

a. Real-system memory block reference distribution at 128MB granularity for select PARSEC workloads

b. Simulated memory access count distribution at 4 KB page granularity for select PARSEC and SPEC CPU2006 workloads

Figure 4.1: Cumulative Access Distribution of DRAM Memory Requests for Different Workloads.

Figure 4.2: Page Access Distribution at 4 KB Granularity Measured on Our Simulator (panels (a) and (b)).

4.4 Implementation Tiered memory is a generalizable memory architecture that leverages the heterogeneous power-performance characteristics of each tier. For instance, there is a large body of recent work on hybrid memory systems that mix multiple memory technologies (e.g., DRAM and PCM) [15, 38]. Our work introduces the notion of DRAM-only tiered designs, which can be implemented with existing memory technologies. We use the self-refresh mode, which provides the maximum reduction in idle DRAM power, but also the highest exit latency, to create our cold tier of memory. While we could in theory support more than two tiers, the incremental benefit of using higher power idle modes is marginal and increases implementation complexity.

4.4.1 Implementing Tiered Memory in a DRAM Memory System We implement iso-powered tiered memory by dividing memory into two tiers, a hot tier in the active state and a cold tier in self-refresh mode. To balance the goal of maximizing power savings while mitigating the performance impact, we can adjust the size of the hot and cold tiers based on the size and locality of the working sets of running applications. For some application mixes, optimal iso-power performance is achieved by placing as many ranks as possible given the power budget in the hot tier, and turning off the remainder of DRAM. Referring to our example from Section 4.1, this would be the organization with eight ranks (16 GB) in the hot tier and no ranks in the cold tier. Other application mixes might have large footprints but very high spatial locality, and thus achieve the best performance by prioritizing capacity, e.g., the two hot rank (4 GB) and twenty-two cold rank (44 GB) example from Section 4.1. In our proposed design, servers are configured with enough DIMMs to store the maximum- sized iso-power configuration supported, which in the example above would be 48 GB. For the purposes of this work, we assume that these ranks (DIMMs) are allocated evenly across available memory channels. The operating system (or hypervisor) determines how many ranks should be in each power mode given the memory behavior of the current application mix, subject to a total memory system power budget. Since we only consider one application mix at a time, we do not dynamically resize the tiers midrun, but rather evaluate the performance of several iso-power hot/cold tier configurations. In Section 4.6, we discuss other possible organizations, e.g., modifying the number of memory channels or memory controllers, but they do not have a substantial impact on our conclusions. 45

4.4.1.1 Iso-Powered Tiered Memory Configurations

As a first step, we need to determine what mix of hot and cold ranks can be supported for a given power budget. Assume a baseline (all hot ranks) design with N ranks. An iso-powered tiered configuration can support a mix of Nhot hot ranks and Ncold cold ranks (Nhot ≤ N ≤ Nhot + Ncold). Whenever we adjust the number of active ranks, we do so evenly across all memory channels and memory controllers, so the number of hot/cold/off ranks per memory channel increases/decreases in direct proportion to changes in tier size and aggregate capacity. This design imposes the fewest changes to current memory controller designs, but other designs are discussed in Section 4.6. Note that the number of hot/cold ranks is not the only factor that impacts power consumption — power is also consumed by actual memory operations. To avoid exceeding the power budget, we employ well-known memory throttling techniques [49] to limit the rate of accesses to the memory system. We can adjust the active power budget up/down by decreasing/increasing the memory throttle rate. The power allocated for memory accesses impacts the number of hot and cold ranks that can be supported for a given power budget. Note that no actual throttling was needed during our evaluations as the aggregate bandwidth demand for the memory tiers was low enough to be met within the memory system power budget.

We compute iso-powered tiered memory configurations based on the desired peak memory request rate (rpeak), the average service time (tsvc), the idle power of hot (phot) and cold (pcold) ranks of DRAM, the fraction of requests expected to be serviced by the hot tier (µ), the amount of time we are willing to queue requests destined for the cold tier (tq), and the latency of entering and exiting the cold tier's low-power mode (te). Using psvc for the average power to service a request, we can represent the power equivalence as follows:

Tiered memory power = Conventional memory power:

  phot Nhot + (psvc − phot) µ rpeak tsvc + pcold Ncold + (psvc − pcold) (1 − µ) rpeak tsvc
      + [(phot − pcold) Ncold te] / [tq (1 − (1 − µ) rpeak tsvc / Ncold)]
  = phot N + psvc rpeak tsvc                                                        (4.1)

After canceling out the active power components (the terms with psvc), the baseline idle power can be redistributed among a greater number of ranks by placing idle cold ranks in low-power mode:

  phot Nhot − phot µ rpeak tsvc + pcold Ncold − pcold (1 − µ) rpeak tsvc
      + [(phot − pcold) Ncold te] / [tq (1 − (1 − µ) rpeak tsvc / Ncold)]
  = phot N                                                                          (4.2)

If we define Pratio to be the ratio of the idle power of hot and cold ranks (phot/pcold) and Tratio to be the ratio of the transition latency to the maximum queuing delay (te/tq), Equation 4.2 can be reduced to an expression for the number of cold ranks (Ncold) that can be supported for a given number of hot ranks (Nhot):

  Ncold = [Pratio (N − Nhot) − (Pratio − 1) (1 − µ) rpeak tsvc (1 − Tratio)]
          / [1 + (Pratio − 1) Tratio]                                               (4.3)

Pratio and rpeak are constants, and tsvc values are in a fairly narrow range for any given system design; µ is application-dependent. Given their values, we can solve for Ncold as a function of Nhot for a particular point in the design space. Figure 4.3 depicts the resulting set of possible memory configurations that can be supported for a particular access pattern (µ) for various DRAM technologies (Pratio). For this analysis, we assume a baseline design with eight ranks (Nhot = 8, Ncold = 0) and a DRAM service time of one-quarter the DRAM activate-precharge latency. The range of hot-to-cold idle power ratios shown along the x-axis (6-10) is typical for modern DRAM devices [7]. As can be seen in Figure 4.3, we can support larger tiered memory sizes by using deeper low-power modes (increasing Pratio) and when more accesses hit the hot tier (increasing µ). We use this model to identify iso-powered tiered memory configurations of interest. For our analysis, we consider two configurations: (1) one with four hot ranks and twelve cold ranks ((Nhot, Ncold) = (4, 12)) and (2) one with two hot ranks and twenty-two cold ranks ((Nhot, Ncold) = (2, 22)). Both of these configurations fit in the same power budget even when only 70% of memory accesses hit the hot tier (µ = 0.7) and when the ratio of hot-to-cold idle power is only six. Note that throttling ensures that even applications with particularly poor locality (µ under 0.7) operate under the fixed power budget, although they will suffer greater performance degradation.
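A small helper can evaluate Equation 4.3 for candidate design points, as sketched below. The rank counts, µ, and Pratio follow the discussion above, while the utilization term (rpeak · tsvc) and Tratio are assumed placeholder values chosen only to illustrate the calculation.

```python
import math

def iso_power_cold_ranks(n_total, n_hot, mu, p_ratio, t_ratio, util):
    """Equation 4.3: cold ranks supportable at iso-power for n_hot hot ranks.
    util stands for rpeak * tsvc (average rank occupancy of the baseline);
    t_ratio = te / tq; p_ratio = phot / pcold."""
    num = p_ratio * (n_total - n_hot) - (p_ratio - 1) * (1 - mu) * util * (1 - t_ratio)
    den = 1 + (p_ratio - 1) * t_ratio
    return math.floor(num / den)

# Assumed operating point: 8-rank baseline, 70% of accesses hit the hot tier,
# 6:1 hot:cold idle power ratio; util = 2.0 and t_ratio = 0.1 are placeholders.
print(iso_power_cold_ranks(n_total=8, n_hot=2, mu=0.7, p_ratio=6,
                           t_ratio=0.1, util=2.0))   # -> 22 cold ranks
```

With these assumed inputs the result matches the 2-hot/22-cold design point discussed above; different utilization or Tratio assumptions would shift the answer accordingly.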

4.4.1.2 Servicing Requests from Tiered Memory

Memory requests to hot tier ranks are serviced just as they would be in a conventional system. For ranks in the cold tier, requests may be queued briefly to maximize the time spent in the DRAM low-power mode and to amortize the cost of changing power modes. Figure 4.4 shows how requests to the cold tier are handled. When there are no pending requests destined for a particular cold rank, it is immediately put in self-refresh mode. If a request arrives at a rank in self-refresh mode, e.g., at time t0 in Figure 4.4, it is queued for up to tq seconds.

[Figure 4.3 consists of three panels, one each for µ = 0.5, µ = 0.7, and µ = 0.9, plotting the total number of ranks (Nhot + Ncold) against the hot/cold idle power ratio (Pratio, 6-10) for Nhot = 2, 3, and 4.]

Figure 4.3: Iso-Powered Tiered Memory Capacity Relative to an 8-rank Conventional System.

Figure 4.4: Cold Tier Rank Mode Changes with Request Arrival.

Any other requests that arrive during this interval are also queued. After experimenting with different values of tq, we use a fixed value of tq = 10 ∗ texit + tentry for our experiments. We found this value to be long enough to amortize entry/exit overheads, but also small enough to not trigger software error recovery mechanisms. To support this mechanism, we augment the memory controller's DRAM scheduler with per-rank timers. When a request arrives at a rank in self-refresh mode, the timer is started.

When it reaches tq − texit, the controller issues the DRAM commands required to begin the transition of the rank from self-refresh to active mode. The rank remains in the active mode as long as there are requests pending for it. Once all the pending requests are serviced, the rank is transitioned back into low-power mode.
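The queuing policy can be summarized by the behavioral sketch below. It is not the controller's actual implementation: the latency constants are invented, the wake-up latency itself is elided, and one request is serviced per tick purely for brevity.

```python
# Behavioral sketch of cold-tier request handling: requests to a rank in
# self-refresh are held for at most t_q cycles and then serviced in a burst.
T_EXIT, T_ENTRY = 768, 512         # assumed self-refresh exit/entry latencies (cycles)
T_Q = 10 * T_EXIT + T_ENTRY        # maximum queuing delay, following the text

class ColdRank:
    def __init__(self):
        self.in_self_refresh = True
        self.queue = []
        self.timer_start = None    # cycle when the oldest queued request arrived

    def enqueue(self, request, now):
        # Start the per-rank timer on the first arrival to a sleeping rank.
        if self.in_self_refresh and self.timer_start is None:
            self.timer_start = now
        self.queue.append(request)

    def tick(self, now):
        if self.in_self_refresh:
            # Begin waking early enough that the rank is active by t_q
            # (the exit latency itself is not modeled in this sketch).
            if self.timer_start is not None and now - self.timer_start >= T_Q - T_EXIT:
                self.in_self_refresh = False
        else:
            if self.queue:
                self.queue.pop(0)              # service pending requests in order
            else:
                self.in_self_refresh = True    # idle again: back to self-refresh
                self.timer_start = None
```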

4.4.1.3 Identifying Hot and Cold Pages Our tiered memory mechanism is most effective when most accesses are made to the hot tier, so we track page access frequency and periodically migrate pages between the hot and cold tiers based on their access frequency. To identify hot/cold pages, we assume that the memory controller is augmented to support a version of the access frequency tracking mechanism proposed by Ramos et al. [99], which builds on the Multi Queue algorithm [130] (MQ) first proposed for ranking disk blocks.

The algorithm employs M Least Recently Used (LRU) queues of page descriptors (q0 to qM−1), each of which includes a page number, a reference count, and an expiration value. The descriptors in qM−1 represent the pages that are most frequently accessed. The algorithm also maintains a running count of total DRAM accesses (Current). When a page is first accessed, the corresponding descriptor is placed at the tail of q0 and its expiration value is initialized to Current + Timeout, where Timeout is the number of consecutive accesses to other pages before we expire a page's descriptor. Every time the page is accessed, its reference count is incremented, its expiration time is reset to Current + Timeout, and its descriptor is moved to the tail of its current queue. The descriptor of a frequently used page is promoted from qi to qi+1 (saturating at qM−1) if its reference count reaches 2^(i+1). Conversely, the descriptors of infrequently accessed pages are demoted if their expiration value is reached, i.e., they are not accessed again before Timeout requests to other pages. Whenever a descriptor is promoted or demoted, its reference count and expiration value are reset. To avoid performance degradation, updates to these queues are performed by the memory controller off the critical path of memory accesses, using a separate queue of updates and a small on-chip Static Random Access Memory (SRAM) cache.
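The bookkeeping above can be captured in a short software model, sketched below under assumed values for M and Timeout; the real mechanism is implemented at the memory controller, and any victim handling for pages that fall out of q0 is omitted here.

```python
from collections import OrderedDict

M = 8           # number of LRU queues (assumed)
TIMEOUT = 1000  # accesses to other pages before a descriptor expires (assumed)

class MultiQueue:
    """Software sketch of the MQ ranking described above; q[M-1] holds the hottest pages."""
    def __init__(self):
        self.q = [OrderedDict() for _ in range(M)]  # page -> [ref_count, expires]
        self.level = {}                             # page -> queue index
        self.current = 0                            # running count of DRAM accesses

    def access(self, page):
        self.current += 1
        i = self.level.get(page, 0)
        refs = self.q[i].pop(page)[0] + 1 if page in self.q[i] else 1
        if refs >= 2 ** (i + 1) and i < M - 1:      # promote when the count reaches 2^(i+1)
            i, refs = i + 1, 0                      # reset the count on promotion
        self.q[i][page] = [refs, self.current + TIMEOUT]
        self.level[page] = i
        self._expire()

    def _expire(self):
        # Demote descriptors whose expiration time has passed (checked lazily here).
        for i in range(1, M):
            for page, (refs, expires) in list(self.q[i].items()):
                if expires <= self.current:
                    del self.q[i][page]
                    self.q[i - 1][page] = [0, self.current + TIMEOUT]
                    self.level[page] = i - 1

    def hottest(self, k):
        """Pages currently ranked most frequently accessed (top queues, MRU end first)."""
        out = []
        for i in range(M - 1, -1, -1):
            out.extend(reversed(self.q[i]))
        return out[:k]
```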

4.4.1.4 Migrating Pages Between Hot and Cold Tiers

To ensure that frequently accessed pages are kept in the hot tier, we periodically migrate pages between tiers based on their current access frequency. We employ an epoch-based scheme wherein the hypervisor uses page access information to identify pages that should be migrated between the tiers. Using this access information at the end of every epoch, a migration daemon moves pages between the tiers such that the most frequently accessed pages among both tiers are placed in the hot tier for the subsequent epoch, pushing less frequently accessed pages out to the cold tier if needed. To preserve data coherence, the hypervisor first protects any pages being migrated by restricting accesses to them for the duration of the migration and updates the corresponding page table entries. Once migration is complete, the hypervisor shoots down the stale TLB entries for the migrated pages and flushes any dirty cache lines belonging to those pages. The first subsequent access to a migrated page will trigger a TLB fault, and the first subsequent access to each cache line in a migrated page will suffer a cache miss, both of which we faithfully model in our simulations. A hardware block copy mechanism could reduce much of the overhead of migration, but we do not assume such a mechanism is available. An important consideration is how long of an epoch to employ. Long epochs better amortize migration overhead, but may miss program phase changes and suffer when suddenly "hot" data are kept in the cold tier too long. Another consideration is whether to limit the amount of migration per epoch. We found that a 40M-cycle epoch length and a limit of 500 (4 KB) page migrations per epoch achieved a good balance. We discuss the impact of these choices (epoch length and per-epoch migration cap) in Section 4.5.
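A minimal sketch of the per-epoch migration decision is shown below. The helper name, the data structures, and the way demotions are paired with promotions are assumptions; only the 500-page cap comes from the text.

```python
MIGRATION_CAP = 500   # at most 500 4 KB page migrations per epoch, as in the text

def plan_migrations(access_counts, hot_pages, hot_capacity):
    """Choose pages to promote/demote so the most frequently accessed pages
    end up in the hot tier for the next epoch.
    access_counts: {page: references this epoch}
    hot_pages:     set of pages currently in the hot tier
    hot_capacity:  hot-tier size, in pages"""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    desired_hot = set(ranked[:hot_capacity])
    promote = [p for p in ranked if p in desired_hot and p not in hot_pages]
    promote = promote[:MIGRATION_CAP]
    # Demote only as many cold-bound pages as needed to make room for promotions.
    free_slots = max(0, hot_capacity - len(hot_pages))
    demote = [p for p in hot_pages if p not in desired_hot]
    demote = demote[:max(0, len(promote) - free_slots)]
    return promote, demote
```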

4.5 Results

4.5.1 Evaluation Metric

Our focus is on power- and memory-constrained VM consolidation scenarios, which are increasingly common in data centers. We assume that all other resources are plentiful and do not impact VM performance. Memory bandwidth, CPU resources, and other shared resources are under-utilized in typical consolidation scenarios; e.g., Kozyrakis et al. found that large on-line services consume less than 10% of provisioned bandwidth, but close to 90% of memory capacity [71]. Thus, in our analysis, we assume that power limits how much memory a server can support and that the amount of memory available is the main constraint on the number of VMs that can be run effectively on a server. The goal of our work is to maximize aggregate server performance for a given power budget, which involves balancing the total amount of memory present in a server, which determines how many VMs can be loaded, against the amount of "hot" memory present, which affects per-VM performance. We evaluate iso-powered tiered memory on its ability to boost aggregate server performance in memory-constrained virtualized environments by increasing the available memory within a given memory system power budget. We evaluate the effectiveness of iso-powered tiered memory by simulating the performance of individual VMs when they are allocated different amounts of memory, which corresponds to allocating different numbers of VMs on a server with a fixed memory size. We also simulate the impact of memory being divided into hot and cold tiers and the overhead of managing the tiers, both of which impact individual VM performance. As a baseline, we first measure the memory footprint and performance of a single VM in a virtualized environment. In the following discussion, let X be the memory footprint of a particular VM instance, measured in 4 KB sized pages. If the VM instance is allocated at least X pages, it fits entirely in main memory and never suffers from a page fault. When we run more VMs than can fit entirely in a server's main memory, we allocate each VM a proportional fraction of memory. For example, if we run 2 VMs, one with an X-page memory footprint and one with a 2X-page memory footprint, on a server with X pages of memory, we allocate (1/3)X pages to the first VM and (2/3)X pages to the second. We report aggregate system performance when running N VMs as:

  Aggregate System Performance = Individual VM Performance × Number of VMs

Thus, to determine the aggregate system performance for a given tiered memory configuration, we first determine the corresponding per-VM memory capacity and then employ a cycle-accurate simulator to determine the performance of a single VM with this allocation.
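Written out as code, the metric and the proportional allocation rule look as follows; the footprint and performance numbers are placeholders.

```python
def per_vm_allocation(footprints_pages, server_pages):
    """Proportional split: every VM receives the same fraction of its footprint."""
    total = sum(footprints_pages)
    return [server_pages * f / total for f in footprints_pages]

def aggregate_performance(per_vm_performance, num_vms):
    return per_vm_performance * num_vms

# Example from the text: footprints of X and 2X pages on a server with X pages.
X = 20000
print(per_vm_allocation([X, 2 * X], X))      # -> [X/3, 2X/3] pages
print(aggregate_performance(0.95, 200))      # e.g., 200 VMs, each at 95% of solo speed
```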

Note again that we assume that memory capacity is the primary limit on the number of VMs that a server can support. Thus, we simulate VMs running in isolation and do not model contention for cores, memory controllers, or memory channels. This model is representative of typical consolidation scenarios, where the number of VMs ready to execute at any given moment is less than the number of cores and contention for memory controller and memory channel resources does not substantively impact performance. This approach allows us to evaluate points in the design space that are infeasible to simulate on a high-fidelity full system simulator, e.g., very large numbers of VMs running on large tiered memory configurations. To simplify the discussion and enable easier cross-experiment comparison, our baseline configuration entails 100 VMs running on a server with a nontiered memory of size 100X. The specific value of X is workload-dependent but unimportant — it simply denotes a baseline scenario where we allocate as many VMs as fit in the server's physical memory without paging. All other results are normalized to this baseline configuration.

4.5.2 Experimental Infrastructure and Methodology

We evaluate our proposal using the Mambo full system simulator [23]. Each VM is assigned a single out-of-order core. We employ a cycle-accurate power-performance model of a high-performance single-channel memory system with fully-buffered, DDR3 memory. Details of the core and memory system modeled are listed in Table 4.1. We evaluate applications from the PARSEC [22] and SPEC-CPU 2006 benchmark suites [51]. We use the simlarge inputs for PARSEC and the ref inputs for SPEC. For the SPEC benchmarks, we simulate 5 billion instructions after appropriate warmup. For each PARSEC benchmark, we use the specified Region-of-Interest (ROI) [22]. To determine the baseline performance of each workload, we model just enough memory to hold the entire workload image without paging. Each workload's pages are distributed uniformly over the available ranks. We model a 4 KB interleave granularity so each page fits within a single rank. To model scenarios with less memory, we simply reduce the simulated number of rows in each DRAM device without altering any other memory system parameters, as explained below. As an example, consider an application with an 80 MB (20,000 4 KB pages) memory footprint. In the baseline configuration (100 VMs running on a server with 100X DRAM capacity), the VM's 80 MB footprint would be uniformly distributed across 8 ranks of active DRAM. To model 125 VMs executing on the same memory configuration, we model the VM running on a system with 64 MB of DRAM (80 MB divided by 1.25) equally distributed across the 8 ranks of active DRAM. The VM would suffer page faults whenever it touched a page of memory not currently in its 64 MB DRAM allocation, in which case we model LRU page replacement.

Table 4.1: Simulator Parameters

Core Parameters:
  ISA           Power ISA
  Core Freq.    4 GHz
  L1 I-cache    32 KB/4-way, 1-cycle
  L1 D-cache    32 KB/8-way, 1-cycle
  L2 Cache      256 KB/8-way, 5-cycle
  L3 Cache      4 MB, 26-cycle

DRAM Device Power Parameters:
  DRAM cycle time   1.5 ns
  VDD               1.5 V
  IDD0              130 mA
  IDD2P (Slow)      12 mA
  IDD2P (Fast)      40 mA
  IDD2N             80 mA
  IDD3P             65 mA
  IDD3N             95 mA
  IDD4R             255 mA
  IDD4W             330 mA
  IDD5A             305 mA
  IDD6              10 mA
  watt per DQ       6.88 mW
  num DQS           1

Memory System Parameters:
  Memory Type                Fully-Buffered, DDR3-1333
  Memory Channel Bandwidth   6.4 GHz (2B read + 1B write) channel
  Buffer-Chip                Dual ported, each supporting 4, 64-bit ranks
  DRAM Device Parameters     Micron MT4741J512M4 DDR3-1333 x4 part [7],
                             tCL=tRCD=tRP=14 ns (9-9-9),
                             8 banks/device, 32K rows/bank, 2K cols/row
  Page Fault Penalty         1 msec
  Baseline organization      8, 2-port ranks
  Tiered memory config 1     24, 2-port ranks (2-hot, 22-cold)
  Tiered memory config 2     16, 2-port ranks (4-hot, 12-cold)

The simulator models every DRAM access to determine if it is to a hot rank, a cold rank, or a page not currently present in memory (thereby causing a page fault). The simulator emulates the tiered memory architecture described in Section 4.4, including the migration of data between tiers based on usage. Every DRAM rank is associated with a queue of pending requests. Requests at a hot rank are serviced as soon as the channel and rank are free. Requests to a cold rank that is currently in the low-power state are queued, but as described in Section 4.4, no request is delayed more than tq DRAM cycles. The simulator also models cross-tier data migration. The migration epoch length is configurable, but unless otherwise specified, all experiments used a 40M CPU cycle (10 ms) epoch. Sensitivity to epoch length is presented below. To track which pages are resident in memory, the simulator emulates OS page allocation and replacement. We model random page allocation, which corresponds to situations where the OS free page list is fragmented, as is common on servers that have been executing for more than a trivial amount of time. We conservatively model a low (1 msec) page fault service time, e.g., by using a Solid State Disk (SSD), but our results are not sensitive to higher service times, e.g., due to using a disk paging device, since performance degradation is severe even with a 1 msec service time.
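The capacity-scaling step described above can be summarized by the following sketch; the helper names are invented, and only the 80 MB / 125-VM example and the 1 msec fault penalty come from the text.

```python
PAGE_KB = 4

def per_vm_dram_mb(footprint_mb, overcommit):
    """Shrink each VM's simulated DRAM in proportion to the VM overcommit factor
    (modeled by reducing the number of rows per DRAM device)."""
    return footprint_mb / overcommit

def reference_faults(page, resident_pages):
    """A reference faults if its page is not resident; the simulator then performs
    LRU replacement and charges the 1 msec page fault penalty."""
    return page not in resident_pages

# Example from the text: an 80 MB VM when 125 VMs share the baseline capacity.
alloc_mb = per_vm_dram_mb(80, 1.25)                 # -> 64.0 MB per VM
print(alloc_mb, int(alloc_mb * 1024 // PAGE_KB))    # 64.0 MB, 16384 resident 4 KB pages
```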

4.5.3 Evaluation

Using the analytic model described in Section 4.4.1.1, we identified three candidate iso-power memory configurations:

• Baseline: 8 hot ranks and 0 cold ranks

• 2X-capacity: 4 hot ranks and 12 cold ranks

• 3X-capacity: 2 hot ranks and 22 cold ranks

As seen in Figure 4.3, even with only a 6:1 hot:cold idle power ratio, all three of these configurations are iso-power when at least 70% of requests are serviced by the hot tier (µ ≥ 0.7). The 2X-capacity configuration is iso-power even down to µ = 0.5, while the 3X-capacity configuration needs µ > 0.5. For the 2X- and 3X-capacity configurations, the additional ranks on the same channels as the baseline ones will see higher access latency [59], an effect that our simulator models. Recall that the total memory capacity of the baseline model is enough to hold 100 VMs in their entirety; the 2X- and 3X-capacity configurations have enough memory to hold 200 and 300 VMs, respectively, but a large portion of this memory is in self-refresh mode.

Figure 4.5 presents the results for one representative workload, libquantum, for each modeled memory configuration. Figure 4.5(a) plots normalized aggregate performance. As expected, baseline performance scales linearly up to 100 VMs, after which performance drops off dramatically due to the high overhead of page faults. The performance of the 2X-capacity configuration scales almost perfectly up to 200 VMs, where it achieves 176% of the aggregate performance of the baseline organization, after which it also suffers from high page fault overhead. The 3X-capacity configuration is able to squeeze out a bit more aggregate performance, 188% when 300 VMs are loaded.

There are several important trends to note in these results. First, there is a clear drop in the relative benefit of tiering going from the 2X- to 3X-capacity organizations. Figure 4.5(b) illustrates this effect by plotting per-VM performance for the 2X- and 3X-capacity configurations. The drop-off in per-VM performance in the 2X-capacity configuration is modest, never more than 15%, because 4 hot ranks are enough to capture most accesses. In contrast, with only 2 hot ranks, per-VM performance for the 3X-capacity configuration drops as much as 37% at 300 VMs. The impact of slower accesses to cold tiers causes a clear drop in benefit as we move to more aggressive tiering organizations.

Another trend to note in Figure 4.5 is the gap between the baseline and 3X-capacity results at 100 VMs and the gap between the 2X- and 3X-capacity results at 200 VMs. These gaps motivate a tiered memory architecture that adjusts the size of the hot and cold tiers as demand for memory capacity grows and shrinks. Under light load, baseline performance is best and 8 ranks should be kept hot, with the remainder turned off. As load increases, the system should shift to the 2X-capacity configuration, which continues to scale well up until 200 VMs are present. Finally, the system should shift to the 3X-capacity configuration under the heaviest loads. Overall, system performance with such a design should track the convex hull of the three curves. While we do not simulate the ability to shift aggregate capacity based on demand, it can be realized by using (capacity) page fault misses to trigger an increase in memory capacity by adding more cold ranks and suitably reducing the number of hot ranks.

Figure 4.6 presents the results for the remaining applications that we evaluated. The relatively smaller gaps between the baseline and 3X-capacity results at 100 VMs indicate that these applications are not as sensitive as libquantum to slower cold tier accesses.
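The capacity-shifting policy described above is not simulated, but a minimal sketch of such a selector is shown below; the configuration table mirrors the three iso-power design points, while the selection rule (pick the least aggressive tiering that still avoids capacity page faults) is our assumption.

    # (hot ranks, cold ranks, approximate VM capacity) for the three iso-power configurations.
    CONFIGS = [
        ("baseline",    8,  0, 100),
        ("2X-capacity", 4, 12, 200),
        ("3X-capacity", 2, 22, 300),
    ]

    def select_configuration(active_vms):
        """Pick the least aggressive tiering that still holds the current VM load."""
        for name, hot, cold, max_vms in CONFIGS:
            if active_vms <= max_vms:
                return name, hot, cold
        return CONFIGS[-1][0], CONFIGS[-1][1], CONFIGS[-1][2]   # heaviest load: 3X-capacity

    print(select_configuration(150))   # -> ('2X-capacity', 4, 12)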

Figure 4.5: Behavior of libquantum on a Tiered Memory System. (a) Aggregate Performance (libquantum); (b) Per VM Performance (libquantum).

Figure 4.6: Aggregate Performance for Baseline and 3X-capacity Configurations (panels (a)-(f)).


Figure 4.7: Peak Aggregate Performance for 2X- and 3X-capacity Memory Configurations.

Figure 4.7 presents the peak aggregate performance for all seven benchmarks for the 2X- and 3X-capacity configurations relative to the baseline design. The aggregate performance of x264, facesim, and fluidanimate scales almost linearly with memory capacity, achieving over 186% speedup on the 2X-capacity configuration and over 275% on the 3X-capacity configuration. soplex and ferret scale fairly well, achieving over 180% and 250% on the 2X- and 3X-capacity configurations, respectively. Finally, canneal and libquantum both scale well on the 2X-capacity configuration, but experience fairly small benefits when going to the 3X-capacity configuration.

To understand why individual applications benefit differently from the tiered organizations, we investigated how they access memory. Figure 4.8 presents the memory access characteristics of all seven benchmark programs. Figure 4.8(a) presents the total number of memory accesses per 1000 instructions, which indicates how memory-bound each application is. Figure 4.8(b) plots the number of cold tier accesses per 1000 instructions for the 3X-capacity configuration as we vary the number of VMs. These results confirm our theory of why the various benchmarks perform as they do on the various iso-powered tiered memory configurations. The two applications whose performance scales least well, canneal and libquantum, access memory more often in general and the cold tier in particular far more frequently than the other applications. In contrast, fluidanimate, facesim, and x264 access the cold tier very rarely even in the 3X-capacity configuration, and thus experience near-linear performance increases.

Figure 4.8: Memory Behavior of Benchmark Applications. (a) Memory Ops per 1000 instructions; (b) Cold Tier Ops per 1000 instructions.

In general, the benefit of using very large iso-power memory configurations is dependent on the spatial locality of a given application.

In addition to slower accesses to the cold tier, another potential overhead in our tiered memory architecture is the cost of migrating pages between tiers after each epoch. Short epochs can respond to application phase changes more rapidly, and thus reduce the rate of cold tier accesses, but the overhead of frequent migrations may overwhelm the benefit. To better understand the tradeoffs, we reran our experiments with four different epoch lengths: (i) 20M CPU cycles, (ii) 40M CPU cycles, (iii) 50M CPU cycles, and (iv) 100M CPU cycles (recall that our baseline evaluation described earlier used a 40M-cycle epoch). We found that while there was minor variation in performance for individual applications (e.g., soplex, canneal, and facesim performed best with 20M-cycle epochs, while ferret, fluidanimate, and x264 performed best with 100M-cycle epochs), the performance variance for all applications across all epoch lengths was less than 5%. Thus, we can conclude that for reasonable choices of epoch length, performance is insensitive to epoch length and there is no need to adjust it dynamically.

For our final sensitivity study, we evaluated the impact of limiting the maximum number of migrations per epoch. Our baseline results, reported earlier, limited the number of migrations per VM to 500 per epoch, which corresponds to a maximum of 1% runtime overhead for migration given the available memory bandwidth. Figure 4.9 presents normalized performance for four applications on the 3X-capacity configuration for two other migration limits: (i) 100 migrations per VM per epoch (20% of the original limit) and (ii) unlimited migrations.


Figure 4.9: Impact of Limiting Number of Migrations per Epoch.

In general, performance is insensitive to any limits on migrations, with the exception of libquantum when migrations were constrained to only 100 per VM per epoch. Consequently, we conclude that capping the number of migrations per epoch to bound migration overhead has little impact on performance, and it is better to allow many migrations than to artificially limit them.
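A minimal sketch of per-epoch migration planning under a per-VM cap is shown below; the data structures and the eviction rule are assumptions for illustration, not the simulator's exact implementation.

    def plan_migrations(access_counts, hot_resident, hot_capacity, limit=500):
        """Select up to `limit` hot-tier promotions for the next epoch.

        access_counts: dict mapping page -> accesses observed this epoch.
        hot_resident:  set of pages currently in the hot tier.
        """
        # Hottest non-resident pages first.
        candidates = sorted(
            (p for p in access_counts if p not in hot_resident),
            key=lambda p: access_counts[p], reverse=True)
        promotions = candidates[:limit]
        # Demote the coldest resident pages if the hot tier would overflow.
        overflow = max(0, len(hot_resident) + len(promotions) - hot_capacity)
        demotions = sorted(hot_resident, key=lambda p: access_counts.get(p, 0))[:overflow]
        return promotions, demotions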

4.6 Discussion

This section elaborates on some of the important experimental and implementation aspects of tiered memory.

4.6.1 Contention for Shared Resources Among VMs

The iso-powered tiered memory architecture does not consider or address contention among VMs for shared resources other than main memory capacity. In most server consolidation scenarios, memory capacity is the primary factor limiting the number of VMs that a particular server can host, not memory bandwidth or compute capability, and memory capacity is increasingly limited by power. We evaluated the memory bandwidth needs of our workloads, indirectly illustrated in Figure 4.8(a). In all cases, the memory bandwidth needs of these applications are significantly below the per-core bandwidth available on mid- to high-end servers, which are designed to run very memory-intensive workloads. When we modeled multiple VMs running in parallel sharing a single memory channel, the performance impact of contention for the channel (and associated memory controller) was negligible.

When memory bandwidth is a bottleneck, e.g., when the VMs being consolidated are running memory-intensive applications, the performance impact of increased memory bandwidth contention and iso-powered tiered memory latency overheads will tend to occur at the same time. As can be seen in Figures 4.7 and 4.8(a), the workloads with the highest memory bandwidth requirements, libquantum and canneal, experience the least benefit from tiering, at least for the 3X-capacity configuration. These kinds of memory-intensive applications are the same ones that will suffer (or cause) the most memory bandwidth contention problems in current consolidation scenarios, so a smart consolidation platform will consolidate these workloads less aggressively. A memory bandwidth-aware VM placement mechanism will naturally do a good job of iso-powered tiered memory VM placement.

Another factor that we do not address is limited compute capacity. Our models assume that if there is enough memory to hold N VMs, then there is sufficient compute capacity to execute them. In our experience, this is almost always the case for modern servers, at least for commercial applications that tend to have large memory footprints. As discussed earlier, per-processor core counts are growing much faster than memory capacity is, so memory capacity (power) is going to become an even more important factor limiting performance in the future. For situations where single-threaded performance is more important than aggregate performance, an iso-power tiered memory system may want to be less aggressive about using (slow) cold ranks.

4.6.2 Iso-power Validation

Iso-powered tiered memory organizations are guaranteed to fit under the same maximum power cap as the baseline design. This power cap is determined by the physical power delivery constraint of the system. The specific power consumed by each application on each configuration will vary, but will never exceed the baseline memory power budget, regardless of memory access pattern and migration frequency.

The baseline power cap is a server-specific value that depends on the total DRAM capacity, which determines the idle-state power consumption of DRAM, and the maximum rate of memory requests, which determines the additional active power component. Our iso-powered tiered architecture does not change the maximum bandwidth available, so the active power component of a particular tiered organization cannot exceed that of a worst-case memory-intensive application running on the baseline configuration, regardless of hot-cold access patterns or migration frequency. Migration bandwidth competes with demand request bandwidth (from different cores), and thus cannot cause the total bandwidth consumed by a channel to exceed its limit. We should note that even for the most migration-intensive combination of application and tiered memory configuration, migration bandwidth was only 200 MB/s, which is roughly 1% of available channel bandwidth (19.2 GB/s). Thus, both the power and performance impacts of migration are very small.

The static power component of tiered memory power is determined by the configuration. We choose iso-power configurations using the analytic model derived in Section 4.4.1.1. The power equivalence of a particular tiered configuration and the baseline design derives from the ratio of hot:cold idle power (Pratio) and the hit rate of access to the hot tier (µ).

Pratio is a constant for a given system, while µ can be measured at run-time; in the rare case of a poor µ, its impact on power can be controlled via request throttling. The configurations identified by our analytic model are conservative, but a real implementation of iso-powered tiered memory would likely include hardware-firmware mechanisms to enforce the power cap despite unexpected changes in DRAM device efficiency, memory access behavior, and ambient temperature.
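For illustration, a minimal sketch of an iso-power check is shown below. It assumes a deliberately simplified power model (idle power proportional to rank count, a fixed worst-case active budget, and a 2x active-energy penalty for cold-tier hits); the actual analytic model of Section 4.4.1.1 is more detailed, so the numbers here are only indicative.

    def is_iso_power(hot_ranks, cold_ranks, mu,
                     p_idle_hot=1.0, p_ratio=6.0, p_active=0.5,
                     baseline_hot_ranks=8):
        """Check a tiered configuration against the baseline power cap.

        Powers are in units of one hot rank's idle power; p_ratio is the hot:cold
        idle power ratio (6:1 in the text) and mu is the hot-tier hit rate."""
        p_idle_cold = p_idle_hot / p_ratio
        baseline = baseline_hot_ranks * p_idle_hot + p_active
        tiered = (hot_ranks * p_idle_hot + cold_ranks * p_idle_cold
                  + p_active * (mu + 2.0 * (1.0 - mu)))   # assumed cold-hit penalty
        return tiered <= baseline

    print(is_iso_power(2, 22, mu=0.7))   # 3X-capacity configuration at mu = 0.7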

4.6.3 Supporting Additional Memory

The goal of our iso-powered tiered memory architecture is to increase memory capacity within a fixed power constraint. DRAM ranks can be added to an existing on-board DRAM channel with FB-DIMMs. Memory can also be added in the form of memory extensions (e.g., IBM eX5) and dedicated memory extension enclosures (e.g., IBM MAX5). One could also consider enabling more channels, but this would require modifying the processor design to incorporate more memory controllers and allocate additional I/O pins. All such approaches could incorporate variants of our proposed tiered memory design. An alternative way to increase memory capacity at a fixed power budget is to use LR-DIMMs (load-reduced DIMMs) in place of traditional DDR3 DIMMs [70]. We consider this to be an appealing approach, but it is complementary to our proposal.

4.7 Related Work

Although there were some notions of memory tiering in very early computing systems, the modern development of multilevel main memory appears to start with the work of Ekman and Stenström [40, 41]. They propose a two-tier scheme similar to ours and make the same observation that only a small fraction of data is frequently accessed. They manage the hot and cold tiers by using the TLB to mark cold pages as inaccessible: when the cold tier is accessed, a page fault occurs, which causes the OS to copy the accessed data to the hot tier. In contrast, our scheme never experiences a page fault for data in the cold tier. Instead, we periodically reorganize data based on access frequency to bring critical data into the hot tier. Their work focuses on reducing the data transfer overhead of moving data from the slow tier to the faster tier during a page fault. Our approach requires (modest) additional hardware support to track page access frequency and manage the tiers, but in return is able to provide faster access to the cold tier and a more gradual drop in performance when application footprints exceed the hot tier size. Also, Ekman and Stenström do not attempt to maintain an iso-power guarantee with their memory implementation.

Liu et al. [81] propose mechanisms to reduce idle DRAM energy consumption. They employ refresh rates lower than recommended by the manufacturer, which can cause data loss. They propose techniques to identify the critical data of an application by introducing new programming language constructs along with OS and runtime support. Liu et al. claim that applications can cope with data loss for noncritical data. This approach to reducing memory power is orthogonal to ours, even though both entail introducing heterogeneity into the DRAM system. We employ heterogeneity to provide a power-performance trade-off, while they trade off power and data integrity.

Jang et al. [60] propose methods to reduce main memory energy consumption using VM scheduling algorithms in a consolidation environment. They propose to intelligently schedule VMs that access a given set of DRAM ranks to minimize the number of ranks in Active-Standby mode. We believe this approach is orthogonal to our proposal and can be combined with our scheme to provide increased aggregate memory power savings.

Lebeck et al. [75] propose dynamically powering down infrequently used parts of main memory. They detect clusters of infrequently used pages and map them to powered-down regions of DRAM. Our proposal provides a similar mechanism while guaranteeing we stay under a fixed memory power budget.

Ahn et al. [10] proposed an energy-efficient memory module, called a Multicore DIMM, built by grouping DRAM devices into multiple "virtual memory devices" with fine-grained control to reduce active DRAM power at the cost of some serialization overhead. This approach is orthogonal to our proposal.

Huang et al. proposed Power-Aware Virtual Memory [54] to reduce memory power using purely software techniques. They control the power-state of DRAM devices directly from software. Their scheme reconfigures page allocation dynamically to yield additional energy savings. They subsequently proposed extensions that incorporate hardware-software cooperation [55, 56] to further improve energy efficiency. Their mechanisms reduce the power consumption of a given memory configuration, whereas we use low-power DRAM modes to enable greater memory capacity in a fixed power budget.
Waldspurger [116] introduced several techniques (page sharing, reclaiming idle memory, and dynamic reallocation) to stretch the amount of memory available and thus enable more virtual machines to be run on a given system. The Difference Engine approach by Gupta et al. [48] and Satori by Murray et al. [88] tackle the same problem of managing VM memory size. These proposals are complementary to our approach, and leveraging them would lead to even higher aggregate numbers of VMs per server.

Ye et al. [123] present a methodology for analyzing the performance of a single virtual machine using hybrid memory. Using a custom virtualization platform, they study the performance impact of slower, second-level memory on applications. Their motivation is to leverage technologies like flash storage to reduce disk accesses. In contrast, our scheme aims to reduce the aggregate impact of tiering by managing data at a more fine-grained level while strictly conforming to a fixed power budget.

Heterogeneous DRAM-based memory designs were recently studied by Dong et al. [39] to improve total memory capacity. They propose using system-in-package and 3D stacking to build a heterogeneous main memory. They also move data between memory tiers using hardware- or OS-assisted migration. The heterogeneity in their design is along the dimensions of bandwidth and capacity. They focus on application performance and attempt to provide the lowest access latency and highest bandwidth to frequently accessed data. We also strive to provide the lowest access latency to frequently accessed data, but our work is different from their approach since we focus on a power budget and leverage existing low-power DRAM modes rather than emerging packaging technologies.

Flash-based hybrid memory schemes have also been researched extensively lately. Badam and Pai [15] developed a scheme to use flash memory in a flat address space along with DRAM while introducing minimal modification to user applications. Their scheme allows building systems with large memory capacity while being agnostic to where in the system architecture the flash capacity resides (PCIe bus, SATA interface, etc.). The performance characteristics of flash are quite different from those of DRAM placed in a low-power mode, so the architecture and performance tradeoffs of the two approaches are quite different. Our work could be combined with a flash-based hybrid memory scheme to create a memory architecture with an additional tier.

Finally, researchers recently proposed replacing traditional DRAM (DDR3) with low-power DRAM chips (LP-DDR2) designed for mobile and embedded environments [70].

LP-DDR2 devices consume less power than server DDR3 DRAM, but they have lower peak bandwidth and their design severely limits the number of devices that can be supported on a single DRAM channel, both of which run counter to the need to increase memory capacity and bandwidth. Iso-powered tiered memory is nonetheless equally applicable to LP-DDR2 main memories, and may even be more effective with LP-DDR2 since LP-DIMMs support even deeper low-power states than DDR3.

4.8 Summary

The goal of the tiered memory scheme is to increase a server's memory capacity for a fixed memory power budget. The design presented here builds a two-tier iso-powered memory architecture that exploits DRAM power modes to support 2X or 3X as much main memory for a constant power budget. In the design, the main memory is divided into a relatively smaller hot tier, with DRAM placed in the standby/active state, and a larger cold tier, with DRAM placed in the self-refresh state. Data placement is aggressively used to ensure that most of the memory accesses are serviced from the hot tier. Given the relative idle power of DRAM in active and self-refresh modes, the design is also able to provide a range of iso-power configurations that trade off performance (in the form of "hot tier" size) for total capacity. In a server consolidation environment, supporting 3X as much memory, even when much of it must be maintained in a slow low-power state, allows a single server to achieve 2-2.5X the aggregate performance of a server with a traditional memory architecture.

CHAPTER 5

INTELLIGENT NETWORK OF MEMORIES

In this chapter, we describe a future memory system built using 3D stacked DRAM devices. One of the major objectives of this design is to realize a memory system with larger capacity using point-to-point link technology for the memory channel. In this context, we discuss how data placement can be leveraged to reduce memory access delay and power consumption for such a system.

5.1 Introduction

The contribution of the memory system to total server power has been rising in the past few years for the following reasons: (1) processor cores have become simpler and more energy-efficient, while the microarchitecture of memory chips has seen little change; (2) processor clock frequencies and processor power have been relatively constant, while memory channel frequencies have been steadily increasing to boost pin bandwidth and feed data to a growing number of cores; (3) buffer chips [58], which consume nearly 10 W each, are typically required to provide high memory capacities.

A high-end server, such as the IBM p795 [57] or the HP DL580 G7 [1], provides memory capacities in the terabyte range. In such systems, the limited pins on a processor are required to provide high-bandwidth access to this large memory capacity. Since DDRx DRAM chips are high-density commodities, they cannot incorporate circuits for high-speed signaling. They therefore communicate via 72-bit DDR3 channels at modest frequencies. The DDR3 standard channel also accommodates only two DIMMs because of signal integrity concerns. With a DDR3 interface, a processor would only be able to support four channels, eight DIMMs, and a pin frequency of at most 933 MHz. Solutions such as LR-DIMM [2] can provide a one-time boost to memory capacity.

However, a substantial boost to memory capacity and bandwidth requires the processor to move away from the DDR3 interface. Processors can implement high-speed signaling and many narrow memory channels (links). This increases the bandwidth and memory capacity per pin. The processor uses these many high-speed serial links to communicate with buffer chips on the motherboard (or DIMM). The buffer chips, in turn, use the DDR3 interface to communicate with DRAM chips via wide and slow buses. An example of such a motherboard architecture for the high-end Intel Nehalem platform is shown in Figure 5.1. The Nehalem design uses Scalable Memory Buffer (SMB) chips that are rated at 10 W [58], much higher than the power consumption of most DIMMs (nearly 5 W for DDR3-1333 R-DIMMs). The SMB consumes a large amount of power because it has to convert between the DDR3 and SMI protocols, and because of its high-speed SerDes circuits. The high power of a similar AMB chip in the FB-DIMM architecture [90] contributed to its relative nonadoption. In fact, primarily because of power constraints, the Nehalem-EX approach has lower scalability than FB-DIMM. The AMBs of the FB-DIMM could be daisy-chained to boost capacity (up to eight DIMMs on a daisy-chained link), while the SMBs cannot be daisy-chained and only allow up to four DIMMs per processor link.

The emergence of 3D stacking may finally lead to a viable and power-efficient approach to memory capacity scaling. Of the many recent 3D memory products [95, 112, 42], Micron's Hybrid Memory Cube (HMC) [95] is a compelling design point. We will use it as a representative example throughout this chapter. The HMC prototype design stacks four DRAM chips on a logic chip. The logic chip incorporates the circuits for high-speed serial links that can connect to the processor or other HMCs. The HMC package therefore provides all the functionality of a traditional DIMM and its buffer chip. The HMC's logic layer is much more power efficient than a buffer chip because there is no need for an off-chip DDR3 interface and no need for conversion between DDR3 and (say) SMI protocols.
The logic layer receives requests on the high-speed links and accesses the DRAM banks via low-power TSVs. Because of the logic layer's relative power-efficiency, Micron envisions that multiple HMCs will be connected to grow memory capacity. However, a wide open problem before the research community is how the HMCs should be connected and managed to build an efficient network of memories. To our knowledge, this is the first body of public-domain work that attempts to address a portion of the above problem. Specifically, we first focus on estimating an optimal network topology. We explore the design space and show that a tapered-tree topology offers the best performance because of relatively low hop counts and low link contention. Most topologies offer nonuniform hop counts to different HMCs; to reduce the average hop count, we propose and evaluate a number of page placement strategies that retain popular pages at nearby HMCs.

Figure 5.1: The Intel Scalable Memory Interface and Buffer

With intelligent page placement, link contention is low enough that router bypassing techniques can be very effective. This alleviates the overheads of introducing routers with relatively complex functionality in the HMCs. The intelligent page placement also increases idle times in many of the HMCs, thus allowing frequent use of deep-sleep modes in these HMCs. The proposed iNoM architecture yields a 49.3% performance improvement and a 42% energy reduction, relative to a daisy-chained memory network with random page placement.

5.2 Intelligent Network of Memories (iNoM)

This section considers a number of design issues and innovations when constructing a network of memories. We first describe many candidate topologies and their trade-offs. We then argue for page placement mechanisms that can impact hop counts, memory queuing delays, and traffic patterns on the network. Because of the resulting traffic patterns, significant energy savings are possible by employing router bypassing and aggressive low-power memory modes.

5.2.1 Channel Topology

There are two primary constraints that need to be kept in mind for a network-of-memories: (1) memory node (HMC) pin count, and (2) traffic pattern. We explore the impact of these constraints on topology choices. First, to keep the packaging cost of the memory node low, we fix the I/O pin count of each node. Dally [35] and Agarwal [9] show that for fixed packaging constraints, lower radix networks offer lower packet latency due to lower router complexity and hence lower per-router delays. We therefore focus on low radix topologies [34] for the iNoM. Next, since these topologies will be primarily used for two-way traffic between the CPU and the memory, we only study topologies that are suitable for such a traffic pattern. Therefore, topologies like the hypercube [34] that are aimed at low-latency communication between any two network nodes are not attractive for an iNoM. Hierarchical topologies like the fat-tree [77], on the other hand, are especially well suited for this traffic pattern.

Based on these two design constraints, we examine the following five topologies: daisy chain, mesh, binary tree, fat tree, and tapered tree. These topologies are shown in detail in Figures 5.2 and 5.3. In all these topologies, the blue colored boxes represent the minimum repeatable unit that must be added when extending the memory system. In the case of the daisy-chain and mesh topologies, the minimum repeatable unit is simply the memory node (one HMC) itself; hence, there are no blue boxes in Figures 5.2(a) and 5.2(b). The mesh topology is a simple extensible network, but like the daisy chain topology, it can suffer from congestion based on the routing algorithm and the traffic pattern. The binary tree (BTree) is the simplest hierarchical topology; we show later that hierarchical topologies are well suited for some memory traffic patterns. A two-level fat tree (FTree) topology is another hierarchical topology that delivers high throughput even for random traffic patterns. This, however, requires a large number of links, all of which may not be used effectively. The tapered tree (TTree) topology reduces this cost by reducing the link count in the upper levels of the hierarchy.

For all the evaluations presented in this chapter, we use a cache-line striping policy that maps consecutive cache lines from the same OS page across all the memory channels; this striping is also compatible with the choice of a close-page policy. Each topology can be thought of as consisting of "rings" (or "sets") defined as HMC nodes that are equidistant in hop count from the CPU. An OS page, while being allocated, has its cache lines spread over all the HMCs in one ring only. Since topologies like the daisy chain, mesh, and FTree are symmetrical, they have an equal number of HMCs on all the rings. The BTree and TTree are asymmetrical and are organized among rings as described in Table 5.1. With an upper limit of 64 HMCs imposed by a memory power budget of 675 W (details in Section 5.3.1), the daisy-chain and mesh topologies are constructed using 64 HMCs, the BTree using 56, the FTree using 48, and the TTree using 44 HMCs. The organization of these HMCs on the rings is dictated by the topology and the number of HMCs that need to be directly connected to the CPU. The daisy chain and mesh each have 4 rings, the BTree and FTree have 3, and the TTree has 2 rings.
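A minimal sketch of the cache-line striping described above is shown below, assuming 128-byte cache lines and a hypothetical table mapping each ring to its member HMCs; the actual address mapping logic in the memory controller may differ.

    CACHE_LINE_BYTES = 128

    def hmc_for_address(phys_addr, page_ring, ring_members):
        """Stripe consecutive cache lines of a page across the HMCs of one ring."""
        hmcs = ring_members[page_ring]
        line = phys_addr // CACHE_LINE_BYTES
        return hmcs[line % len(hmcs)]      # consecutive lines land on consecutive HMCs

    # Example: the TTree's rings of 16 and 28 HMCs (ids are illustrative).
    ring_members = {0: list(range(16)), 1: list(range(16, 44))}
    print(hmc_for_address(0x2000, page_ring=0, ring_members=ring_members))   # -> 0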

(a) Daisy Chain

(b) Mesh

(c) Binary Tree

Figure 5.2: Daisy Chain, Mesh, and Binary Tree Topologies

(a) FTree

(b) TTree

Figure 5.3: Fat Tree and Tapered Tree Topologies

Table 5.1: Topology Configurations

  Topology       HMC Devices   Organization
  Daisy Chain    64            4 rings; 16 HMCs per ring
  Mesh           64            4 rings; 16 HMCs per ring
  Binary Tree    56            3 rings; 8, 16, 32 HMCs per ring
  Fat Tree       48            3 rings; 16 HMCs per ring
  Tapered Tree   44            2 rings; 16, 28 HMCs per ring

To maximize the benefits offered by the topologies with more links, we assume that the router implements fully adaptive routing with nonminimal routes. It handles deadlock based on a deadlock recovery approach (described next), which eliminates the need for virtual channels [34]. The routing algorithm works in two different modes. In the normal mode, any packet at the routing stage within a router selects the output port based on the congestion level at the downstream routers, while selecting output links on the minimal paths with a higher probability. The congestion level is measured based on the approach suggested by Gratz et al. [47]. To avoid deadlocks, each packet is permitted to reroute only a limited number of times and has to follow the minimal path when the allowed number of reroutes reaches zero. The number of possible reroutes per request is assigned based on the distance between the source and the destination, to control the power and latency overhead of rerouting. If a request waits for more than 1000 cycles, it broadcasts a deadlock alarm forcing all the routers into the deadlock mode. In the deadlock mode, all routers first move all the requests from the router buffers to special deadlock prevention (DP) buffers and then route them using a deadlock-free routing scheme such as dimension order routing in the mesh topology. Once all the requests in the DP buffers have been sent, the routers transition back to normal mode operation. The required DP buffers can easily be realized by using some of the DRAM capacity of the HMC. Virtual channel-based deadlock prevention schemes can also be used, but we experiment only with simple schemes like the one described above.
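A minimal sketch of the two-mode port selection is shown below; the congestion values, the bias towards minimal paths, and the data structures are illustrative assumptions, not the exact router implementation.

    import random
    from dataclasses import dataclass

    @dataclass
    class Packet:
        dest: int
        reroutes_left: int      # assigned from the source-destination distance

    @dataclass
    class RouterState:
        congestion: dict        # output port -> measured downstream congestion
        deadlock_mode: bool = False

    def choose_output_port(pkt, router, minimal_ports, all_ports, minimal_bias=0.75):
        """Normal-mode routing prefers minimal, lightly loaded ports; deadlock mode
        or an exhausted reroute budget forces a minimal path."""
        if router.deadlock_mode or pkt.reroutes_left == 0:
            return min(minimal_ports, key=lambda p: router.congestion[p])
        candidates = minimal_ports if random.random() < minimal_bias else all_ports
        port = min(candidates, key=lambda p: router.congestion[p])
        if port not in minimal_ports:
            pkt.reroutes_left -= 1      # a non-minimal hop spends one unit of the budget
        return port

    r = RouterState(congestion={0: 3, 1: 1, 2: 5})
    print(choose_output_port(Packet(dest=7, reroutes_left=2), r, [0, 2], [0, 1, 2]))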

5.2.2 OS Support for Data Allocation

The networks of memories described in the previous subsection yield nonuniform latencies for memory accesses. For example, HMCs in the first ring can be reached more quickly than HMCs in the second ring. The average memory access time can be reduced by placing frequently accessed pages in rings closer to the processor. This also reduces the average hop count and traffic on the network. The problem is similar to that of data placement in NUCA caches [68, 50, 14, 64] and we develop solutions that extend the ideas in prior work [68, 107]. However, there are some key differences: (1) we are dealing with data at page granularity, (2) there is no need for a search mechanism to locate data because the page tables and TLBs track page migrations, and (3) DRAM queuing delays impact our choice of policy thresholds.

We observe that moving the most popular data to the first ring can lead to long queuing delays when accessing the banks of the HMCs. It is therefore sometimes better to place a page in the next ring and incur the latency for the extra hop than to be stuck in a long DRAM Access Queue. In the outer ring, the OS page is striped among more HMCs, thereby decreasing the contention at any given HMC. As the HMC or the network topology options are varied, we experiment with different policy thresholds and observe that the optimal thresholds vary substantially. This is an important insight that is peculiar to a nonuniform memory system because of the long DRAM bank access time, and has not been reported in prior work.

We develop and evaluate four different page placement mechanisms, in addition to the baseline random policy in use today: dynamic, percolate-up, percolate-down, and profiled. The dynamic scheme allocates new pages randomly while monitoring memory accesses at run-time. It then periodically migrates heavily used OS pages to HMCs that are closer to the CPU. The periodic time interval is a tunable parameter called the Epoch. The scheme identifies heavily accessed OS pages using the methods proposed in [108, 122, 107]. These schemes augment the CPU memory controller and use the Multi-Queue algorithm [131] to rank OS pages according to their access frequency. This mechanism is described in detail shortly. Based on this information, the dynamic scheme migrates the most heavily accessed N pages closer to the CPU while moving the N infrequently accessed pages away from the CPU. This is achieved using an OS daemon that is activated every Epoch. As described above, the optimal value of N varies with our platform assumptions.

The percolate-down policy allocates new pages accessed by the application to the HMCs that are closer to the CPU. It also periodically pushes out N OS pages from the HMCs closest to the CPU, once the access rate to these pages falls below a tunable threshold. As a result, this policy requires fewer data movements during run-time, compared to the dynamic scheme where a minor reshuffle of page assignment is performed every epoch. In Section 5.3.2, we quantify these fewer data movements and provide the sensitivity of our results to the tunable thresholds. The percolate-up policy is similar to the percolate-down policy, but instead of allocating new pages closer to the CPU, it allocates new pages to HMCs farthest from the CPU.

Over the course of execution, heavily accessed OS pages are "promoted" to HMCs closer to the CPU, while infrequently touched pages are demoted. The pages to be promoted are identified by the Multi-Queue (MQ) tracking algorithm described shortly. The percolate-up/down policies can be thought of as insertion policies for caches, where percolate-down is equivalent to placing a new cache line in the Most Recently Used (MRU) position, and percolate-up amounts to placing the new cache line in the Least Recently Used (LRU) position. Finally, the profiled scheme leverages information about an application's memory access pattern from a prior run to allocate virtual to physical page translations. The scheme allocates the N% most frequently accessed pages to memory nodes closest to the CPU. This policy, while impractical, is designed to evaluate the maximum potential for improvement with a static data placement scheme.

The MQ algorithm to identify hot pages employs LRU queues of page descriptors, each of which includes a page number, reference count, and an expiration value. The descriptors in a given queue represent the pages that are most frequently accessed. Our implementation of this mechanism maintains 16 hierarchical queues with 256 entries each, i.e., 4096 total entries corresponding to a total of 16 MB of DRAM working set being tracked at any given time. The decision to track 16 MB of DRAM working set was made to minimize the overhead of tracking too many pages while still capturing a reasonable working set during an epoch. This was determined by performing a design space study to measure the typical working set size during an epoch. Every access to a page increments the reference count for that page and updates the expiration value, which is used to identify when the page can be demoted to a lower queue. Conversely, a page is promoted to the next higher queue when its reference count reaches a set threshold (8192 accesses in our evaluations). To avoid performance degradation, updates to these queues are performed by the memory controller off the critical path of memory accesses, using a separate queue of updates and a small on-chip SRAM cache. When pages are selected for promotion/migration to HMCs in ring-0, pages not resident in ring-0 that feature in any of these hierarchical queues are candidates for promotion to ring-0. Conversely, all pages not in the queues are candidates for demotion to outer rings.

In the evaluations, a page migration is modeled to occupy the link for 12 CPU cycles because a page is striped across at least 8 HMCs and it takes 1 CPU cycle to read one cache line per HMC. The application execution is delayed until all the page migrations for an epoch are complete. This introduces overheads, but note that a page migration is relatively fast because a page is striped across all HMCs in a ring; i.e., many high-bandwidth links are engaged for every page transfer. Furthermore, only a small number of pages are moved every epoch. The CPU-side memory controller simply issues commands to copy cache lines among a pair of source and destination HMCs. This does not add any new complexity to the controller and is similar to the OS copying data from the kernel memory space to the user memory space during DMA operations in current systems. Note that the HMC logic layer can be used to track page activities and to buffer pages as they are slowly migrated.
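The sketch below illustrates the flavor of the Multi-Queue tracking described above (16 queues of 256 descriptors, promotion at 8192 references); the expiration window and the demote-by-one-level rule are simplifying assumptions rather than the exact algorithm of [131].

    from collections import OrderedDict

    NUM_QUEUES, QUEUE_ENTRIES = 16, 256
    PROMOTE_THRESHOLD = 8192      # accesses before a descriptor moves up one queue
    LIFETIME = 100_000            # assumed expiration window, in tracked accesses

    class MultiQueueTracker:
        def __init__(self):
            # queues[i]: page -> [reference_count, expiration_time]
            self.queues = [OrderedDict() for _ in range(NUM_QUEUES)]
            self.now = 0

        def _find(self, page):
            for level, q in enumerate(self.queues):
                if page in q:
                    return level
            return None

        def access(self, page):
            self.now += 1
            level = self._find(page)
            if level is None:
                level = 0
                self.queues[0][page] = [0, self.now + LIFETIME]
                self._trim(0)
            entry = self.queues[level][page]
            entry[0] += 1
            entry[1] = self.now + LIFETIME
            if entry[0] >= PROMOTE_THRESHOLD and level < NUM_QUEUES - 1:
                del self.queues[level][page]
                self.queues[level + 1][page] = [0, self.now + LIFETIME]
                self._trim(level + 1)
            self._expire()

        def _trim(self, level):
            # Bound each queue to QUEUE_ENTRIES by evicting the LRU descriptor.
            while len(self.queues[level]) > QUEUE_ENTRIES:
                self.queues[level].popitem(last=False)

        def _expire(self):
            # Demote descriptors whose expiration time has passed by one level.
            for level in range(1, NUM_QUEUES):
                for p in [p for p, (_, exp) in self.queues[level].items() if exp < self.now]:
                    del self.queues[level][p]
                    self.queues[level - 1][p] = [0, self.now + LIFETIME]
                    self._trim(level - 1)

        def promotion_candidates(self):
            # Pages in the upper half of the queues are candidates for Ring-0.
            return [p for q in self.queues[NUM_QUEUES // 2:] for p in q]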

5.2.3 Supporting Optimizations

With the proposed topologies and data mapping policy, we observe that much of the network traffic is localized to the first ring. Therefore, contention on links beyond the first ring is relatively low. We also observe that the topologies with high radix (fat tree and tapered tree) do engage adaptive routing to yield high performance. However, a steep energy price is being paid to implement a high-radix router. We therefore believe that a customized topology can be constructed to selectively use additional links in parts of the network. We leave this for future work. In this study, we employ another simple technique to alleviate the power overhead of a lightly-loaded high-radix router. Prior work [119, 34] has proposed the use of buffer and router bypassing under low load. After the packets have been decoded by the SerDes circuits, we can examine the destination addresses of incoming packets to determine if an output port collision will result. If there is no known impending collision, the packet moves straight to the SerDes block at the output port, without consuming power in the remaining router stages. We also considered SerDes encode/decode bypassing if we could somehow propagate the destination address with the encoded packet. However, most SerDes power is within the analog circuits for the output drivers [106], so the power saving from bypassing the encode/decode would have been small.

The second important side-effect of our proposed topology and page placement is that DRAM banks and links in distant rings experience long idle times. To reduce the power in distant HMCs, the entire device can be placed in deep-sleep low-power modes. Power-down functionality is crucial to any large memory system since large capacity could mean wasted power if the extra capacity is not used by the applications. For the iNoM proposal, we consider two low-power modes of operation for each HMC. The logic layer implements the logic for transitioning the HMC into low-power states based on the memory traffic each HMC experiences. Note that this model is different from current designs, where the CPU-side memory controller is responsible for putting DIMMs in low-power states. The first low-power mode transitions the 3D DRAM layers to the Precharge Power-Down Mode [85] and also powers down most of the router functionality, except for the bypassing logic. We call this mode PD-0.

The next power-down mode, PD-1, transitions the SerDes link circuitry on the HMC logic layer into a low-power state [106, 110]. The difference between these two power-down modes is that an HMC in PD-0 can still propagate requests to its connected HMCs, although it cannot reroute packets based on network load. In the PD-1 mode, the HMC cannot forward or handle any request without first powering up. A beacon SerDes lane is kept active for the HMCs in PD-1 mode to wake up the rest of the circuitry when the transmitter on the other end of the link is ready to send data. This is a well-known technique used for low-power SerDes designs and is also implemented in commercial PCIe links [110]. We conservatively assume the wake-up time from the PD-1 state to be twice the wake-up time from the L1 state of PCIe. An HMC moves from the powered-up state to the PD-0 state after I idle cycles and to the PD-1 state after J idle cycles. Because of the longer idle times in our proposed design, HMCs remain powered down for long stretches, while minimally impacting performance.
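A minimal sketch of the per-HMC power-state transition logic is shown below; the idle-cycle thresholds I and J are tunable in our design, so the values used here are placeholders, while the exit latencies follow the 64 ns and 64 us figures used in our evaluation.

    # Placeholder idle thresholds (the I and J parameters from the text).
    I_IDLE_TO_PD0 = 1_000
    J_IDLE_TO_PD1 = 100_000
    WAKE_PD0_NS, WAKE_PD1_NS = 64, 64_000    # exit latencies used in the evaluation

    def next_power_state(state, idle_cycles, incoming_request):
        """Return (new_state, wake_penalty_ns) for one HMC."""
        if incoming_request:
            if state == "PD-1":
                return "ACTIVE", WAKE_PD1_NS    # full SerDes + DRAM wake-up
            if state == "PD-0":
                return "ACTIVE", WAKE_PD0_NS    # DRAM precharge power-down exit
            return "ACTIVE", 0
        if idle_cycles >= J_IDLE_TO_PD1:
            return "PD-1", 0                    # also power down the SerDes links
        if idle_cycles >= I_IDLE_TO_PD0:
            return "PD-0", 0                    # DRAM + most router logic powered down
        return state, 0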

5.3 Methodology and Results

5.3.1 Methodology

To evaluate the iNoM design, we use a combination of trace-based detailed cycle-accurate simulation and long non-cycle-accurate simulation using Pin [82]. The latter is only employed to verify that our page access statistics are consistent even when we run our programs to completion and experience memory capacity pressures. Unless otherwise specified, all the results shown are with the trace-based cycle-accurate simulator. This simulator processes multithreaded traces generated by Pin and models an in-order, 128-core chip-multiprocessor (CMP) system with the parameters listed in Table 5.2(a).

We experiment with eight applications from the NAS Parallel Benchmark (NPB) suite [16]. For these applications, the memory footprint can be controlled by varying the input set class. For each application, we chose the input class that resulted in the largest memory footprint while still allowing us to instrument the application using Pin. Since the complete memory trace for an application could be a file several terabytes in size, we imposed an upper limit of 1 billion memory accesses on the trace length for the cycle-accurate simulations. The simulated workloads, their input class, memory footprint, and LLC misses per kilo instructions (MPKI) for each application are listed in Table 5.2(b). The high MPKIs are consistent with those reported by others [125]; such workloads are a good stress test of memory bandwidth and capacity for future big-data workloads.

Table 5.2: Simulation Parameters

(a) CPU Parameters
  Parameter           Value
  CPU Core            In-order, 2GHz
  CMP size            128-core
  L1 I-cache          128KB/2-way, private, 1-cycle
  L1 D-cache          128KB/2-way, private, 1-cycle
  L2 Cache            4 MB, 8-way, private, 20-cycle
  L3 Cache            256 MB, 8-way, shared, 30-cycle
  Cache line size     128 Bytes
  Memory System Pins  1024

(b) Workloads
  Application   Footprint   MPKI
  CG.D          17 GB       359.1
  BT.E          169 GB      262.3
  SP.E          203 GB      238.5
  MG.E          211 GB      227.5
  IS.D          36 GB       198.9
  UA.D          10 GB       187.6
  FT.D          82 GB       158.0

(c) HMC Parameters (* = Projected Parameters)
  Parameter            Value
  DRAM Capacity        512 MB
  Num. Links*          8
  Link Bandwidth       40 GB/s
  Usable Bandwidth*    256 GB/s
  I/O Pins*            512
  Power Consumption    11 W
  Energy per Bit       10.8 pJ/bit

(d) Router Parameters
  Parameter          Value
  Frequency          2 GHz
  Num Links          8
  I/O Buffer Depth   8
  DRAM Q Depth       8
  Response Q Depth   8

This work targets the design of high memory capacity server systems, for example, the IBM p670 server [76]. Given the 675 W per-socket memory power budget in the IBM p670 [76], and the 11 W rating of one HMC, we assume that our memory system can accommodate 44-64 HMCs based on topology, as described in Table 5.1. HMC device parameters [102, 61] are listed in Table 5.2(c). These parameters match the 2nd-generation HMC numbers disclosed by Sandhu [102]. Instead of assuming 4 links per HMC, as in the 2nd-generation device, 8 links per HMC are modeled. This higher number of links is on the HMC roadmap [102]; the 5th-generation HMC has 8 links per HMC. This choice allows us to experiment with a richer set of topologies.

The current HMC prototype consumes 10.8 pJ/bit of energy. Out of this 10.8 pJ, 6.78 pJ/bit is used by the logic layer, and 3.7 pJ/bit by the DRAM layers [61]. The HMC power breakdowns [102] indicate that the prototype design may not include a sophisticated router. We therefore increment the energy-per-bit value for the HMC with the energy-per-bit (3.91 pJ/bit) for a generic 4x4 router pipeline to include functionality like the I/O buffers, routing logic, and the bypass logic. We arrive at this 3.91 pJ/bit number by considering the router implemented by Howard et al. [52] for the Intel 48-core SCC chip, and scaling it for bandwidth using the estimates provided for the active and static power consumption. This router energy is scaled up or down based on the radix for the topology being modeled, i.e., a daisy chain uses a 3x3 router, and a tapered tree uses a 9x9 router; these scaling factors are estimated by modeling routers with different radix in Orion-2.0 [66]. The manufacturing technology for the HMC logic layer is taken as the 45nm process, similar to that for the Intel SCC.

In both the PD-0 and PD-1 states, the DRAM is assumed to be powered down. When powered down, a 1 Gb DRAM die consumes 18 mW [85]. In these power-down states, each router consumes 0.1 W [53]. In the PD-1 state, the SerDes links are powered down and consume 10 microwatts per lane [110]. We conservatively assume the wake-up time from the PD-1 state to be twice the wake-up time from the L1 state of PCIe [13].

The proposed router on the logic layer of the HMC is modeled in detail with finite-length I/O buffers. Each I/O link on the router has an eight-request-deep FIFO buffer associated with it. The router delay is modeled as described in [96] and the router parameters for our evaluations are listed in Table 5.2(d). The DRAM Access Queue dequeues four requests per router cycle, one for each 3D DRAM layer. We model 128 banks across these four DRAM layers and assume each bank operates independently [95].

A closed-page row buffer policy is modeled for each of these banks. Once the DRAM core accesses are complete, the requests are injected into the Completed Request Queue. The router logic can schedule up to eight completed requests from this queue (one request per output port). It preferentially schedules requests from the completed queue before scheduling requests from the input buffers, unless the requests in the input buffers have been waiting in the buffer for longer than 10 router cycles. The data allocation policies that migrate OS pages have also been modeled in detail. The optimal thresholds for migration and the number of migrated pages vary across topologies and are experimentally determined. This is because the DRAM queuing delays can rise if too many pages are moved to the first ring. The latency and energy overheads for page migration have also been modeled, assuming 400 ns for each 4 KB OS page that is migrated.
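A minimal sketch of one router cycle of this scheduling model is shown below; queue entries are assumed to be (enqueue_time, request) tuples, and the structure is illustrative rather than the simulator's exact code.

    MAX_INPUT_WAIT = 10          # router cycles before an input request is prioritized
    DRAM_DEQUEUE_PER_CYCLE = 4   # one request per 3D DRAM layer
    NUM_OUTPUT_PORTS = 8

    def schedule_router_cycle(dram_q, completed_q, input_buffers, now):
        """Schedule one HMC router cycle; queues hold (enqueue_time, request) tuples."""
        scheduled = []
        # Start up to four DRAM accesses, one per stacked DRAM layer.
        for _ in range(min(DRAM_DEQUEUE_PER_CYCLE, len(dram_q))):
            scheduled.append(("dram", dram_q.pop(0)))
        ports_left = NUM_OUTPUT_PORTS
        # Aged input requests (waiting > MAX_INPUT_WAIT cycles) jump ahead of responses.
        aged = [r for r in input_buffers if now - r[0] > MAX_INPUT_WAIT]
        for req in aged[:ports_left]:
            input_buffers.remove(req)
            scheduled.append(("forward", req))
            ports_left -= 1
        # Otherwise, completed responses are preferred over buffered input requests.
        for _ in range(min(ports_left, len(completed_q))):
            scheduled.append(("respond", completed_q.pop(0)))
        return scheduled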

5.3.2 Results

5.3.2.1 Execution Time

To understand the impact of topology and data allocation policy, we first present results showing their impact on application execution time. Topologies with multiple paths to the CPU (like the mesh and FTree) are expected to perform better than topologies with fewer paths, like the daisy chain and the BTree. Similarly, allocation policies with adaptive data placement (Dynamic, Percolate-*, etc.) are expected to outperform static allocation schemes. The results for these experiments are plotted in Figure 5.4. The execution times are normalized against the daisy-chain topology with the Random allocation scheme. The TTree topology with Percolate-down allocation outperforms all other practical allocation schemes, with an average improvement of 49.3% compared to the daisy-chain topology with Random allocation. This improvement is within 4.2% of the Profiled allocation policy for the same topology. With P-down allocation, the TTree topology outperforms the daisy chain by 40.1%, the mesh by 30.1%, the BTree by 57%, and the FTree by 17.8%. In the following sections, we delve deeper to understand the cause of these patterns.

5.3.2.2 Memory Access Delay

In this section, we analyze the memory delay incurred in servicing a single memory request. For clarity, Figure 5.5(a) plots the average memory access delay rather than results for individual applications. The average memory delay follows similar trends as application execution time in Figure 5.4. This is to be expected since we model an in-order CPU core.

Figure 5.4: Application Execution Time Normalized w.r.t. Daisy Chain Topology with Random Allocation Policy

Also note that the access delay is sensitive to the thresholds set for the allocation policies. The rate at which data are migrated by the adaptive policies (Dynamic, Percolate-*) therefore varies for each topology. Figure 5.5(a) plots the result for each topology with the best threshold settings for that topology. To understand how the topology and data placement policy impact the access delay, Figure 5.5(b) breaks down the access delay into the fraction of time spent among the various components of the system. Each access is broken down into the time spent in the CPU-side controller queue (MC Q in the figure), link delay, HMC input buffer delay (I/O Buff in the figure), router delay, the DRAM access queue delay, the completed request queue delay (Resp. Q in the figure), and the actual DRAM core access time. Ideally, the fraction of time spent accessing the DRAM should be 100%, as time spent in all other components is essentially an overhead to access the DRAM core. As can be seen from the figure, moving left-to-right increases the sophistication of the topology and data placement policy. This directly correlates with an increasing fraction of time being spent in DRAM core access, i.e., a reduction in the overhead to reach the memory. Router bypassing is critical in reducing the access delay, and our results indicate that for the TTree with P-down, the input buffers are bypassed for nearly 86.4% of the requests on average. We next explain the reasons for the change in the fraction of time spent in each component as we move towards more sophisticated topologies and data placement.

(a)

(b)

Figure 5.5: Memory Access Delay

The MC Q time is impacted by the contention for the link directly connecting the CPU to an HMC. Multiple paths to reach a destination HMC from the CPU, as in the Mesh, FTree, and TTree topologies, help reduce this contention. As a result, the fraction of time spent in the MC Q with Profiled allocation reduces from 9.8% for the BTree topology (which has the most contention due to all HMCs being accessed via few CPU-HMC links) to 4% for the TTree topology, which has the highest number of independent paths from the CPU to the HMCs.

The I/O buffer delays on the router are incurred due to the inability of the router to bypass the requests. Since the router gives priority to the injection port, a long buffer delay implies that the given HMC is completing requests at a fast rate. The daisy chain topology sees the most stalls for output ports. When data are not primarily allocated to Ring-0, some of these stalls show up as I/O buffer delays (responses from beyond Ring-0 that get stuck at Ring-0) and some show up as Response queue delays. When data are primarily allocated to Ring-0 (as with P-down and Profiled), the stalls mainly show up as Response queue delays because the Ring-0 HMCs are servicing requests at a fast rate. As path diversity increases with the TTree topology, both delays go down significantly. The router delays in Figure 5.5(b) are a function of the typical hop count and the ability to bypass the router. Again, these are substantially lower for the TTree with P-down. The DRAM queue delay increases as we move to sophisticated page mapping policies. For example, the Profiled policy tends to place many pages in Ring-0 and these HMCs tend to have long DRAM access queues. The DRAM core access time is roughly a constant across all designs (since we use a close-page policy), but its contribution grows as we move to the right because all other components shrink. Finally, the link delays are a function of hop count and they are lowest with intelligent page placement.

To further understand how different data allocation schemes impact the number of memory requests served from HMCs in a given topology, we measure the fraction of total requests served from each ring of the iNoM architecture. With intelligent allocation of data, the goal is to serve as many requests as possible from HMCs closest to the CPU. Ideally, all requests should be served from Ring-0; however, due to capacity pressures and the dynamically changing placement of data with the Dynamic and Percolate-* policies, not all requests can be served from Ring-0. Figure 5.6 plots the request distribution among rings, organized by topology. Since different rings of a topology provide a varying fraction of total memory capacity, the last bar in each topology group plots the fraction of total capacity supplied by each ring. The trend in Figure 5.6 shows that with intelligent allocation, the fraction of requests served from Ring-0 increases significantly. For the TTree topology, the percentage of requests served from Ring-0 changes from 39% with the Random policy to 88% with the Percolate-down policy.

Figure 5.6: Request Distribution

To make sure that the request distribution in Figure 5.6 is not skewed by limiting our simulations to only a billion DRAM accesses, we ran the NPB applications with the D input set to completion, using the non-cycle-accurate Pin-based simulator described in Section 5.3.1. Such an experiment is necessary to confirm that our policies are robust even when there are capacity pressures in each ring and the application goes through different phases. We observed a very similar request distribution to rings even with the complete application simulation (not shown for space reasons).

While the HMC offers 128 independent banks, it is not clear how many requests can be initiated simultaneously in the same cycle. This may be a function of whether DRAM circuits/interconnects are shared, or whether power delivery limits are being exceeded. In most of our evaluation, we assume that 4 requests to different banks can be simultaneously de-queued and initiated. The DRAM queuing delays for our best organizations are nontrivial. These delays grow even more if we assume that only 1 request can be initiated in a cycle. These delays are high enough that it sometimes makes sense to move data to distant rings to reduce DRAM queuing delay. This is a phenomenon not observed by prior work on NUCA caches. As an example case study, Figure 5.7 shows a model where 1 request is de-queued every cycle for the P-down policy and the daisy chain topology.

Figure 5.7: Memory Access with Relaxed Thresholds for Data Migration

The overall memory access latency decreases (the line, plotted on the right axis) as fewer requests are placed in Ring-0 (by relaxing the thresholds and parameters of the P-down policy), as shown by the breakdown bars.
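The sensitivity to de-queue width can be illustrated with a toy queueing model. The sketch below is not the cycle-accurate simulator used for these results; the arrival process and the load level are made up purely to show how queueing delay grows when only 1 request, rather than 4, can be initiated per cycle.

    # Toy model of a single HMC DRAM access queue: up to `width` requests can
    # be initiated per cycle. Arrivals and load are illustrative placeholders.
    import random
    from collections import deque

    def mean_queue_delay(width, load=0.8, cycles=200_000, seed=1):
        """Average waiting time (in cycles) before a request is initiated."""
        random.seed(seed)
        queue, total_wait, served = deque(), 0, 0
        for cycle in range(cycles):
            # Bernoulli arrivals in four sub-slots, averaging `load` per cycle.
            for _ in range(4):
                if random.random() < load / 4:
                    queue.append(cycle)
            # Initiate at most `width` queued requests this cycle.
            for _ in range(min(width, len(queue))):
                total_wait += cycle - queue.popleft()
                served += 1
        return total_wait / max(served, 1)

    if __name__ == "__main__":
        for width in (1, 4):
            print(f"de-queue width {width}: "
                  f"mean queueing delay {mean_queue_delay(width):.2f} cycles")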

5.3.2.3 Energy Consumption

To understand the energy consumed per memory request by the various topologies and allocation schemes, we measured the average energy consumed per bit and then broke it down into the energy-consuming components of our design. This breakdown includes the energy consumed by the DRAM core access, the SerDes link energy and HMC memory interface logic, and the energy consumed by the structures introduced as part of the router logic, i.e., the router queues (the DRAM access queue and the completed-request queue), the I/O buffers, and the routing logic itself. We note that the I/O buffer and response queue occupancies decrease with better topologies and page placements. As a result, we expect the energy contribution of these structures to decrease, similar to the trend in memory delay. These results are plotted in Figure 5.8: Figure 5.8(a) shows the average energy consumed per bit and Figure 5.8(b) plots the breakdown. Note that an HMC directly connected to the CPU consumes 10.48 pJ/bit (assuming no energy for router components).

(a) Average Energy Consumed per Bit

(b) Energy Breakdown per Memory Request

Figure 5.8: Energy Consumption

If the HMCs are connected in a daisy chain with only two rings, the energy per bit is expected to more than double (because of the router components) and to scale linearly as more rings are added. The overall impact of this is captured in Figure 5.8(a): the daisy chain topology with Random placement consumes 37.1 pJ/bit, while the TTree topology with Percolate-down placement consumes 42% less energy per bit. It is important to note that the TTree topology has a more complex 9x9 router (10.15 pJ/bit), while the daisy chain topology has a simpler 3x3 router (1.95 pJ/bit). In spite of this disadvantage, the TTree topology reduces energy per bit because of the reduced number of hops. Energy is also saved in the processor because of the shorter execution time afforded by the TTree topology. Figure 5.8(b) further shows that the TTree topology with Percolate-down allocation spends nearly 29% of its energy on the routing logic and buffers, compared to nearly 53% for the daisy chain topology with Random allocation. Most of this energy reduction comes from bypassing the router I/O buffers; as noted earlier, in the TTree-P-down configuration the I/O buffers are bypassed for nearly 86.4% of requests on average.
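The energy-per-bit numbers are essentially an accounting exercise over hop counts and per-component costs. The sketch below reproduces that arithmetic with the per-component figures quoted above as inputs; the hop counts, buffer cost, and bypass fractions in the example call are illustrative placeholders, not the measured averages (per-hop SerDes energy is folded into the router term for simplicity).

    # Back-of-the-envelope energy-per-bit accounting for a network of HMCs.
    # Component costs below are the figures quoted in the text (pJ/bit); hop
    # counts, buffer cost, and bypass fractions are illustrative placeholders.
    DRAM_AND_INTERFACE_PJ = 10.48   # HMC directly attached to the CPU
    ROUTER_3x3_PJ = 1.95            # simple daisy-chain router
    ROUTER_9x9_PJ = 10.15           # TTree router

    def energy_per_bit(base_pj, router_pj, avg_hops, bypass_fraction, buffer_pj):
        """Fixed DRAM/interface cost plus a per-hop router cost, where hops
        that cannot bypass the router pay an extra I/O-buffer cost."""
        buffered_hops = avg_hops * (1.0 - bypass_fraction)
        return base_pj + avg_hops * router_pj + buffered_hops * buffer_pj

    if __name__ == "__main__":
        daisy = energy_per_bit(DRAM_AND_INTERFACE_PJ, ROUTER_3x3_PJ,
                               avg_hops=8, bypass_fraction=0.3, buffer_pj=1.0)
        ttree = energy_per_bit(DRAM_AND_INTERFACE_PJ, ROUTER_9x9_PJ,
                               avg_hops=1.5, bypass_fraction=0.85, buffer_pj=1.0)
        print(f"daisy chain ~{daisy:.1f} pJ/bit, tapered tree ~{ttree:.1f} pJ/bit")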

5.3.2.4 Energy Savings with Power-Down Modes

As described in Section 5.2.3, with active data management the iNoM architecture provides an opportunity to put idle HMCs into low-power states. We consider two deep-sleep modes (PD-0 and PD-1) that progressively turn off functionality to reduce power consumption. In this section, we quantify the reduction in energy per bit and the associated increase in average execution time caused by the exit latency required to wake an HMC from a low-power mode. We consider two policies: one that never attempts PD-1, and one that first switches to PD-0 and then to PD-1. With PD-1, we expect the increase in execution time to be larger than with PD-0, since exiting PD-1 takes nearly 1000X longer than exiting PD-0 (64 us vs. 64 ns). Figure 5.9 shows the results of these experiments for the daisy chain, FTree, and TTree topologies. We only show the Percolate-down and Profiled allocation schemes since these are best suited to providing power-down opportunities for distant HMCs. Figure 5.9(a) shows the reduction in energy per bit compared to a system with no power-down support. The daisy chain topology yields large energy savings because it has more HMCs in total, and thus more idle time per HMC. The TTree topology with Percolate-down yields a 26% reduction in energy per bit while engaging only the PD-0 mode. In terms of performance, the TTree topology shows the smallest slowdowns since its distant rings are accessed infrequently; the TTree Percolate-down policy yields a 5% increase in execution time for PD-0.

(a) Reduction in Energy per Bit

(b) Increase in Execution Time

Figure 5.9: Impact of Low-Power Modes on Energy per Bit and Execution Time

An important point to note is that the Random page placement policy (not shown in the figure) provides almost no opportunities for power-down, yielding only a 1% reduction in energy per bit.
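The two-level power-down policy can be summarized as a small per-HMC state machine: an idle HMC first drops to PD-0 and, after a longer idle period, to PD-1, and a subsequent request pays the corresponding exit latency. In the sketch below, only the exit latencies (64 ns and 64 us) come from the text; the idle-time thresholds for entering each mode are hypothetical.

    # Per-HMC power state machine for the two deep-sleep modes discussed above.
    # Exit latencies are from the text; idle thresholds are placeholders.
    ACTIVE, PD0, PD1 = "ACTIVE", "PD-0", "PD-1"
    EXIT_LATENCY_NS = {ACTIVE: 0, PD0: 64, PD1: 64_000}
    ENTER_PD0_AFTER_NS = 2_000       # placeholder idle threshold
    ENTER_PD1_AFTER_NS = 200_000     # placeholder idle threshold

    class HMCPowerState:
        def __init__(self, allow_pd1=True):
            self.state = ACTIVE
            self.last_access_ns = 0
            self.allow_pd1 = allow_pd1   # False models the "never PD-1" policy

        def tick(self, now_ns):
            """Called periodically: demote an idle HMC to deeper sleep states."""
            idle = now_ns - self.last_access_ns
            if self.allow_pd1 and idle >= ENTER_PD1_AFTER_NS:
                self.state = PD1
            elif idle >= ENTER_PD0_AFTER_NS and self.state == ACTIVE:
                self.state = PD0

        def access(self, now_ns):
            """A request arrives: pay the exit latency of the current state."""
            wakeup = EXIT_LATENCY_NS[self.state]
            self.state = ACTIVE
            self.last_access_ns = now_ns + wakeup
            return wakeup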

5.4 Related Work

There is little prior work that has considered the design of a network of memories, especially for 3D-stacked memory devices. Recent work explores design considerations when using on-board buffer chips to expand capacity and bandwidth [124, 29]. The RAMCloud project [94] builds an aggregated memory space that is spread across many compute nodes, thus resembling a network of memories. Similarly, memory blades [78, 79] have been used to provide on-demand memory capacity for compute blades, another example where different portions of memory are accessible over a network. The most closely related work is the recently proposed DIMM-Tree architecture [113], where DIMMs are connected using a combination of multidrop and point-to-point links to form a tree-like hierarchy. Siblings in the tree are connected with a multidrop bus, and the different levels of the tree are connected with point-to-point links. This allows the memory access latency to grow logarithmically with the number of DIMMs rather than linearly, as in FB-DIMMs. While that work demonstrates the need for and the feasibility of a nontrivial network of DIMMs on the motherboard, ours is the first to examine different network topologies and to consider the impact of data allocation in such networks in the context of future 3D memory packages.

Our proposed memory architecture places critical pages in rings of devices at varying distances from the processor. This is reminiscent of earlier schemes that dictate data placement for NUCA caches [11, 18, 19, 68, 50, 14, 64]. However, as noted in Section 5.2.2, there are many differences from prior NUCA work. In some cases, data must be spread out to minimize DRAM queuing delays. Also, there is a much stronger interaction between data placement policies and network/DRAM optimizations, such as router bypassing and DRAM power-down modes.

5.5 Summary

Future high-capacity memory systems are likely to be built using point-to-point serial links instead of parallel buses, due to pin count and signal integrity limitations. In this chapter, we showed that for such high-capacity systems it is better to invest in richer topologies (such as the tapered tree) with more complex routers when building a network of memories. Combined with smart page placement policies and router bypassing, such a design yields better performance and better energy per bit by reducing hop counts, buffering, and queuing delays.

It also leads to more power-down opportunities. The design space for memory networks is large, and we believe that several future innovations are possible, including the design of custom topologies that selectively add links to target network hotspots. With the techniques described in this chapter, a network-of-memories design that combines the tapered tree topology with intelligent data placement outperforms the daisy chain topology by 49.3% in application execution time over a suite of benchmarks. This design can also reduce energy per bit by 42% when low-power modes are leveraged along with intelligent data placement.

CHAPTER 6

CONCLUSIONS AND FUTURE WORK

This chapter summarizes the techniques presented in this dissertation and describes future extensions of data management concepts for emerging memory systems.

6.1 Conclusions

In this dissertation, we propose mechanisms that use intelligent data placement to improve DRAM main memory access. These mechanisms improve access latency and energy efficiency simultaneously (micro-pages), reduce power/energy consumption and thereby increase memory capacity within a fixed power budget (tiered memory), and improve access latency and power efficiency for future memory systems built as a network of memories (iNoM). The common theme among all the schemes proposed in this thesis is the use of data placement to minimize DRAM access inefficiencies.

Data management has traditionally been abstracted away at the operating system level. This dissertation demonstrates that, given the increasingly difficult challenges faced by memory systems, data management can be a powerful tool for optimizing future memory systems. The broad impact of data placement on memory access characteristics is a critical observation of this dissertation. Another, equally significant, observation is that data management schemes require little to no hardware change to implement. This is relevant because the DRAM industry is extremely cost sensitive, and even a minor change to the DRAM device architecture faces significant challenges for adoption. To summarize, we now list the major highlights of each data management scheme described in this dissertation.

• In Chapter 3, we presented data management schemes that first identify the frequently accessed segments of an OS page. Using this information, these chunks are coalesced to reside in the same DRAM row-buffer by changing the data allocation on the fly.

This results in increased row-buffer hit rates, which have been diminishing due to the independent memory access streams generated by multicore CPUs. Increased row-buffer hits reduce memory access latency while simultaneously reducing memory energy consumption. Our data management schemes show an average 9% increase in performance while reducing memory energy consumption by 15% on average (a sketch of the chunk-coalescing idea appears after this list).

• In Chapter 4, we described a mechanism that leverages DRAM low-power modes aggressively by consolidating actively used data in a few DRAM ranks. This enables transitioning ranks with low activity to the self-refresh state, reducing their standby power consumption. We also designed a simple mechanism that batches memory requests destined for the low-power ranks to prevent frequent transitions to and from low-power states, while bounding the resulting increase in DRAM access latency to limit the adverse impact of batching on applications (a sketch of this batching idea also appears after this list). With these techniques, we were able to increase DRAM memory capacity by 2-3X in the same power budget. This allowed us to run as many as 3X the number of virtual machines compared to the baseline, while achieving an aggregate 250% improvement in performance.

• In Chapter 5, we investigated a possible future memory system design that uses point-to-point serial links for the memory channel along with 3D-integrated memory devices. We first examined the logic required to implement a network of memories using point-to-point interconnects, then studied different topologies suitable for memory systems, and finally investigated in detail the data management policies needed to improve memory access. Our results show that data management plays a significant role in such a network of memories: access latency and energy consumption can be significantly improved by intelligent allocation of data. With data management and a suitable network topology, application performance improves by an average of 49%, accompanied by a 42% reduction in energy, relative to a simple daisy-chained topology with random data placement.
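To make the first two bullets concrete, two minimal sketches follow. The first illustrates the general micro-pages idea from Chapter 3: access counts are kept per sub-page chunk, and at the end of an epoch the hottest chunks are remapped into a small reserved region so that they share row buffers. The chunk size, region size, and epoch boundary below are placeholders, not the parameters evaluated in Chapter 3.

    # Illustrative sketch of hot-chunk identification and remapping.
    from collections import Counter

    CHUNK_SIZE = 1024        # bytes per micro-page (placeholder)
    RESERVED_CHUNKS = 64     # capacity of the row-buffer-friendly region (placeholder)

    class MicroPageMapper:
        def __init__(self):
            self.counts = Counter()   # (page, chunk) -> accesses this epoch
            self.remapped = {}        # (page, chunk) -> slot in reserved region

        def record_access(self, virtual_addr, page_size=4096):
            page, offset = divmod(virtual_addr, page_size)
            self.counts[(page, offset // CHUNK_SIZE)] += 1

        def end_of_epoch(self):
            """Co-locate the hottest chunks in the reserved region for the next epoch."""
            hottest = [chunk for chunk, _ in self.counts.most_common(RESERVED_CHUNKS)]
            self.remapped = {chunk: slot for slot, chunk in enumerate(hottest)}
            self.counts.clear()
            return self.remapped

The second sketches the Chapter 4 batching idea: requests destined for a rank in self-refresh are held until either the batch is large enough to amortize a wake-up or the oldest request approaches a latency bound. Both thresholds below are hypothetical.

    # Illustrative sketch of request batching for a low-power rank.
    import collections

    BATCH_SIZE = 32                  # release once this many requests are pending (placeholder)
    MAX_QUEUE_TIME_NS = 100_000      # latency bound for the oldest request (placeholder)

    class LowPowerRankBatcher:
        def __init__(self):
            self.pending = collections.deque()   # (arrival_ns, request)

        def enqueue(self, now_ns, request):
            self.pending.append((now_ns, request))

        def maybe_release(self, now_ns):
            """Return the batch to issue (waking the rank) or an empty list."""
            if not self.pending:
                return []
            oldest_arrival, _ = self.pending[0]
            if (len(self.pending) >= BATCH_SIZE
                    or now_ns - oldest_arrival >= MAX_QUEUE_TIME_NS):
                batch = [req for _, req in self.pending]
                self.pending.clear()
                return batch
            return []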

6.2 Future Work

Main memory design is a rapidly evolving area within computing systems. Consequently, this work can be extended along several lines, depending on the specific requirements placed on the main memory system. In this section, we discuss possible extensions to this work, organized by three categories of performance and power trade-offs likely to dominate future memory systems.

6.2.1 High-Performance Memory Systems

These memory systems are characterized by their demands for low access latency combined with high bandwidth and capacity. These requirements are combined with “power-proportionality” [43], meaning that the memory system consumes little idle power and that power consumption increases linearly with utilization. Such memory systems are most likely to be used in high-end servers targeting memory-intensive workloads or application-consolidation scenarios. Memory systems targeting this design space can benefit from 3D DRAM stacking to increase total capacity. Though 3D stacking can help with issues like energy per bit, it is not, however, a panacea. The primary concern for such memory systems is reliability, since they are likely to use a large number of memory devices. Intelligent data allocation can improve the reliability of such systems by actively managing replicated and/or versioned copies of data. This, however, adds storage overhead to already capacity-constrained systems and leads to coherence/versioning problems. Tackling these issues with low-overhead schemes is an interesting line of investigation; techniques from the storage domain, like data de-duplication, can be leveraged here with modifications. A number of high-end systems are also emerging with special-purpose accelerators aimed at a small subset of the applications executing on these systems. Managing the memory interaction of these accelerators with the system CPU is another interesting problem where data allocation can help.

6.2.2 Low-Power Memory Systems

This design point is best exemplified by the memory systems of handheld computing devices such as smartphones, tablets, and embedded systems. The requirements here are extremely efficient memory accesses to conserve energy, and the ability to transition quickly out of deep power-saving states. For such systems, controlling the rate of memory requests to stay within power and thermal constraints is likely to be a dominating design parameter. It would be necessary to aggressively manage DRAM accesses, either by throttling memory requests at the CPU or by controlling the rate of requests served by the DRAM devices. Another concern is managing bandwidth among multiple applications/subsystems. For example, many low-power devices (like smartphones) rely on system-on-chip designs that combine CPU and GPU cores on the same die. Since these SoCs typically connect to memory through a shared interface, memory bandwidth partitioning would be necessary to ensure that requests from the CPU and GPU do not interfere excessively. Devising data management schemes that automatically allocate data in DRAM devices to reduce destructive interference is an interesting and challenging problem for this domain.
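As one illustration of how such partitioning might be enforced at a shared memory interface, the sketch below implements a generic token-bucket throttle per requester. This is not a mechanism proposed or evaluated in this dissertation; the rates, burst sizes, and the 60/40 split in the example are made-up values.

    # Hypothetical token-bucket throttle for partitioning memory bandwidth
    # between requesters (e.g., CPU vs. GPU cores on an SoC). All rates and
    # bucket sizes are illustrative only.
    class TokenBucket:
        def __init__(self, tokens_per_us, burst):
            self.rate = tokens_per_us      # sustained requests per microsecond
            self.burst = burst             # maximum accumulated credit
            self.tokens = burst
            self.last_us = 0.0

        def allow(self, now_us):
            """Admit a memory request if the requester still has credit."""
            self.tokens = min(self.burst,
                              self.tokens + (now_us - self.last_us) * self.rate)
            self.last_us = now_us
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

    # Example: give the CPU 60% and the GPU 40% of a (made-up) 10 requests/us budget.
    cpu_bucket = TokenBucket(tokens_per_us=6.0, burst=12)
    gpu_bucket = TokenBucket(tokens_per_us=4.0, burst=8)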

6.2.3 Balanced Memory Systems

This class of designs represents memory systems likely to be found in desktop computing environments. These designs are expected to provide performance between that of the high-performance and low-power memory systems, while being much lower cost and standards-compatible, similar to the commodity parts that dominate the personal computing landscape today. Since such systems are expected to be extremely cost sensitive, a majority of innovations for these designs are expected to center on software support for efficiently managing the memory system. Besides the techniques described in this dissertation, the software support has to be expanded to address new challenges similar to those seen in low-power, highly integrated systems. Since desktop systems are also emerging with more and more functionality integrated onto the CPU (exemplified by the integration of subsystems like graphics, Wi-Fi, etc.), maintaining quality-of-service for each application is a crucial problem to solve here. Demand-based partitioning of the memory address space based on intelligent allocation can help. It is interesting to note that this demand-based partitioning has to occur on the fly, since users can move from running a graphics-intensive application to using a spreadsheet tool, each of which has different memory access characteristics (bandwidth sensitive vs. latency sensitive).

REFERENCES

[1] HP ProLiant DL580 G7 Server Technology. http://goo.gl/aOQ3L.

[2] Load-Reduced DIMMs. http://www.micron.com/products/modules/lrdimm/index.

[3] STREAM - Sustainable Memory Bandwidth in High Performance Computers. http://www.cs.virginia.edu/stream/.

[4] Wind River Simics Full System Simulator. http://www.windriver.com/products/simics/.

[5] Intel 845G/845GL/845GV Chipset Datasheet: Intel 82845G/82845GL/82845GV Graphics and Memory Controller Hub (GMCH), 2002.

[6] Java Server Benchmark, 2005. Available at http://www.spec.org/jbb2005/.

[7] Micron DDR3 SDRAM Part MT41J512M4. http://download.micron.com/pdf/datasheets/dram/ddr3/2Gb_DDR3_SDRAM.pdf, 2006.

[8] Hybrid Memory Cube Consortium, 2011. http://www.hybridmemorycube.org/.

[9] A. Agarwal. Limits on Interconnection Network Performance. IEEE Transactions on Parallel and Distributed Systems, 1991.

[10] J. Ahn, J. Leverich, R. S. Schreiber, and N. Jouppi. Multicore DIMM: an Energy Efficient Memory Module with Independently Controlled DRAMs. IEEE Computer Architecture Letters, 7(1), 2008.

[11] S. Akioka, F. Li, M. Kandemir, and M. J. Irwin. Ring Prediction for Non-Uniform Cache Architectures (Poster). In Proceedings of PACT, 2007.

[12] K. Albayraktaroglu, A. Jaleel, X. Wu, M. Franklin, B. Jacob, C.-W. Tseng, and D. Yeung. BioBench: A Benchmark Suite of Bioinformatics Applications. In Proceedings of ISPASS, 2005.

[13] ALTERA Inc. PCI Express High Performance Reference Design. www.altera.com/literature/an/an456.pdf.

[14] M. Awasthi, K. Sudan, R. Balasubramonian, and J. Carter. Dynamic Hardware-Assisted Software-Controlled Page Placement to Manage Capacity Allocation and Sharing within Large Caches. In Proceedings of HPCA, 2009.

[15] A. Badam and V. Pai. SSDAlloc: Hybrid SSD/RAM Memory Management Made Easy. In Proceedings of NSDI, 2011.

[16] D. Bailey et al. The NAS Parallel Benchmarks. International Journal of Supercomputer Applications, 5(3):63–73, 1994.

[17] L. Barroso and U. Holzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan & Claypool, 2009.

[18] B. Beckmann, M. Marty, and D. Wood. ASR: Adaptive Selective Replication for CMP Caches. In Proceedings of MICRO, 2006.

[19] B. Beckmann and D. Wood. Managing Wire Delay in Large Chip-Multiprocessor Caches. In Proceedings of MICRO-37, December 2004.

[20] D. Berger, J. Chen, F. Ferraiolo, J. Magee, and G. V. Huben. High-speed Source-synchronous Interface for the IBM System z9 Processor. IBM Journal of Research and Development, 2007.

[21] B. Bershad, B. Chen, D. Lee, and T. Romer. Avoiding Conflict Misses Dynamically in Large Direct-Mapped Caches. In Proceedings of ASPLOS, 1994.

[22] C. Bienia et al. The PARSEC Benchmark Suite: Characterization and Architectural Implications. Technical report, Princeton University, 2008.

[23] P. J. Bohrer, J. L. Peterson, E. N. Elnozahy, R. Rajamony, A. Gheith, R. L. Rockhold, C. Lefurgy, H. Shafi, T. Nakra, R. O. Simpson, E. Speight, K. Sudeep, E. V. Hensbergen, and L. Zhang. Mambo: a full system simulator for the PowerPC architecture. SIGMETRICS Performance Evaluation Review, 2004.

[24] J. Carter, W. Hsieh, L. Stoller, M. Swanson, L. Zhang, E. Brunvand, A. Davis, C.-C. Kuo, R. Kuramkote, M. Parker, L. Schaelicke, and T. Tateyama. Impulse: Building a Smarter Memory Controller. In Proceedings of HPCA, 1999.

[25] J. B. Carter and K. Rajamani. Designing energy-efficient servers and data centers. In IEEE Computer, July 2010.

[26] R. Chandra, S. Devine, B. Verghese, A. Gupta, and M. Rosenblum. Scheduling and Page Migration for Multiprocessor Compute Servers. In Proceedings of ASPLOS, 1994.

[27] M. Chaudhuri. PageNUCA: Selected Policies For Page-Grain Locality Management In Large Shared Chip-Multiprocessor Caches. In Proceedings of HPCA, 2009.

[28] S. Cho and L. Jin. Managing Distributed, Shared L2 Caches through OS-Level Page Allocation. In Proceedings of MICRO, 2006.

[29] E. Cooper-Balis, P. Rosenfeld, and B. Jacob. Buffer On Board Memory Systems. In Proceedings of ISCA, 2012.

[30] J. Corbalan, X. Martorell, and J. Labarta. Page Migration with Dynamic Space-Sharing Scheduling Policies: The case of SGI O2000. International Journal of Parallel Programming, 32(4), 2004.

[31] R. Crisp. Direct Rambus Technology: The New Main Memory Standard. In Proceedings of MICRO, 1997.

[32] V. Cuppu and B. Jacob. Concurrency, Latency, or System Overhead: Which Has the Largest Impact on Uniprocessor DRAM-System Performance. In Proceedings of ISCA, 2001.

[33] V. Cuppu, B. Jacob, B. Davis, and T. Mudge. A Performance Comparison of Contemporary DRAM Architectures. In Proceedings of ISCA, 1999.

[34] W. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann, 1st edition, 2003.

[35] W. J. Dally. Performance Analysis of k-ary n-cube Interconnection Networks. IEEE Transactions on Computers, 1988.

[36] V. Delaluz et al. DRAM Energy Management Using Software and Hardware Directed Power Mode Control. In Proceedings of HPCA, 2001.

[37] X. Ding, D. S. Nikolopoulos, S. Jiang, and X. Zhang. MESA: Reducing Cache Conflicts by Integrating Static and Run-Time Methods. In Proceedings of ISPASS, 2006.

[38] X. Dong, N. Muralimanohar, N. Jouppi, R. Kaufmann, and Y. Xie. Leveraging 3D PCRAM Technologies to Reduce Checkpoint Overhead in Future Exascale Systems. In Proceedings of SC, 2009.

[39] X. Dong, X. Wu, G. Sun, Y. Xie, H. Li, and Y. Chen. Circuit and microarchitecture evaluation of 3D stacking magnetic RAM (MRAM) as a universal memory replacement. In Proceedings of DAC, 2008.

[40] M. Ekman and P. Stenström. A case for multi-level main memory. In WMPI, 2004.

[41] M. Ekman and P. Stenström. A Robust Main-Memory Compression Scheme. In Proceedings of ISCA, 2005.

[42] Elpida Memory, Inc. News Release: Elpida, PTI, and UMC Partner on 3D IC Integration Development for Advanced Technologies Including 28nm, 2011. http://www.elpida.com/en/news/2011/05-30.html.

[43] X. Fan, W. Weber, and L. Barroso. Power Provisioning for a Warehouse-sized Computer. In Proceedings of ISCA, 2007.

[44] X. Fan, H. Zeng, and C. Ellis. Memory Controller Policies for DRAM Power Management. In Proceedings of ISLPED, 2001.

[45] Z. Fang, L. Zhang, J. Carter, S. McKee, and W. Hsieh. Online Superpage Promotion Revisited (Poster Session). SIGMETRICS Perform. Eval. Rev., 2000.

[46] B. Ganesh et al. Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads, and Scaling. In Proceedings of HPCA, 2007.

[47] P. Gratz, B. Grot, and S. W. Keckler. Regional Congestion Awareness for Load Balance in Networks-on-Chip. In Proceedings of HPCA, 2008.

[48] D. Gupta, S. Lee, M. Vrable, S. Savage, A. C. Snoeren, G. Varghese, G. M. Voelker, and A. Vahdat. Difference engine: Harnessing memory redundancy in virtual machines. In OSDI, 2008.

[49] H. Hanson and K. Rajamani. What Computer Architects Need to Know About Memory Throttling. In Workshop on Energy Efficient Design, co-located with ISCA, 2010.

[50] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Reactive NUCA: Near- Optimal Block Placement And Replication In Distributed Caches. In Proceedings of ISCA, 2009.

[51] J. L. Henning. SPEC CPU2006 Benchmark Descriptions. ACM SIGARCH Computer Architecture News, 2005.

[52] J. Howard, S. Dighe, S. Vangal, G. Ruhl, S. J. N. Borkar, V. Erraguntla, M. Konow, M. Riepen, M. Gries, G. Droege, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, and R. V. D. Wijngaart. A 48-Core IA-32 Processor in 45 nm CMOS Using On-Die Message-Passing and DVFS for Performance and Power Scaling. IEEE JSSC, January 2011.

[53] J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. V. D. Wijngaart, and T. Mattson. A 48-Core IA-32 Message-Passing Processor with DVFS in 45nm CMOS. In Proceedings of ISSCC, 2010.

[54] H. Huang, P. Pillai, and K. G. Shin. Design and Implementation of Power-Aware Virtual Memory. In Proceedings of the USENIX Annual Technical Conference, 2003.

[55] H. Huang, K. Shin, C. Lefurgy, and T. Keller. Improving Energy Efficiency by Making DRAM Less Randomly Accessed. In Proceedings of ISLPED, 2005.

[56] H. Huang, K. Shin, C. Lefurgy, K. Rajamani, T. Keller, E. Hensbergen, and F. Rawson. Software-Hardware Cooperative Power Management for Main Memory. 2005.

[57] IBM. IBM Power 795: Technical Overview and Introduction, 2010.

[58] Intel. Intel 7500/7510/7512 Scalable Memory Buffer (Datasheet). Technical report, 2011.

[59] B. Jacob, S. W. Ng, and D. T. Wang. Memory Systems - Cache, DRAM, Disk. Elsevier, 2008.

[60] J. W. Jang, M. Jeon, H. S. Kim, H. Jo, J. S. Kim, and S. Maeng. Energy Reduction in Consolidated Servers Through Memory-aware Virtual Machine Scheduling. IEEE Transactions on Computers, 2010.

[61] J. Jeddeloh and B. Keeth. Hybrid Memory Cube – New DRAM Architecture Increases Density and Performance. In Symposium on VLSI Technology, 2012.

[62] JEDEC. JESD79: Double Data Rate (DDR) SDRAM Specification. JEDEC Solid State Technology Association, Virginia, USA, 2003.

[63] JEDEC. JESD79-3D: DDR3 SDRAM Standard, 2009.

[64] Y. Jin, E. J. Kim, and K. H. Yum. A Domain-Specific On-Chip Network Design for Large Scale Cache Systems. In Proceedings of HPCA, 2007.

[65] N. Jouppi. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers. In Proceedings of ISCA-17, pages 364–373, May 1990.

[66] A. Kahng, B. Li, L.-S. Peh, and K. Samadi. ORION 2.0: A Fast and Accurate NoC Power and Area Model for Early-Stage Design Space Exploration. In Proceedings of DATE, 2009.

[67] R. E. Kessler and M. D. Hill. Page Placement Algorithms for Large Real-Indexed Caches. ACM Trans. Comput. Syst., 10(4), 1992.

[68] C. Kim, D. Burger, and S. Keckler. An Adaptive, Non-Uniform Cache Structure for Wire-Dominated On-Chip Caches. In Proceedings of ASPLOS, 2002.

[69] D. E. Knuth. The Art of Computer Programming: Fundamental Algorithms, volume 1. Addison-Wesley, third edition, 1997.

[70] C. Kozyrakis. Memory Management Beyond Free(). In Proceedings of the International Symposium on Memory Management, ISMM '11, 2011.

[71] C. Kozyrakis, A. Kansal, S. Sankar, and K. Vaid. Server engineering insights for large-scale online services. IEEE Micro, 30(4):8–19, July-Aug. 2010.

[72] R. LaRowe and C. Ellis. Experimental Comparison of Memory Management Policies for NUMA Multiprocessors. Technical report, 1990.

[73] R. LaRowe and C. Ellis. Page Placement policies for NUMA multiprocessors. J. Parallel Distrib. Comput., 11(2), 1991.

[74] R. LaRowe, J. Wilkes, and C. Ellis. Exploiting Operating System Support for Dynamic Page Placement on a NUMA Shared Memory Multiprocessor. In Proceedings of PPOPP, 1991.

[75] A. Lebeck, X. Fan, H. Zeng, and C. Ellis. Power Aware Page Allocation. In Proceedings of ASPLOS, 2000.

[76] C. Lefurgy et al. Energy management for commercial servers. IEEE Computer, 36(2):39–48, 2003.

[77] C. E. Leiserson. Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing. IEEE Trans. Comput., 34(10), Oct. 1985.

[78] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K. Reinhardt, and T. F. Wenisch. Disaggregated Memory for Expansion and Sharing in Blade Servers. In Proceedings of ISCA, 2009.

[79] K. Lim, Y. Turner, J. Santos, A. AuYoung, J. Chang, P. Ranganathan, and T. F. Wenisch. System Level Implications of Disaggregated Memory. In Proceedings of HPCA, 2012.

[80] W. Lin, S. Reinhardt, and D. Burger. Designing a Modern Memory Hierarchy with Hardware Prefetching. IEEE Transactions on Computers, 2001.

[81] S. Liu, K. Pattabiraman, T. Moscibroda, and B. G. Zorn. Flicker: Saving Refresh Power in Mobile Devices through Critical Data Partitioning. In Proceedings of ASPLOS, 2011.

[82] C. K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. In Proceedings of PLDI, 2005.

[83] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A Full System Simulation Platform. IEEE Computer, 35(2):50–58, February 2002.

[84] Micron Technology Inc. Micron DDR2 SDRAM Part MT47H64M8, 2004.

[85] Micron Technology Inc. Micron DDR3 SDRAM Part MT41J64M16, 2006.

[86] R. Min and Y. Hu. Improving Performance of Large Physically Indexed Caches by Decoupling Memory Addresses from Cache Addresses. IEEE Trans. Comput., 50(11), 2001.

[87] N. Muralimanohar, R. Balasubramonian, and N. Jouppi. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. In Proceedings of MICRO, 2007.

[88] D. G. Murray, S. Hand, and M. A. Fetterman. Satori: Enlightened page sharing. In Proceedings of the USENIX Annual Technical Conference, 2009.

[89] O. Mutlu and T. Moscibroda. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In Proceedings of MICRO, 2007.

[90] B. Mutnury et al. Analysis of Fully Buffered DIMM Interface in High-Speed Server Applications. In Proceedings of Electronic Components and Technology Conference, 2006.

[91] J. Navarro, S. Iyer, P. Druschel, and A. Cox. Practical, Transparent Operating System Support For Superpages. SIGOPS Oper. Syst. Rev., 2002.

[92] O. Mutlu and T. Moscibroda. Parallelism-Aware Batch Scheduling - Enhancing Both Performance and Fairness of Shared DRAM Systems. In Proceedings of ISCA, 2008.

[93] K. Oh and X. Yuan. High-Speed Signaling: Jitter Modeling, Analysis, and Budgeting. Pearson Technology Group, 2011.

[94] J. Ousterhout et al. The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM. SIGOPS Operating Systems Review, 43(4), 2009.

[95] T. Pawlowski. Hybrid Memory Cube (HMC). In HotChips, 2011.

[96] L.-S. Peh and W. Dally. A Delay Model and Speculative Architecture for Pipelined Routers. In Proceedings of HPCA, 2001.

[97] N. Rafique, W. Lim, and M. Thottethodi. Architectural Support for Operating System Driven CMP Cache Management. In Proceedings of PACT, 2006.

[98] Rambus Inc. Rambus XDR Architecture. http://www.rambus.com.

[99] L. Ramos, E. Gorbatov, and R. Bianchini. Page Placement in Hybrid Memory Systems. In Proceedings of ICS, 2011.

[100] S. Rixner, W. Dally, U. Kapasi, P. Mattson, and J. Owens. Memory Access Scheduling. In Proceedings of ISCA, 2000.

[101] T. Romer, W. Ohlrich, A. Karlin, and B. Bershad. Reducing TLB and Memory Overhead Using Online Superpage Promotion. In Proceedings of ISCA-22, 1995.

[102] G. Sandhu. DRAM Scaling and Bandwidth Challenges. In NSF Workshop on Emerging Technologies for Interconnects (WETI), 2012.

[103] T. Sherwood, B. Calder, and J. Emer. Reducing Cache Misses Using Hardware and Software Page Placement. In Proceedings of SC, 1999.

[104] B. Sinharoy et al. IBM Power7 Multicore Server Processor. IBM Journal of Research and Development, 55(3), 2011.

[105] A. Snavely, D. Tullsen, and G. Voelker. Symbiotic Jobscheduling with Priorities for a Simultaneous Multithreading Processor. In Proceedings of SIGMETRICS, 2002.

[106] D. R. Stauffer, J. T. Mechler, M. Sorna, K. Dramstad, C. R. Ogilvie, A. Mohammad, and J. Rockrohr. High Speed Serdes Devices and Applications. Springer Publishing, 2008.

[107] K. Sudan, N. Chatterjee, D. Nellans, M. Awasthi, R. Balasubramonian, and A. Davis. Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement. In Proceedings of ASPLOS-XV, 2010.

[108] K. Sudan, K. Rajamani, W. Huang, and J. B. Carter. Tiered Memory: An Iso-Power Memory Architecture to Address the Memory Power Wall. IEEE Transactions on Computers, 2012.

[109] M. Swanson, L. Stoller, and J. Carter. Increasing TLB Reach using Superpages Backed by Shadow Memory. In Proceedings of ISCA, 1998.

[110] Synopsys Inc. Synopsys’ DesignWare IP for PCI Express with Support for Low-Power Sub-States. synopsys.mediaroom.com/index.php?s=43&item=1044.

[111] M. Talluri and M. D. Hill. Surpassing the TLB Performance of Superpages with Less Operating System Support. In Proceedings of ASPLOS, 1994.

[112] Tezzaron Semiconductor. 3D Stacked DRAM/Bi-STAR Overview, 2011. http://www.tezzaron.com/memory/Overview_3D_DRAM.htm.

[113] K. Therdsteerasukdi, G.-S. Byun, J. Ir, G. Reinman, J. Cong, and M. Chang. The DIMM tree architecture: A high bandwidth and scalable memory system. In Proceedings of ICCD, 2011.

[114] S. Thoziyoor, N. Muralimanohar, and N. Jouppi. CACTI 5.0. Technical report, HP Laboratories, 2007.

[115] B. Verghese, S. Devine, A. Gupta, and M. Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. SIGPLAN Not., 31(9), 1996.

[116] C. A. Waldspurger. Memory resource management in VMware ESX Server. In OSDI, 2002.

[117] D. Wallin, H. Zeffer, M. Karlsson, and E. Hagersten. VASA: A Simulator Infrastructure with Adjustable Fidelity. In Proceedings of IASTED International Conference on Parallel and Distributed Computing and Systems, 2005.

[118] D. Wang et al. DRAMsim: A Memory-System Simulator. In SIGARCH Computer Architecture News, September 2005.

[119] H.-S. Wang, L.-S. Peh, and S. Malik. Power-Driven Design of Router Microarchitectures in On-Chip Networks. In Proceedings of MICRO, 2003.

[120] M. Ware, K. Rajamani, M. Floyd, B. Brock, J. C. Rubio, F. Rawson, and J. B. Carter. Architecting for power management: The IBM POWER7 approach. In The 16th IEEE International Symposium on High-Performance Computer Architecture (HPCA-16), January 2010.

[121] D. Watts, D. Furniss, S. Haddow, J. Jervay, E. Kern, and C. Knight. IBM eX5 Portfolio Overview: IBM System x3850 X5, x3950 X5, x3690 X5, and BladeCenter HX5. www.redbooks.ibm.com/abstracts/redp4650.html.

[122] D. Wu, B. He, X. Tang, J. Xu, and M. Guo. RAMZzz: Rank-Aware DRAM Power Management with Dynamic Migrations and Demotions. In Proceedings of Supercomputing, 2012.

[123] D. Ye, A. Pavuluri, C. A. Waldspurger, B. Tsang, B. Rychlik, and S. Woo. Prototyping a hybrid main memory using a virtual machine monitor. In ICCD, 2008.

[124] D. H. Yoon, J. Chang, N. Muralimanohar, and P. Ranganathan. BOOM: Enabling Mobile Memory Based Low-Power Server DIMMs. In Proceedings of ISCA, 2012.

[125] W. Zhang and T. Li. Characterizing and Mitigating the Impact of Process Variations on Phase Change based Memory Systems. In Proceedings of MICRO, 2009.

[126] X. Zhang, S. Dwarkadas, and K. Shen. Hardware Execution Throttling for Multi-core Resource Management. In Proceedings of USENIX, 2009.

[127] Z. Zhang, Z. Zhu, and X. Zhang. A Permutation-Based Page Interleaving Scheme to Reduce Row-Buffer Conflicts and Exploit Data Locality. In Proceedings of MICRO, 2000.

[128] H. Zheng et al. Mini-Rank: Adaptive DRAM Architecture For Improving Memory Power Efficiency. In Proceedings of MICRO, 2008.

[129] H. Zheng, J. Lin, Z. Zhang, and Z. Zhu. Decoupled DIMM: Building High-Bandwidth Memory System from Low-Speed DRAM Devices. In Proceedings of ISCA, 2009.

[130] Y. Zhou, J. Philbin, and K. Li. The Multi-Queue Replacement Algorithm for Second Level Buffer Caches. In Proceedings of the General Track: 2002 USENIX Annual Technical Conference, 2001.

[131] Y. Zhou, J. Philbin, and K. Li. The Multi-Queue Replacement Algorithm for Second Level Buffer Caches. In Proceedings of USENIX, 2001.

[132] Z. Zhu and Z. Zhang. A Performance Comparison of DRAM Memory System Optimizations for SMT Processors. In Proceedings of HPCA, 2005.