Practical Data Compression for Modern Memory Hierarchies
Gennady G. Pekhimenko

CMU-CS-16-116
July 2016

Computer Science Department
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Thesis Committee:
Todd C. Mowry, Co-Chair
Onur Mutlu, Co-Chair
Kayvon Fatahalian
David A. Wood, University of Wisconsin-Madison
Douglas C. Burger, Microsoft
Michael A. Kozuch, Intel

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2016 Gennady G. Pekhimenko

This research was sponsored by the National Science Foundation under grant numbers CNS-0720790, CCF-0953246, CCF-1116898, CNS-1423172, CCF-1212962, CNS-1320531, CCF-1147397, and CNS-1409723, the Defense Advanced Research Projects Agency, the Semiconductor Research Corporation, the Gigascale Systems Research Center, Intel URO Memory Hierarchy Program, Intel ISTC-CC, and gifts from AMD, Google, IBM, Intel, Nvidia, Oracle, Qualcomm, Samsung, and VMware. We also acknowledge the support through PhD fellowships from Nvidia, Microsoft, Qualcomm, and NSERC.

The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government or any other entity.

Keywords: Data Compression, Memory Hierarchy, Cache Compression, Memory Compression, Bandwidth Compression, DRAM, Main Memory, Memory Subsystem

Abstract

Although compression has been widely used for decades to reduce file sizes (thereby conserving storage capacity and network bandwidth when transferring files), there has been limited use of hardware-based compression within modern memory hierarchies of commodity systems. Why not?
Especially as programs become increasingly data-intensive, the capacity and bandwidth within the memory hierarchy (including caches, main memory, and their associated interconnects) have already become increasingly important bottlenecks. If hardware-based data compression could be applied successfully to the memory hierarchy, it could potentially relieve pressure on these bottlenecks by increasing effective capacity, increasing effective bandwidth, and even reducing energy consumption.

In this thesis, we describe a new, practical approach to integrating hardware-based data compression within the memory hierarchy, including on-chip caches, main memory, and both on-chip and off-chip interconnects. This new approach is fast, simple, and effective in saving storage space. A key insight in our approach is that access time (including decompression latency) is critical in modern memory hierarchies. By combining inexpensive hardware support with modest OS support, our holistic approach to compression achieves substantial improvements in performance and energy efficiency across the memory hierarchy.

Using this new approach, we make several major contributions in this thesis. First, we propose a new compression algorithm, Base-Delta-Immediate Compression (B∆I), that achieves high compression ratio with very low compression/decompression latency. B∆I exploits the existing low dynamic range of values present in many cache lines to compress them to smaller sizes using Base+Delta encoding.

Second, we observe that the compressed size of a cache block can be indicative of its reuse. We use this observation to develop a new cache insertion policy for compressed caches, the Size-based Insertion Policy (SIP), which uses the size of a compressed block as one of the metrics to predict its potential future reuse.
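
The Base+Delta encoding mentioned above can be illustrated with a small software sketch. This is not the hardware design proposed in the thesis: it is a simplified model that assumes a single base (taken to be the first value in the line) and a fixed delta width, and the function names `base_delta_compress`, `base_delta_decompress`, and `compressed_size` are hypothetical, chosen only for this illustration.

```python
def base_delta_compress(values, delta_bytes=1):
    """Try to encode values as one base plus narrow signed deltas.

    Returns (base, deltas) if every delta fits in delta_bytes bytes,
    otherwise None (the line is left uncompressed in this sketch).
    """
    if not values:
        return None
    base = values[0]  # assumption: the first value serves as the base
    lo = -(1 << (8 * delta_bytes - 1))     # smallest signed delta
    hi = (1 << (8 * delta_bytes - 1)) - 1  # largest signed delta
    deltas = [v - base for v in values]
    if all(lo <= d <= hi for d in deltas):
        return base, deltas
    return None


def base_delta_decompress(base, deltas):
    """Rebuild the original values with one addition per value."""
    return [base + d for d in deltas]


def compressed_size(n_values, value_bytes=8, delta_bytes=1):
    """Bytes needed for one full-width base plus n_values narrow deltas."""
    return value_bytes + n_values * delta_bytes


# A 64-byte line holding eight 8-byte pointers into the same region:
# the values are large, but the differences between them are small.
line = [0x7FFF0000 + 8 * i for i in range(8)]
encoded = base_delta_compress(line, delta_bytes=1)
if encoded is not None:
    base, deltas = encoded
    print(hex(base), deltas, compressed_size(len(line)), "bytes")
```

In this sketch the eight pointers are stored as one 8-byte base and eight 1-byte deltas, so the 64-byte line needs 16 bytes. Decompression is a single addition per value, which gives a sense of why an encoding of this kind can have low decompression latency.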

Third, we propose a new main memory compression framework, Linearly Compressed Pages (LCP), that significantly reduces the complexity and power cost of supporting main memory compression. We demonstrate that any compression algorithm can be adapted to fit the requirements of LCP, and that LCP can be efficiently integrated with the existing cache compression designs, avoiding extra compression/decompression.

Finally, in addition to exploring compression-related issues and enabling practical solutions in modern CPU systems, we discover new problems in realizing hardware-based compression for GPU-based systems and develop new solutions to solve these problems.

Acknowledgments

First of all, I would like to thank my advisers, Todd Mowry and Onur Mutlu, for always trusting me in my research experiments, giving me enough resources and opportunities to improve my work, as well as my presentation and writing skills. I am grateful to Michael Kozuch and Phillip Gibbons for being both my mentors and collaborators. I am grateful to the members of my PhD committee: Kayvon Fatahalian, David Wood, and Doug Burger for their valuable feedback and for making the final steps towards my PhD very smooth. I am grateful to Deb Cavlovich who allowed me to focus on my research by magically solving all other problems.

I am grateful to SAFARI group members who were more than just lab mates. Vivek Seshadri was always supportive of my crazy ideas and was willing to dedicate his time and energy to help me in my work. Chris Fallin was a rare example of pure smartness mixed with great work ethic, but still always had time for an interesting discussion. From Yoongu Kim I learned a lot about the importance of details, and hopefully I learned something from his aesthetic sense as well. Lavanya Subramanian was my fellow cubicle mate who showed me an example of how to successfully mix work with personal life and how to be supportive of others.
Justin Meza helped me to improve my presentation and writing skills in a very friendly manner (as everything else he does). Donghyuk Lee taught me everything I know about DRAM and was always an example of work dedication for me. Nandita Vijaykumar was my mentee, collaborator, and mentor all at the same time, but, most importantly, a friend who was always willing to help. Rachata Ausavarungnirun was our food guru and one of the most reliable and friendly people in the group. Hongyi Xin reminded me about everything I almost forgot from biology and history classes, and also taught me everything I know now in the amazing field of bioinformatics. Kevin Chang and Kevin Hsieh were always helpful and supportive when it mattered most. Samira Khan was always available for a friendly chat when I really needed it. Saugata Ghose was my rescue guy during our amazing trip to Prague. I also thank other members of the SAFARI group for their assistance and support: HanBin Yoon, Jamie Liu, Ben Jaiyen, Yixin Luo, Yang Li, and Amirali Boroumand. Michelle Goodstein, Olatunji Ruwase and Evangelos Vlachos, senior PhD students, shared their experience and provided a lot of feedback early in my career. I am grateful to Tyler Huberty and Rui Cai for contributing a lot to my research and for being excellent undergraduate/master's researchers who selected me as a mentor from all the other options they had.

During my time at Carnegie Mellon, I met a lot of wonderful people: Michael Papamichael, Gabe Weisz, Alexey Tumanov, Danai Koutra and many others who helped and supported me in many different ways. I am also grateful to people at PDL and CALCM groups for accepting me in their communities.

I am grateful to my internship mentors for making my work in their companies mutually successful for both sides. At Microsoft Research, I had the privilege to closely work with Karin Strauss, Dimitrios Lymberopoulos, Oriana Riva, Ella Bounimova, Patrice Godefroid, and David Molnar.
At NVIDIA Research, I had the privilege to closely work with Evgeny Bolotin, Steve Keckler, and Mike O’Connor. I am also grateful to my amazing collaborators from Georgia Tech: Hadi Esmaeilzadeh, Amir Yazdanbaksh, and Bradley Thwaites.

And last, but not least, I would like to acknowledge the enormous love and support that I received from my family: my wife Daria and our daughter Alyssa, my parents: Gennady and Larissa, and my brother Evgeny.

Contents

Abstract
Acknowledgments
1 Introduction
    1.1 Focus of This Dissertation: Efficiency of the Memory Hierarchy
        1.1.1 A Compelling Possibility: Compressing Data throughout the Full Memory Hierarchy
        1.1.2 Why Traditional Data Compression Is Ineffective for Modern Memory Systems
    1.2 Related Work
        1.2.1 3D-Stacked DRAM Architectures
        1.2.2 In-Memory Computing
        1.2.3 Improving DRAM Performance
        1.2.4 Fine-grain Memory Organization and Deduplication
        1.2.5 Data Compression for Graphics
        1.2.6 Software-based Data Compression
        1.2.7 Code Compression
        1.2.8 Hardware-based Data Compression
    1.3 Thesis Statement: Fast and Simple Compression throughout the Memory Hierarchy
    1.4 Contributions
2 Key Challenges for Hardware-Based Memory Compression
    2.1 Compression and Decompression Latency
        2.1.1 Cache Compression
        2.1.2 Main Memory
        2.1.3 On-Chip/Off-chip Buses
    2.2 Quickly Locating Compressed Data
    2.3 Fragmentation
    2.4 Supporting Variable Size after Compression
    2.5 Data Changes after Compression
    2.6 Summary of Our Proposal
3 Base-Delta-Immediate Compression
    3.1 Introduction