A Formal Approach to Memory Access Optimization: Data Layout, Reorganization, and Near-Data Processing

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering

BERKIN AKIN

B.S., Electrical and Electronics Engineering, Middle East Technical University
M.S., Electrical and Computer Engineering, Carnegie Mellon University

Carnegie Mellon University Pittsburgh, PA, USA

July 2015

Copyright © 2015 Berkin Akın

Abstract

The memory system is a major bottleneck in achieving high performance and energy efficiency for various processing platforms. This thesis aims to improve the memory performance and energy efficiency of data intensive applications through a two-pronged approach which combines a formal representation framework and a hardware substrate that can efficiently reorganize data in memory.

The proposed formal framework enables representing and systematically manipulating data layout formats, address mapping schemes, and memory access patterns through permutations to exploit the locality and parallelism in memory. Driven by the implications from the formal framework, this thesis presents the HAMLeT architecture for highly-concurrent, energy-efficient and low-overhead data reorganization performed completely in memory. Although data reorganization simply relocates data in memory, it is costly on conventional systems mainly due to inefficient access patterns, limited data reuse, and roundtrip data traversal throughout the memory hierarchy. HAMLeT pursues a near-data processing approach exploiting the 3D-stacked DRAM technology. Integrated in the logic layer and interfaced directly to the local controllers, it takes advantage of the internal fine-grain parallelism, high bandwidth and locality which are inaccessible otherwise. Its parallel streaming architecture can extract high throughput from stringent power, area, and thermal budgets.

The thesis evaluates the efficient data reorganization capability provided by HAMLeT through several fundamental use cases. First, it demonstrates software-transparent data reorganization performed in memory to improve memory access performance. A proposed hardware monitoring mechanism detects inefficient memory usage and issues a data reorganization to adopt an optimized data layout and address mapping for the observed memory access patterns. This mechanism operates transparently and does not require any changes to the user software; HAMLeT handles the remapping and its side effects completely in hardware. Second, HAMLeT provides an efficient substrate to explicitly reorganize data in memory. This gives the ability to offload and accelerate common data reorganization routines observed in high-performance libraries (e.g., matrix transpose, scatter/gather, permutation, pack/unpack, etc.). Third, explicitly performed data reorganization enables considering the data layout and address mapping as a part of the algorithm design space.

Exposing these memory characteristics to the algorithm design space creates opportunities for algorithm/architecture co-design. Co-optimized computation flow, memory accesses, and data layout lead to new algorithms that are conventionally avoided.

Acknowledgement

I owe all my success to my great advisors Professor James Hoe and Professor Franz Franchetti.

Their support, advice and expertise made this work possible, and I am very thankful for their time and effort. James has tremendous experience and expertise in addition to his great personality and sense of humor. Although this thesis was "done in his head" long before, it took me a while to "make it perfect". Franz is one of the smartest people I have ever met. He always had brilliant ideas and creative solutions on every subject. Our long and productive discussions, made enjoyable thanks to his unique wit, contributed to a huge portion of this thesis. To me, James and Franz are both excellent advisors and good friends.

I would like to thank my thesis committee members Professor Larry Pileggi and Professor Steve Keckler. Larry provided me great support and feedback throughout my PhD life, including my thesis. I'm thankful to him for giving me the opportunity to access the state-of-the-art silicon technology and also teaching me how to use it. His collaboration and efforts were really valuable to me. Steve provided helpful feedback, constructive discussions and great architecture insights which made this thesis possible. I'm grateful for his time and effort.

This work was partially sponsored by the Defense Advanced Research Projects Agency (DARPA) PERFECT program under agreement HR0011-13-2-0007. I'm thankful for their generous support of this research.

I'd like to also thank all the SPIRAL and CALCM team members at CMU. First, I'm thankful to Peter Milder for introducing me to his hardware infrastructure in SPIRAL, and for his help and collaboration during my first years at CMU. I'd like to also thank Michael Papamichael and Gabe Weisz for their company in the office and at various conferences, and also for their useful discussions on many research and non-research topics. Special thanks to Michael for his suggestions on many "how to fix a VW Golf" topics. I had the honor to work with great lab mates, researchers and postdocs in SPIRAL and CALCM including Nikos Alachiotis, Christos Angelopoulos, Kevin Chang, Eric Chung, Qi Guo, Yoongu Kim, Peter Klemperer, Bob Koutsoyannis, Tze Meng Low, Joe Melber, Marie Nguyen, Eriko Nurvitadhi, Thom Popovici, Fazle Sadi, Lavanya Subramanian, Richard Veras, Yu Wang, Guanglin Xu, Jiyuan Zhang, Zhipeng Zhao, Qiuling Zhu, and many more.

I would also like to thank the excellent professors at CMU whom I got the chance to meet and work with. Especially, I'd like to thank Professor Onur Mutlu for teaching me advanced computer architecture, and Professor Jose Moura and Professor Markus Puschel for our collaborations in SPIRAL; special thanks to Markus for being a great host in Zurich.

I thank all of the brilliant ECE staff, especially Claire Bauerle and Samantha Goldstein who made administrative tasks run smoothly.

I feel lucky to be around great friends and fellow graduate students in Pittsburgh. There are too many to list, but special thanks go to Ekin Sumbul and Cagla Cakir for their answers to yet another of my circuits questions, and to Onur Albayrak, Melis Hazar, Meric Isgenc, Tugce Ozturk, Mert Terzi, Sercan Yildiz, and many more. Thank you all for making life here so much fun.

Trying to push processing "near" memory was a bit difficult, when I was "far away" from my family.

I cannot express in words my gratitude to my parents Adil Akin and Leyla Akin, and my sister Gizem Akin. Although they will not understand much of it, this thesis would be impossible without their love and support.

Finally, I’d like to thank Sila Gulgec, my "home" here, for her continuous love and support. She was always with me both in the most stressful and the happiest times throughout this journey. This thesis would be meaningless without her.

Contents

1 Introduction 1

1.1 Motivation...... 1

1.2 Thesis Approach and Contributions...... 3

1.3 Thesis Organization...... 6

2 Background 9

2.1 The Main Memory System...... 9

2.1.1 DRAM Organization and Operation...... 9

2.1.2 3D-stacked DRAM Technology...... 14

2.2 Hitting the Memory and Power Walls...... 18

2.3 Near Data Processing...... 20

2.3.1 3D Stacking Based Near Data Processing...... 21

2.3.2 Hardware Specialization for NDP...... 22

2.4 Data Layout and Reorganization...... 23

3 A Formal Approach to Memory Access Optimization 29

3.1 Mathematical Framework...... 29

3.1.1 Formula Representation of Permutations...... 30

3.1.2 Formula Identities...... 33

3.2 Permutation as a Data Reorganization Primitive...... 34

3.3 Index Transformation...... 35

3.4 A Formula Rewrite System for Permutations...... 40

3.4.1 Permutation Rewriting...... 41

3.4.2 Labelled Formula...... 42

3.4.3 Case Study...... 43

3.5 Practical Implications...... 45

3.6 Spiral Based Toolchain...... 47

4 HAMLeT Architecture for Data Reorganization 49

4.1 Data Reorganization Unit...... 50

4.1.1 Reconfigurable Permutation Memory...... 51

4.1.2 DMA Units...... 56

4.2 Address Remapping Unit...... 58

4.2.1 Bit Shuffle Unit...... 59

4.2.2 Supporting Multiple Remappings...... 59

4.3 Configuring HAMLeT...... 61

5 Fundamental Use Cases 65

5.1 Automatic Mode...... 65

5.1.1 Changing Address Mapping...... 67

5.1.2 Physical Data Reorganization...... 67

5.1.3 Memory Access Monitoring...... 69

5.1.4 Host Application and Reorganization in Parallel...... 70

5.2 Explicit Mode...... 71

5.2.1 Spiral Based Toolchain...... 72

5.2.2 Explicit Memory Management...... 73

6 System Architecture and Integration 75

6.1 Memory Side Architecture...... 76

6.1.1 Interface and Design Space...... 76

6.1.2 Implementing Memory Access Monitors...... 80

6.2 Handling Parallel Host Access and Reorganization...... 82

6.2.1 Block on Reorganization (BoR)...... 83

6.2.2 Access on Reorganization (AoR)...... 83

6.2.3 Handling Memory Coherence...... 86

6.3 Host Architecture/Software Support...... 89

6.3.1 Memory Management for NDP...... 90

6.4 Putting It Together...... 92

7 Evaluation 95

7.1 Experimental Procedure...... 96

7.1.1 3D-stacked DRAM Modeling...... 96

7.2 Automatic Mode...... 98

7.2.1 Dynamic Data Reorganization Overview...... 98

7.2.2 Address Remapping: Manual Search vs. Hardware Monitoring...... 100

7.2.3 Tuning the DRU Design Parameters...... 103

7.2.4 Multi-program Workloads...... 106

7.2.5 Host Access and Data Reorganization in Parallel...... 108

7.3 Explicit Mode...... 115

7.3.1 Accelerating Common Data Reorganization Routines...... 115

7.3.2 Near-memory vs. On-chip...... 119

7.3.3 Offload and Reconfiguration Overhead...... 120

7.4 Hardware Synthesis...... 122

7.4.1 Reconfigurable Permutation Memory...... 122

7.4.2 Overall System...... 123

8 Co-optimizing Compute and Memory Access 125

8.1 Fast Fourier Transform...... 126

8.2 Machine Model...... 129

8.3 Block Data Layout FFTs...... 130

8.4 Formally Restructured Algorithms...... 132

8.5 Algorithm and Architecture Design Space...... 136

8.5.1 Automated Design Generation via Spiral...... 136

8.5.2 Formula to Hardware...... 137

8.5.3 Design Space Parameters...... 139

8.6 Experimental Results...... 140

8.6.1 Machine Model Based Evaluation...... 140

8.6.2 Design Space Exploration...... 142

8.6.3 Explicit Data Reorganization...... 149

9 Related Work 151

9.1 Planar Near Data Processing...... 151

9.2 3D-stacking Based NDP...... 152

9.3 Software Based Data Layout Transformation...... 154

9.4 Hardware Assisted Data Reorganization...... 154

9.5 Kronecker Product Formalism for Hardware Design...... 156

10 Conclusions and Future Directions 157

10.1 Future Directions...... 158

10.1.1 Data Reorganization for Irregular Applications...... 158

10.1.2 General Purpose NDP...... 158

10.1.3 Graph Traversal in 3D-stacked DRAM...... 160

Bibliography 165

List of Figures

2.1 A single chip heterogeneous multicore computing system...... 10

2.2 Overview of an off-chip planar DRAM module organization...... 11

2.3 Locality/parallelism vs. bandwidth, power and energy tradeoffs for DDR3-1600 DRAM (1.5V, single rank, 8 bank, x8 width) [7]...... 11

2.4 Various address mapping schemes including line interleaving (bank and channel) and row interleaving...... 13

2.5 Micron’s hybrid memory cube (HMC). Reprinted from [94]...... 14

2.6 Overview of a HMC-like 3D-stacked DRAM [94]...... 14

2.7 AMD's high bandwidth memory (HBM). Reprinted from [1]...... 16

2.8 System architecture using silicon interposer based connection...... 18

2.9 System architecture using point-to-point SerDes based link connection...... 18

2.10 Accelerator integration to a 3D-stacked DRAM: (1) Off-stack, (2) stacked as a separate layer, (3) integrated in the logic layer...... 22

2.11 (a) Normalized bit flip rate in the physical address stream, (b) simple address remapping scheme and (c) row buffer miss rate reduction via the address remapping and (d) the resulting performance improvements for 8-wide and infinite compute power systems for facesim from PARSEC...... 25

3.1 An example data reorganization (L8,2) on 8 elements and the corresponding address remapping scheme...... 34

3.2 Dataflow representation of (L4,2 ⊗I2)⊕(I2 ⊗J4)...... 39

3.3 Overview of the mathematical framework toolchain...... 48

4.1 Data reorganization unit (DRU)...... 50

4.2 Overview of the reconfigurable permutation memory...... 53

4.3 Two examples for 8x8 rearrangeably non-blocking multi-stage switch networks... 54

4.4 Permutation memory compilation flow...... 55

4.5 Detailed view of the SRAM address and connection generator units...... 56

4.6 DRAM address generator in a w-wide DMA unit...... 57

4.7 Bit permutation of an example data reorganization...... 57

4.8 Bit shuffle unit (BSU)...... 60

4.9 Address remapping unit (ARU) supporting multiple address mappings for different memory regions...... 61

4.10 ARU with multiple BSUs...... 61

4.11 From permutation specification to HAMLeT configuration...... 62

5.1 Virtual address to DRAM coordinates (i.e. channel, bank, row, etc.) translation stages. 67

5.2 In-place and out-of-place data reorganization schemes...... 69

6.1 Accelerator integration options to a 3D-stacked DRAM based system for near-data processing: (1) Off-stack, (2) stacked as a separate layer, (3) integrated in the logic layer...... 76

6.2 Logic layer architecture for integrating HAMLeT into the 3D-stacked DRAM.... 79

6.3 Physical address bit-flip rate (BFR) monitoring unit for multiple memory regions.. 80

6.4 An example bit-flip histogram of a memory access stream (facesim from PARSEC [32])...... 81

6.5 Overview of the request batching with three sources...... 85

6.6 Handling the host memory accesses in the multi-stage reorganization scheme.... 88

6.7 Overall view of the fundamental use cases: From specification to in-memory reorganization...... 93

7.1 Row buffer miss rate reduction via physical address remapping with data reorganization...... 99

7.2 Performance and energy improvements via physical address remapping with data reorganization...... 100

7.3 Improvements with address remappings where only a single bit pair is swapped... 101

7.4 Comparing the best address remapping schemes that are manually searched and determined by hardware monitoring...... 102

7.5 Selecting an efficient BFR ratio threshold...... 103

7.6 Selecting an efficient epoch length...... 104

7.7 Effect of number of memory partitions (1 to 4096) to the speedup for single program workloads. (Normalized to no partitioning)...... 105

7.8 Selecting an efficient partitioning scheme...... 106

7.9 Multi-programmed workloads with partitioned memory space...... 108

7.10 Improvements in the number of in-memory coherence updates via merging and multi-stage reorganization schemes for different host priority levels...... 109

7.11 Reorganization duration comparison for BoR and AoR. Both baseline AoR and multi-stage reorganization mechanisms are provided...... 110

7.12 Instantaneous and average bandwidth utilization during baseline, BoR and AoR (low, medium and high host priority) for blackscholes...... 112

7.13 Instantaneous and average bandwidth utilization during baseline, BoR and AoR (low, medium and high host priority) for facesim...... 113

7.14 Maximum slowdown during reorganization, and the overall speedup for BoR and AoR mechanisms...... 114

7.15 Performance of the 3D-stacked DRAM based DRU (HAMLeT) is compared to optimized implementations on CPU and GPU...... 117

7.16 Energy efficiency of the 3D-stacked DRAM based DRU (HAMLeT) is compared to optimized implementations on CPU and GPU...... 117

7.17 DRU (HAMLeT) in LO configuration is compared to the CPU...... 118

7.18 DRU (HAMLeT) in ML configuration is compared to the GPU...... 118

7.19 Comparison between the accelerator in-memory and on-chip as a memory controller based DMA...... 120

7.20 Offload latency and energy fraction in the overall operation for baseline and coherence optimized (using CMB) cases...... 121

7.21 Switch network and connection control hardware complexity for Spiral generated and reconfigurable permutation memory...... 123

8.1 Overview of row-column 2D-FFT computation in (8.3)...... 128

8.2 Overview of 3D-decomposed 3D-FFT computation in (8.4)...... 128

8.3 Logical view of the dataset for tiled and cubic representation...... 130

8.4 Block accesses are demonstrated in address space...... 130

8.5 Overview of the design generator tool...... 137

8.6 Overall view of the targeted architecture...... 138

8.7 Design space exploration for 2D-FFT with 3D-stacked DRAM. Isolines represent the constant power efficiency in Gflops/W (see labels)...... 143

8.8 Effect of tile size on performance and power efficiency for off-chip and 3D-stacked DRAM systems. (Rest of the parameters are fixed.)...... 144

8.9 Frequency (f) and streaming width (w) effects on power and performance for various problem/platform configurations (fixed tile size). Parameter combinations for the best design (GFLOPS/W) are labelled...... 145

8.10 Overall system performance and power efficiency comparison between naive and DRAM-optimized implementations for 1D, 2D and 3D FFTs using memory configurations conf-A and conf-D respectively...... 147

8.11 DRAM energy and bandwidth utilization for naive (nai) and DRAM-optimized (opt) implementations of selected FFTs and memory configurations...... 148

8.12 Overall latency and DRAM energy improvement with data reorganization for 2D FFT using HAMLeT in the logic layer...... 150

8.13 Overall latency and DRAM energy improvement with data reorganization for 3D FFT using HAMLeT in the logic layer...... 150

10.1 Graph traversal progress in breadth first search...... 160

10.2 Architectural overview of the traversal engine...... 161

List of Tables

2.1 Comparison of 3D-stacked DRAM systems Wide IO2, HBM, HMC...... 17

3.1 Summary of the address remapping rules...... 40

3.2 Basic formula identities...... 41

3.3 Key permutation rewrite rules...... 41

4.1 Original and remapped locations for the bit permutation given in Figure 4.7..... 58

6.1 Algorithm for determining the BSU configuration and coherence update for the host memory accesses in the multi-stage reorganization scheme...... 89

7.1 3D-stacked DRAM low level energy breakdown...... 96

7.2 3D-stacked DRAM configurations...... 97

7.3 Processor configuration...... 99

7.4 Multi-programmed workloads...... 107

7.5 Benchmark summary...... 116

7.6 HAMLeT power consumption overhead. HDL synthesis at 32nm...... 124

8.1 Tiled mapping rewrite rules...... 133

8.2 Cubic mapping rules. R_j refers to the right hand side of (j)...... 134

8.3 Tiled 2D-FFT algorithm derivation steps. (R9 and R10 are given in (9)-(10) in Table 8.1.)...... 134

8.4 Tiled 1D-FFT algorithm derivation steps. (R9 and R10 are given in (9)-(10) in Table 8.1.)...... 134

8.5 Cubic 3D-FFT algorithm derivation steps. (R14-R16 are given in (14)-(16) in Table 8.2.)...... 135

8.6 Performance results from Altera DE4. (GF = GFLOPS, TP = Theoretical peak).. 141

8.7 Main memory configurations...... 146

Chapter 1

Introduction

1.1 Motivation

Transistors continue to shrink in size with every new generation following Moore's law, which suggests doubling the number of transistors on chip every 18 months. However, transistor threshold and supply voltages have not been scaling proportionally in the post-Dennard era [43]. Future processing platforms are heading towards an energy constrained era, where parts of the chip, known as dark silicon, are under-clocked or completely turned off to meet the power consumption limitations.

In addition to the stringent on-chip energy constraints, scaling trends for the DRAM technology and off-chip pin count do not match the transistor scaling trends, pointing towards the memory wall problem. This problem is exacerbated in the dark silicon era by multiple cores and on-chip accelerators sharing the main memory and demanding more bandwidth. Memory bandwidth becomes a limiting factor in supplying data to the increasing number of on-chip compute units. Furthermore, memory energy consumption is another limiter for future processing platforms. Currently a 64-byte cache line sized DRAM memory access consumes around 20–35 nJ (70 pJ/bit for DDR3, 40 pJ/bit for LPDDR2), which is orders of magnitude more energy than an on-chip double-precision fused multiply-add operation (50 pJ in 40nm) or a 64-byte on-chip SRAM access (110 pJ for an 8 KB SRAM in 40nm) [56, 69, 77]. From a technology scaling standpoint, the DRAM memory system is a fundamental limiting factor both in terms of energy and bandwidth for future parallel processing platforms.
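As a quick back-of-the-envelope check of these figures (not part of the original text), a 64-byte access transfers 512 bits, so

\[ 512 \times 40\,\mathrm{pJ/bit} \approx 20\,\mathrm{nJ} \qquad\text{and}\qquad 512 \times 70\,\mathrm{pJ/bit} \approx 36\,\mathrm{nJ}, \]

which reproduces the quoted 20–35 nJ range and is roughly 400–700 times the energy of a 50 pJ fused multiply-add.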

Moreover, with the emergence of the big data era, data intensive workloads require intensive memory usage. Limited data reuse, disorganized data placement, and inefficient access patterns in such workloads put substantial pressure on the memory system. Due to large working sets and limited locality, these workloads require large roundtrip data movement between the processor and DRAM.

It is reported that current systems spend significant energy and time on data movement. Specifically, 28–40%, 35% and more than 50% of the total system energy is spent on data movement for scientific, mobile and general-purpose workloads, respectively [70, 91]. Although there are various disruptive memory technology developments such as 3D-stacked DRAM or wide I/O, the promised energy and bandwidth potential can only be achieved through efficient use of the memory system.

Exploiting memory level locality and parallelism is the key to efficiently utilizing DRAM based main memories. There exist architectural mechanisms that aim to recover the lost parallelism and locality by reordering memory accesses through scheduling [101], distributing accesses by interleaving [118], or prefetching data to hide the memory latency [27]. These approaches are mainly limited by the size of the reorder queues, and none of them addresses the data placement problem. Compiler-based code transformations, on the other hand, are limited by data dependencies and lack dynamic runtime information [42, 80, 112]. Data layout transformation via reorganizing data in memory targets the inefficient memory access patterns and the disorganized data placement at their origin. However, modern processing platforms lack the ability to perform efficient data layout transformations, mainly due to roundtrip data movements and the bookkeeping costs of remappings.

Moreover, the complex memory hierarchy is abstracted as a flat address space in modern programming models. Low level memory system details, including data layout, address mapping, and memory access behavior, are not exposed to the users. Hence, programmers spend significant manual effort to optimize the algorithmic behavior of the application for improved memory access performance. Oftentimes these algorithmic optimizations are specific to the platform's abstracted-away memory hierarchy and architectural details, lacking portability.

Therefore, given this landscape for future processing platforms, the memory system should be a first-class design consideration in achieving high performance and energy efficiency.

1.2 Thesis Approach and Contributions

To make the memory system a first-class design consideration in high-performance and energy-efficient processing platforms, this thesis combines a formal abstraction framework and hardware/software mechanisms for optimizing memory access. It focuses particularly on the following goals:

• A formal framework to represent and manipulate critical memory access optimizations through data layout, access pattern or address mapping transformations.

• An efficient hardware substrate that allows implementing formally represented optimizations such as changing the data layout and address mapping schemes, which are either fixed or very difficult to change dynamically in conventional processing platforms.

• Architectural/software mechanisms that utilize the developed hardware substrate to perform the memory access optimizations either explicitly as a part of the application, or transparently to improve the memory system performance.

• An algorithm/architecture co-design capability which enables co-optimizing computation flow, memory access and data layout, where the memory system is exposed to the algorithm design space through the formal framework.

To achieve these high-level goals, this thesis makes the following main contributions:

A Formal Approach to Memory Access Optimization. From a mathematical standpoint, this thesis makes the key observation that manipulating memory access patterns, data organization, as well as address mapping schemes can be abstracted as permutations. The thesis demonstrates a formal framework based on tensor (Kronecker product) algebra that allows structured manipulation of permutations represented as matrices. This formal framework serves as a mathematical language to express memory access optimizations as a set of rewrite rules. This allows integration of these rules into an existing formula rewrite system called Spiral, which enables automation [98]. Hence, various implementation alternatives can be derived automatically to exploit the locality and parallelism potential in memory.
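To make this view concrete, the following minimal Python sketch (not from the thesis; it assumes one common convention for the stride permutation L_{N,S}, namely i*S + j mapping to j*(N/S) + i) shows how the same permutation can be read either as a data reorganization or, for power-of-two sizes, as a remapping of the address bits:

# Stride permutation L_{N,S} under the assumed convention i*S + j -> j*(N/S) + i.
# For power-of-two N and S this is a rotation of the log2(N) address bits, which is
# why the same optimization can also be expressed as an address (bit) remapping.
def stride_perm(addr, n_bits, s_bits):
    r_bits = n_bits - s_bits
    i = addr >> s_bits                    # high r_bits of the index
    j = addr & ((1 << s_bits) - 1)        # low s_bits of the index
    return (j << r_bits) | i

def reorganize(data, s_bits):
    n_bits = (len(data) - 1).bit_length()
    out = [None] * len(data)
    for a, x in enumerate(data):
        out[stride_perm(a, n_bits, s_bits)] = x
    return out

print(reorganize(list(range(8)), 1))      # L_{8,2}: [0, 2, 4, 6, 1, 3, 5, 7]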

HAMLeT Architecture for Data Reorganization. This thesis presents the HAMLeT (Hardware Accelerated Memory Layout Transform) architecture with the goal of highly-concurrent, low-overhead and energy-efficient data reorganization performed in memory. The HAMLeT architecture is driven by the implications and requirements of the permutation based mathematical framework. A parallel architecture with multiple SRAM blocks connected via switch networks performs high throughput local permutations. A configurable address remapping unit implements the derived affine index transformations for data reorganizations, which allows handling the address remapping completely in hardware. Memory access optimizations derived by the formal framework are directly mapped onto the HAMLeT architecture.
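As a rough software model of the address remapping idea (a hypothetical sketch, not the thesis's hardware design; the 16-bit address width and the particular bit swap are made up for illustration), a bit-level address remapping can be described by letting each output address bit select one input address bit:

# Hypothetical model of bit-permutation based address remapping: bit_map[k] is the
# index of the input address bit that drives output address bit k.
def remap(addr, bit_map, width):
    return sum(((addr >> bit_map[k]) & 1) << k for k in range(width))

# Example: swap address bits 3 and 7 of a 16-bit address (e.g. exchanging a
# bank-select bit with a column bit); prints 0x78 for input 0xf0.
swap_3_and_7 = [7 if k == 3 else 3 if k == 7 else k for k in range(16)]
print(hex(remap(0x00F0, swap_3_and_7, 16)))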

The HAMLeT architecture is specifically tuned for a near-data processing implementation using 3D-stacked DRAM. Integrated within the 3D-stacked DRAM, behind the conventional interface, it not only minimizes the roundtrip data transfer but also utilizes the high-bandwidth through-silicon via (TSV) based access and the abundant parallelism provided by multiple layers/vaults/banks. Its parallel streaming architecture can extract high throughput via simple modifications to the logic layer, keeping the DRAM layers unchanged. Overall, HAMLeT provides an efficient substrate for highly-concurrent, low-overhead and energy-efficient data reorganization performed in memory.

Software-transparent Data Reorganization. Taking the HAMLeT architecture as a substrate that enables very efficient data reorganizations, we demonstrate software-transparent data reorganization performed at runtime. In this operation, the memory controller detects disorganized data placement by monitoring the physical address stream. It uses HAMLeT to change the data layout and the address mapping. The reorganized data layout leads to better utilization of memory locality and parallelism, which improves the host processor performance. This thesis demonstrates a series of hardware/software mechanisms to perform these operations, and to handle their side effects, efficiently in memory.

Explicit Data Layout Transformation. As another fundamental use case, this work focuses on offloading and accelerating common data reorganization routines observed in high-performance computing libraries (e.g., matrix transpose, scatter/gather, permutation, pack/unpack, etc.) using the HAMLeT architecture. This gives the ability to efficiently perform data layout and address mapping transformations from software, exposing these critical memory optimizations to the programmers. Furthermore, the thesis demonstrates special software/hardware mechanisms to handle the offloading and coherence management efficiently for the targeted near-data processing architecture.

Block Data Layout FFTs: Co-optimization of Computation, Memory Access and Data Layout. For certain types of problems, separating the optimization of computation and memory access is very difficult due to their interdependent nature. This thesis analyzes such a problem, the fast Fourier transform (FFT), where high-performance and energy-efficient implementations require a co-optimization between computation, memory access and data layout. Using the tensor based formal framework as a unifying representation, compute data flows, memory access patterns and data layouts are co-optimized as a part of the algorithm design space, where the customized block data layouts and their address remapping are realized by utilizing the HAMLeT architecture.
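As an illustration of the kind of formula this framework manipulates, the classical Cooley-Tukey factorization in Kronecker product notation (a standard identity, quoted here as background rather than taken from this chapter) already interleaves compute and data movement terms:

\[ \mathrm{DFT}_{mn} = (\mathrm{DFT}_m \otimes I_n)\, T^{mn}_n\, (I_m \otimes \mathrm{DFT}_n)\, L^{mn}_m , \]

where L^{mn}_m is a stride permutation and T^{mn}_n is a diagonal matrix of twiddle factors; the permutation and tensor factors are exactly the terms that translate into memory access patterns and data layouts.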

1.3 Thesis Organization

The rest of this dissertation is organized as follows.

First, Chapter 2 gives the background on the fundamental concepts that the thesis builds upon. It introduces the modern memory system organization and operation and various DRAM technologies, including planar and stacked DRAMs. It further introduces near-data processing (NDP) systems and then elaborates on 3D-stacked DRAM based NDP alternatives. Furthermore, it gives background on the data reorganization operation and its capabilities.

Next, Chapter 3 presents the mathematical framework used to describe and manipulate critical memory system optimizations, most importantly data reorganization and address remapping. It demonstrates various test cases that manipulate permutations for transforming data layouts, access patterns and address mappings.

Then, Chapter 4 introduces the HAMLeT architecture for efficient data reorganization and address remapping in memory. It demonstrates the micro-architecture of the fundamental components of HAMLeT, namely the data reorganization unit (DRU) and the address remapping unit (ARU).

Chapter 5 demonstrates two fundamental use cases for the highly-concurrent, energy-efficient and low-overhead data reorganization performed in memory. It describes the software-transparent and explicit operations. It also discusses the critical architectural and software support required for these modes of operation.

Next, Chapter 6 presents the system architecture and integration. It first presents the near-data processing architecture exploiting the 3D-stacked DRAM technology. Then it focuses on host/HAMLeT integration issues such as parallel host and HAMLeT memory accesses, in-memory coherence management, and software offload mechanisms for explicit operation.

Chapter 7 provides an experimental evaluation of the fundamental use cases of the 3D-stacked DRAM based implementation of the HAMLeT architecture. It provides the modeling and simulation details for the 3D-stacked DRAM based architecture. Then it presents a detailed analysis of the improvements and overheads of the data reorganization for both explicit and transparent operation. The evaluation is concluded by hardware synthesis results that analyze the power and area costs of the fundamental components of the HAMLeT architecture.

Then, Chapter 8 demonstrates a case where the critical details of the memory system optimizations are considered as a part of the algorithmic optimizations. Specifically, it presents a co-optimization of computation, memory access and data layout through block data layout FFT implementations which exploit the dynamic data layout operation.

Finally, following the related work in Chapter 9, Chapter 10 offers conclusions and future directions.

Chapter 2

Background

2.1 The Main Memory System

The main memory is becoming an increasingly critical bottleneck in achieving high performance and energy efficiency for various computing platforms. This problem, also known as the memory wall, is further exacerbated in the dark silicon era by multiple cores and on-chip accelerators (such as GPU, FPGA, and ASIC cores) sharing the main memory and demanding more memory bandwidth. A single chip heterogeneous multicore computing system is demonstrated in Figure 2.1. Even though emerging technologies, including 3D-stacked DRAM, address the main memory bottleneck by providing more bandwidth while consuming less energy, in practice the offered performance and energy efficiency potential is only achievable via the efficient use of the main memory. In this section, a detailed description of the fundamental operation of DRAM based main memory systems is followed by an overview of the 3D-stacked DRAM technology.

2.1.1 DRAM Organization and Operation

DRAM is a dynamic random access memory, which has been adopted as the main memory in almost all computing platforms, ranging from mobile and desktop systems and graphics processors up to supercomputers.

Figure 2.1: A single chip heterogeneous multicore computing system.

DRAM has a very simple structure: a single transistor connected to a capacitor forms a single bit of storage. This allows DRAM to achieve very high density. However, DRAM is a volatile memory, so the contents of the memory cells leak over time. They have to be refreshed periodically to retain the stored values.

As shown in Figure 2.2, DRAM modules are divided hierarchically into (from top to bottom): ranks, chips, banks, rows, and columns. Each rank is constructed out of multiple DRAM chips. The DRAM chips within a rank each contribute a portion of the DRAM word; they are accessed in parallel in lock-step to form a whole memory word. Each bank within a DRAM chip has a row buffer, which is a fast buffer holding the most recently accessed row (or page) in the bank. If the accessed bank and row pair are already active, i.e. the referenced row is already in the row buffer, then a row buffer hit occurs, reducing the access latency considerably. In this case only the CAS (column address strobe) command is sent to access a column out of the active row. On the other hand, when a different row in the active bank is accessed, a row buffer miss occurs. In this case, the DRAM array is precharged and the newly referenced row is activated in the row buffer, increasing the access latency and energy consumption. Furthermore, when the accessed bank has no active row, the DRAM array stays precharged, which is called a closed row. An access to such a bank still requires an activate command before the target row can be accessed. Therefore, to minimize the energy consumption and to achieve the maximum bandwidth from DRAM, one must minimize the row buffer misses. In other words, one must reference all the data from each opened row before switching to another row, exploiting the spatial locality in the row buffer.

Figure 2.2: Overview of an off-chip planar DRAM module organization.

Figure 2.3: Locality/parallelism vs. bandwidth, power and energy tradeoffs for DDR3-1600 DRAM (1.5V, single rank, 8 bank, x8 width) [7].
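To make the open-row behavior concrete, the following minimal Python sketch (not from the thesis; it simply assumes each access is already decoded into a (bank, row) pair) classifies an access stream into row buffer hits, row buffer misses, and closed-row accesses:

# Classify (bank, row) accesses against per-bank open rows, mirroring the
# hit / miss / closed-row cases described above.
def classify(accesses, num_banks):
    open_row = [None] * num_banks          # None means the bank is precharged (closed)
    stats = {"hit": 0, "miss": 0, "closed": 0}
    for bank, row in accesses:
        if open_row[bank] == row:
            stats["hit"] += 1              # CAS only
        elif open_row[bank] is None:
            stats["closed"] += 1           # activate, then CAS
        else:
            stats["miss"] += 1             # precharge, activate, then CAS
        open_row[bank] = row
    return stats

print(classify([(0, 5), (0, 5), (0, 9), (1, 2)], num_banks=8))
# {'hit': 1, 'miss': 1, 'closed': 2}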

In addition to the row buffer locality (RBL), bank level parallelism (BLP) has a significant impact on the DRAM bandwidth and energy utilization. Given that different banks can operate independently, one can overlap the latencies of the row precharge and activate operations with the data transfer on different banks. BLP enables high bandwidth utilization even if the RBL is not fully utilized provided that the accesses are well distributed among banks. However, frequently precharging and activating rows in different banks increase the power and total energy consumed.

Figure 2.3 demonstrates the impact of RBL/BLP on DRAM bandwidth, power and energy consumption. In this experiment contiguous blocks of data are transferred from DRAM, where adjacent blocks are perfectly distributed among different banks. Therefore, the size of the data block corresponds to the number of elements referenced from an opened row. In Figure 2.3 we observe that the achieved bandwidth increases with RBL (i.e. the size of the data blocks) and/or BLP (i.e. the number of banks to which the data blocks are distributed). If the BLP is limited (e.g. accesses are concentrated on a single bank), then RBL must be maximized to reach the maximum bandwidth. On the other hand, if the blocks are well distributed among banks, the maximum bandwidth can be reached with smaller block sizes, but at the cost of additional bank precharge/activate operations. Figure 2.3 also shows the energy consumption in Joules/GByte, which corresponds to the total energy spent in transferring a unit GB of data. Both BLP and RBL decrease the total static energy by transferring the same amount of data faster, yet RBL is the key to reducing the total activate/precharge energy.

Access Scheduling. DRAM memory controllers implement access scheduling to maximize the system throughput by exploiting RBL and BLP. Considering the latency disparity of different commands in the DRAM and the capability of handling multiple requests concurrently, memory access scheduling can improve the overall performance by reordering the requests. Policies that prioritize row-hit requests, such as first-ready first-come-first-served (FR-FCFS) [101], batch scheduling [86], and stall-time fair scheduling [87], are among the widely adopted scheduling mechanisms. A limitation of memory access scheduling in optimizing for system throughput is the limited size of the memory controller's request queues: the reordering capability is limited by the total number of in-flight requests in the scheduling queues. Access scheduling does not address the inefficient memory access issue at its origin; it only helps to reclaim some of the lost parallelism/locality by reordering requests within a limited window.
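The core of the FR-FCFS idea can be sketched in a few lines (a simplified illustration, not the scheduler modeled in this thesis): among the queued requests, prefer one that hits a currently open row; otherwise fall back to the oldest request.

# queue: list of (arrival_time, bank, row) ordered oldest first;
# open_row: mapping from bank to its currently open row.
def frfcfs_pick(queue, open_row):
    for req in queue:                      # "first ready": oldest row-hit request wins
        _, bank, row = req
        if open_row.get(bank) == row:
            return req
    return queue[0]                        # no row hit: plain first-come-first-served

# The request to (bank 1, row 7) is picked because row 7 is already open.
print(frfcfs_pick([(0, 0, 3), (1, 1, 7)], {1: 7}))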

Address Mapping. As described previously, DRAM memory modules consist of several hierarchical units such as channels, ranks, banks, rows and columns. Hence, a particular location in the DRAM memory system is described by these coordinates. However, the main memory is abstracted as a flat, linear address space. The memory controller implements an address mapping scheme that maps this linear physical address space onto the DRAM coordinates. In other words, the address mapping determines the DRAM coordinates, i.e. channel, rank, bank, row, and column, of a memory access. The address mapping scheme determines how data that are contiguous in the processor's abstraction are actually laid out in the DRAM. Hence, it is an important factor in determining the DRAM parallelism and locality utilization.

Figure 2.4: Various address mapping schemes including cache line interleaving (bank and channel) and row interleaving. (a) Bank interleaving: row | column | channel | rank | bank | offset (MSB to LSB). (b) Channel interleaving: row | column | rank | bank | channel | offset. (c) Row interleaving: row | channel | rank | bank | column | offset.

The most widely used address mapping policies include row (page) interleaving and cache line interleaving schemes. The row interleaving scheme places consecutive cache lines into a row of one bank, and then interleaves rows among banks and channels. The cache line interleaving scheme, in contrast, places consecutive cache lines into consecutive banks, ranks or channels. Example address mapping schemes with bank interleaving, channel interleaving and row interleaving are shown in Figure 2.4(a), Figure 2.4(b) and Figure 2.4(c), respectively. Row interleaving is optimized for locality, hoping that sequential accesses exhibit spatial locality, since sequential locations in the address space are mapped to the same row. Hence an application that exhibits spatial locality can enjoy row buffer hits, improving the memory access performance. On the other hand, cache line interleaving is optimized for parallelism: consecutive locations in the address space are scattered to different banks, aiming to increase the bank level parallelism. There are various hybrid approaches and sophisticated address mapping techniques, such as permutation based interleaving [118], that aim to better exploit the memory parallelism and locality. Moreover, application-specific efficient address mapping schemes can be determined via profiling.
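The following short Python sketch makes the mapping concrete by decomposing a physical address into DRAM coordinates for two of the schemes in Figure 2.4 (the field widths are hypothetical and chosen only for illustration):

# Decompose a physical address into DRAM coordinates; fields are listed from the
# least significant bits upward, matching the layouts sketched in Figure 2.4.
def decompose(addr, fields):
    coords = {}
    for name, width in fields:
        coords[name] = addr & ((1 << width) - 1)
        addr >>= width
    return coords

# Hypothetical widths: 6-bit offset, 7-bit column, 3-bit bank, 1-bit rank,
# 2-bit channel, and the remaining bits for the row.
row_interleaved  = [("offset", 6), ("column", 7), ("bank", 3),
                    ("rank", 1), ("channel", 2), ("row", 13)]
bank_interleaved = [("offset", 6), ("bank", 3), ("rank", 1),
                    ("channel", 2), ("column", 7), ("row", 13)]

addr = 0x12345678
print(decompose(addr, row_interleaved))    # consecutive lines share a row
print(decompose(addr, bank_interleaved))   # consecutive lines spread across banks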

Figure 2.5: Micron's hybrid memory cube (HMC). Reprinted from [94].

Figure 2.6: Overview of a HMC-like 3D-stacked DRAM [94].

2.1.2 3D-stacked DRAM Technology

3D-stacked IC is a broad term covering systems where multiple dies and/or wafers are stacked vertically. In the 3D-stacked DRAM technology, multiple DRAM dies and/or logic die(s) are stacked on top of each other and connected by vertical through silicon vias (TSV). TSVs provide connections through the wafer or die substrate so that the active layers of the stacked dies can communicate vertically. Multiple such dies are connected using micro-bumps. By sidestepping the I/O pin count limitations, dense TSV connections allow high bandwidth and low latency communication within the stack.

There are examples of 3D-stacked DRAM technology both from industry, such as the Micron Hybrid Memory Cube (HMC) [94], the JEDEC standard High Bandwidth Memory (HBM) [16], and the Tezzaron DiRAM [11], and from academia [49, 71].

Hybrid Memory Cube (HMC). Figure 2.5 shows the overview of the Hybrid Memory Cube (HMC). HMC features multiple DRAM dies and a base logic die which implements control units as well as high-speed serial communication interfaces, as shown in Figure 2.6.

It consists of multiple layers of DRAM (currently 4 to 8 DRAM layers are planned), where each layer also has multiple banks. A vertical slice of stacked banks (or groups of two banks) forms a structure called a vault. HMC 2.0 features 32 vaults. Each vault has its own independent TSV bus and vault controller [65]. This enables each vault to operate in parallel, similar to independent channel operation in conventional DRAM based memory systems. We will refer to this operation as inter vault parallelism.

Moreover, the TSV bus has a very low latency, much smaller than typical tCCD (column to column delay) values [38, 111]. This allows time sharing the TSV bus among the layers via careful scheduling of the requests, which enables parallel operation within a vault (e.g. [120]). We will refer to this operation as intra vault parallelism.

HMC is not a JEDEC standard. Instead of the conventional DDR standard, it has a serial communication interface. Serializer/deserializer (SerDes) units in the logic layer are used to interface off-stack, implementing packet-based network communication. Despite its complexity, this interface allows direct short-reach links, eliminating the need for a silicon interposer. Currently 10, 12.5 and 15 Gb/s data rate lanes are planned, where each link features 16 lanes and the logic layer implements 4 to 8 links. The 15 Gb/s lane configuration with 8 links can reach up to 480 GB/s aggregate off-stack bandwidth.
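As a sanity check of the quoted figure (assuming the aggregate number counts both link directions, with 16 lanes per link in each direction):

\[ 8~\text{links} \times 16~\text{lanes} \times 15~\text{Gb/s} = 1920~\text{Gb/s} = 240~\text{GB/s per direction}, \qquad 2 \times 240~\text{GB/s} = 480~\text{GB/s}. \]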

HMC can achieve very high energy efficiency as well, compared to conventional DRAM modules. It is reported that the overall energy efficiency is 10.48 pJ/bit, including the DRAM access, TSV transfer and SerDes based I/O [65]; 6.78 pJ/bit of this accounts for the logic layer, which combines the SerDes and controller units. This is a significant improvement when compared to the 70 pJ/bit energy efficiency of DDR3 and the 18-22 pJ/bit energy efficiency of GDDR5 [65, 89].

Furthermore, the base logic layer implements memory/vault controllers that schedule the DRAM commands while obeying the timing/energy constraints. This allows a less complex memory controller on the host side. Data switching between vaults and I/O links is achieved by a crossbar interconnect. Typically, these native control units do not fully occupy the logic layer and leave real estate that could be taken up by custom logic blocks [65]. However, the thermal and power constraints limit the complexity of the custom logic.

Figure 2.7: AMD's high bandwidth memory (HBM). Reprinted from [1].

High Bandwidth Memory (HBM). As shown in Figure 2.7, High Bandwidth Memory (HBM) integrates multiple DRAM dies and an optional logic die by using TSVs and micro-bumps. Dense TSV connections at the base layer are directly connected to the host processor, as opposed to the HMC system. This connection is implemented through a silicon interposer substrate. HBM is a standardized interface adopted by JEDEC [16].

HBM features a wide interface of 128 bits per channel. In HBM, a vertical stack of banks forms a channel, equivalent to a vault in HMC terminology. Each channel can operate in parallel, and not necessarily in lock-step fashion, which enables high parallelism. Furthermore, these channels are directly exposed to the host processor. Currently, HBM supports 8 channels per stack, where each channel supplies up to 32 GB/s of bandwidth, which results in 256 GB/s per stack. Note that this bandwidth is directly available to the host processor, in contrast to HMC, where part of the total bandwidth is spent on non-data packets over the serial network communication. Moreover, HBM can reach 6-7 pJ/bit of energy efficiency. Due to its simplicity, HBM is more energy efficient than the HMC system, which achieves around 10.5 pJ/bit [65, 89].
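This per-channel figure is consistent with the interface parameters listed in Table 2.1 (a back-of-the-envelope check, assuming the full 2 Gb/s per pin data rate):

\[ 128~\text{bits/channel} \times 2~\text{Gb/s} = 256~\text{Gb/s} = 32~\text{GB/s per channel}, \qquad 8 \times 32~\text{GB/s} = 256~\text{GB/s per stack}. \]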

Wide I/O. Wide I/O and Wide I/O 2 are also JEDEC compliant technologies, currently geared towards mobile devices. They involve a DRAM layer directly stacked on top of a processor. This can be achieved through flip-chip, 2.5D interposer, or 3D stacking using TSVs. Wide I/O uses a large number of TSVs at a lower frequency to achieve high energy efficiency. This results in the lowest I/O power for large bandwidth.

Table 2.1: Comparison of 3D-stacked DRAM systems Wide IO2, HBM, HMC.

Feature            | Wide IO 2            | HBM                   | HMC
Target system      | Mobile               | Graphics              | Server/Enterprise
JEDEC compliant    | Yes                  | Yes                   | No
Interface          | Wide parallel        | Wide parallel         | Serial SerDes
Interface width    | 256-512 bits         | 8x128 bits            | 4-8 links, each 16 lanes
Interface voltage  | 1.2 V                | 1.2 V                 | 1.2 V
Data rate per pin  | 1066 Mbps            | 2000 Mbps             | 10-12.5-15 Gbps
Max. Bandwidth     | 68 GB/s              | 256 GB/s              | 320 GB/s
System integration | Stacked on host      | Interposer            | High-speed links
Main shortcoming   | Thermal/power impact | Depends on interposer | Non JEDEC standard

Table 2.1 summarizes the differences and salient features of the discussed 3D-stacked DRAM technologies (HMC, HBM and Wide IO2).

Silicon Interposer Based Connection. The host memory connection technology is mainly driven by the implemented memory interface. For example, HBM and Wide I/O technologies implement a parallel direct connection interface. This requires a high number of TSV based I/O connections. A silicon interposer based substrate is used to implement the high number of I/O connections between the processor and the memory. A silicon interposer can provide a high number of connections, but it requires a separate interposer layer on the main substrate. Despite its standardized JEDEC compliant interface, an interposer based connection limits drop-in replacement flexibility. Replacing and distributing memory modules for different host processors becomes more difficult as the connection is tied to an interposer. Yet, the direct connection from the internal TSVs to the host enables simple communication. An example system architecture using a silicon interposer based connection is given in Figure 2.8.

Short-reach Link Based Connection. HMC features SerDes units integrated in the logic layer, which allow a serial network based interface. This technology requires a high-speed link based connection. A point-to-point link connection eliminates the need for a silicon interposer substrate.

Figure 2.8: System architecture using silicon interposer based connection.

Figure 2.9: System architecture using point-to-point SerDes based link connection.

The HMC goes into a conventional package and is integrated onto the PCB (printed circuit board), where the links directly connect it to the host. This limits its off-stack bandwidth; however, it enables flexible drop-in replacement capability. An example system architecture using a point-to-point SerDes based link connection is shown in Figure 2.9.

2.2 Hitting the Memory and Power Walls

Following Moore's law, which suggests doubling the number of transistors on chip every 18 months, improvements in transistor technology scaling enable putting more compute elements on chip in the form of processors and accelerators [84]. However, DRAM technology scaling has been lagging behind this trend, which causes the well-known memory wall problem [115].

With the rapid growth of the Internet and the World Wide Web, the big data era emerged, in which data intensive applications require intensive memory usage. In particular, the common roundtrip data movement between the processor and DRAM in such applications puts more pressure on the memory system.

Moreover, after the Dennard scaling era, energy became the limiter of high-performance systems. With Dennard scaling, every processor generation implemented twice as many transistors, clocked almost half again as fast as the previous generation, while consuming the same power. But after Dennard scaling, the doubled transistors consume 40% more power when clocked at the same frequency [68]. This creates the dark silicon problem, where parts of the chip are under-clocked or completely turned off to meet the power consumption limitations.

In addition to the on-chip power consumption, memory power consumption poses an even more critical problem for high-performance systems. Accessing off-chip data is 2 to 3 orders of magnitude more costly than an on-chip access. The energy costs of a double-precision fused multiply-add operation and a 64-bit on-chip SRAM access are estimated at around 50 pJ and 14 pJ, whereas an off-chip access of the same width costs more than 10 nJ [56, 69, 77]. Moreover, with transistors shrinking every generation, wire energy becomes more critical since wire capacitance per mm remains the same. When executing commodity data intensive workloads, modern processors spend a substantial amount of time and energy on data movement. It is reported that current systems spend 28–40%, 35% and more than 50% of the total system energy on data movement for scientific, mobile and general-purpose workloads, respectively [70, 91]. According to technology scaling trends, this ratio will increase in future systems [69].

This landscape of computing systems and the technology scaling trends point towards an energy constrained computing era. In this era, the energy consumption will be dominated by communication costs, both on-chip and off-chip. The emergence of data intensive applications, which require intensive memory access, data movement and I/O, further aggravates the communication energy problem. Future high-performance computing systems have to find ways to minimize data movement to tackle this energy consumption limitation.

2.3 Near Data Processing

Pursuing the Von Neumann computation model, conventional processing platforms have been built following a computation-centric model. In the computation-centric model, data is moved throughout the memory hierarchy, from disk to on-chip caches, as necessary to perform arithmetic and logic operations on the centralized host processor. This model has been sufficient for decades, when intensive arithmetic and logic computations were much more costly than data movement. Furthermore, deep memory hierarchies and large caches help alleviate the cost of data movement by exploiting data locality. However, prevalent technology scaling trends point towards processing systems where the latency and energy costs are dominated by data movement. Furthermore, data intensive applications that operate on large datasets require data movement across the entire memory hierarchy, which further exacerbates the communication costs.

The shift from computation dominated processing systems to communication dominated systems requires a shift from the computation-centric processing model to a data-centric one. In contrast to the computation-centric model, the data-centric processing model distributes the computation to multiple levels of the memory hierarchy. Hence, instead of moving the data to a centralized processor, the processing is handled where the data reside, using nearby processing elements. With the near-data processing (NDP) paradigm, instead of moving data throughout the system, the compute is moved to the data. Hence, the NDP paradigm can reduce the data movement costs significantly.

Near data processing is not a new concept. It has been explored in a variety of technology contexts in the form of custom compute units near on-chip/off-chip memory, I/O, or disk [39, 46, 57, 60, 67, 90, 93]. Most of these NDP approaches focused on computation near main memory, in the form of processing elements and DRAM or embedded DRAM (eDRAM) integrated in the same chip. Although these efforts demonstrated significant improvements, NDP technology did not see widespread adoption. One of the main factors in this limitation is the manufacturing difficulty and costs: DRAM and logic process technologies are optimized differently. Proposed NDP systems were either difficult and costly to manufacture, or the resulting implementations offered limited performance.

Nevertheless, with the emergence of big-data workloads and communication-dominated processing systems, reducing data movement becomes inevitable. Furthermore, with technology advances such as 3D stacked integration, NDP has regained substantial research attention [65, 76, 114, 120].

Next, we focus on NDP systems exploiting 3D-stacked integration technology to reduce off-chip data movement.

2.3.1 3D Stacking Based Near Data Processing

3D-stacked IC technology enables the integration of multiple dies implemented in different transistor technologies. 3D-stacked DRAM is an example of this integration technology, where multiple DRAM layers and logic layer(s) are vertically integrated using through silicon via (TSV) based connections [16, 94]. The dense TSV based vertical connection enables very high bandwidth and low latency internal communication, bypassing the limited off-chip pins. This creates an efficient substrate to implement near-data processing systems. There exist several alternatives for integrating NDP accelerators near 3D-stacked DRAM that provide tradeoffs in cost, energy, complexity and performance. Figure 2.10 demonstrates three main alternatives for 3D-stacked DRAM based NDP.

The first option is to integrate the accelerator off-stack (1 in Figure 2.10). This is the simplest scheme, where the accelerator is external to the stack. Hence, the memory stack design is not modified. Furthermore, the thermal impact of the accelerator on the DRAM layers is eliminated. However, the accelerator and the memory stack communicate through external links or an interposer. This limits the bandwidth between the accelerator and the DRAM. Moreover, each access still suffers from the high energy cost of external data transfer.

Integration of the accelerator as a part of the memory stack (i.e., 2 and 3 in Figure 2.10) allows utilizing the bandwidth and energy-efficiency potential of the 3D-stacked DRAM more efficiently.


Figure 2.10: Accelerator integration to a 3D-stacked DRAM: (1) Off-stack, (2) stacked as a separate layer, (3) integrated in the logic layer.

Integration as a separate layer (2) reduces the memory access distance to TSV based accesses. However, the data transfer requires additional hops through the logic controller layer to access the DRAM layers. To avoid the access through the logic layer, the accelerator layer can implement its own DRAM controller interfaces. However, this further increases the complexity of the accelerator layer. It also requires arbitration between multiple controllers in the stack.

Furthermore, integrating the accelerator as a part of the logic layer (i.e., 3), behind the external interface, unlocks the access to the internally available bandwidth and energy efficiency provided by the TSV based vertical connection. In this scheme, the accelerator directly communicates with the logic layer memory controllers. Similar to 2, this scheme also eliminates the external data transfer. Moreover, it eliminates the detour through the logic layer for accesses between the DRAM and the accelerator. The main shortcoming of this approach, however, is the limited area, power consumption and temperature budget for a custom accelerator implementation in the logic layer. As discussed, the logic layer already includes native control units such as an interconnection fabric, memory/vault controllers, SerDes units, etc. It is reported that these units leave real estate that can be taken up by custom logic [94]. However, compared to a separate die, the area and power consumption headroom is more limited.

2.3.2 Hardware Specialization for NDP

Complete processing elements integrated in the logic layer are limited in utilizing the internal bandwidth while staying within an acceptable power envelope. For example, in [97], although simple low EPI (energy per instruction) cores are used in the logic layer, some of the SerDes units are deactivated to meet the power and area budgets, sacrificing the off-chip bandwidth. In [64], it is reported that more than 200 PIM (processing in memory) cores are required to sustain the available bandwidth, which exceeds the power limit for the logic layer. Furthermore, NDP with a simple logic layer configuration that meets the power and area constraints can even lead to application slowdowns [117].

Specialized hardware units are more efficient at providing high throughput (hence memory intensity) per unit of power consumption, far beyond what general purpose processors can provide. Modern general purpose processors implement mechanisms such as instruction decoding, register renaming, out-of-order scheduling, branch prediction, prefetching, etc. that do not perform the actual computation but assist the processor in increasing the execution throughput. These overheads create a 2 to 3 orders of magnitude energy consumption difference between a specialized ASIC and a general purpose processor processing the same application [62]. Hence, hardware accelerators specialized in simple compute mechanisms can be a good fit for area and power limited near-data processing systems.

2.4 Data Layout and Reorganization

As a simple compute mechanism candidate that can be a good fit for 3D-stacked DRAM based NDP, we focus on data reorganization. We refer to an operation that relocates data in the memory as data reorganization. Data reorganization is a fundamental and powerful technique to optimize the memory access performance.

An application's memory access performance is fundamentally determined by its algorithmic behavior, which generates the logical sequence of memory accesses, and the data layout in memory, which determines the actually accessed physical locations. In current programming model abstractions, the actual data layout and memory access behavior are not exposed to the users. Hence, programmers spend significant manual effort to optimize the algorithmic behavior of the application for improved memory performance. Often these optimizations are specific to a platform's memory hierarchy and architectural details. Moreover, existing mechanisms such as memory access scheduling and compiler transformations provide limited improvements by reshaping the memory access patterns to partially recover the underutilized memory locality and parallelism.

The underlying data layout can be the limiting factor where solely optimizing the application's algorithmic behavior or relying on the recovery mechanisms would not be sufficient for the best memory access. For example, if two data elements sequentially required by an application are mapped to two different rows in a DRAM bank, neither algorithmic optimizations nor memory access reordering can avoid the inevitable row buffer miss. Moreover, different applications, or even different algorithms for the same application, may favour different data layouts for better memory access. The optimized layout for an application can also change with different architectures.

For an application which exhibits poor memory access performance (i.e., low bandwidth, low energy efficiency, high row misses, low memory parallelism), reorganizing its data layout can improve the memory access performance significantly. In Figure 2.11 we demonstrate a motivational example of memory access improvement via data reorganization. We run the memory trace for the PARSEC [32] benchmark facesim, available from [14], on the USIMM DRAM simulator [35] modeling a single channel DDR3-1600 DRAM [7] with open page policy and FR-FCFS (first-ready, first-come first-serve) scheduling. Figure 2.11(a) demonstrates the normalized average address bit flip rate in the memory access stream. The address bit flip rate (BFR) is determined by recording the changes (or flips) for each DRAM address bit. The basic idea is that highly flipping bits correspond to frequent changes in a short time, which are better suited to be mapped onto bank or rank address bits to exploit the parallelism. On the other hand, less frequently flipping bits are better suited to be mapped onto row address bits to reduce the misses in the row buffer.

Figure 2.11: (a) Normalized bit flip rate in the physical address stream, (b) simple address remapping scheme, (c) row buffer miss rate reduction via the address remapping, and (d) the resulting performance improvements for 8-wide and infinite compute power systems for facesim from PARSEC.

The data layout for this application can be reorganized such that the high BFR bits are mapped into the bank/rank region of the DRAM address and the low BFR bits are mapped into the row region of the DRAM address.

Figure 2.11(b) shows an example remapping that changes the address mapping such that the highly flipping bits {26, 25, 19} are swapped with {15, 14, 13}. Figure 2.11(c) presents the reduction in the row buffer miss rate when the data layout is reorganized as described. Finally, Figure 2.11(d) shows the achieved speed-up via the new mapping on an 8-wide 3.2 GHz processor. It also shows the upper bound for performance improvement where the overall runtime is purely memory bound, such that the non-memory instructions are executed in a single cycle.

The results in Figure 2.11 show that the default data layout for an application can cause inefficient performance, whereas reorganizing its dataset leads to significant performance improvement. Hence, fixed global data layout and address mapping schemes can cause inefficient memory system utilization.

There exist efficient address mapping schemes, such as [118], that improve the memory access performance for a variety of applications. Moreover, application-specific address mapping schemes that lead to higher parallelism and locality can be determined.

The previous example demonstrates an architectural solution; given the fixed application behavior, data reorganization improves the memory performance transparently. Data reorganization can also be a critical building block of an application. For example, data reorganizations appear as explicit operations in several scientific computing applications such as signal processing, molecular dynamics simulations and linear algebra computations (e.g., matrix transpose, pack/unpack, shuffle, data format change, etc.) [12, 21, 30, 55, 58]. High performance libraries generally provide optimized implementations of these reorganization operations [5, 55]. There are several works demonstrating reorganization of the data layout into a more efficient format to improve performance [21, 36, 58, 92, 107, 108]. Hence, such applications can benefit from a substrate that allows efficient data reorganization exposed to the programmer.

In addition to the ability to accelerate critical data reorganization type operations, efficient data reorganization capability exposed to the programmer gives the ability to reconsider entire algorithms to exploit efficient dynamic data layouts. There are intrinsically difficult problems where an efficient solution includes the co-optimization of the algorithm behavior, both in terms of compute dataflow and memory access pattern, as well as the underlying data layout [21]. Such problems can benefit from the ability to consider and, more importantly, manipulate the underlying data layout as a part of their algorithm design space.

However, reorganizing data in the memory comes with substantial performance overheads and bookkeeping costs. Considering the high precision and large data sizes, a significant fraction of the dataset resides in the main memory. Therefore, ideally, most of the data reorganization operations are memory to memory. However, on conventional systems data needs to traverse the memory hierarchy and the processor, incurring large energy and latency overheads. The roundtrip data transfer consumes the shared off-chip memory bus bandwidth, which also degrades the performance of the other applications that run on the system. Furthermore, there exists a bookkeeping cost for tracking the mapping from original to reorganized locations. Conventionally, this is either handled by implementing a hardware lookup table that stores the mappings from each original location to the remapped location, or through costly OS page table updates and TLB invalidations.

NDP Approach for Data Reorganization. From a workload characteristic standpoint, data reorganization does not require intensive arithmetic and logic computations. Furthermore, these operations are memory to memory; in other words, the input data reside in the main memory and the output data end up in the main memory in a different shape. These two fundamental characteristics make data reorganization a good fit for NDP.

Chapter 3

A Formal Approach to Memory Access Optimization

This chapter presents a mathematical framework used to describe and manipulate memory access optimizations. Transformations of data organizations, access patterns and address mappings are represented by permutations. A tensor (or Kronecker) product based mathematical framework is introduced to structurally manipulate the permutations represented as matrices. This framework enables extracting important practical implications related to the data and control flow, address mapping, memory access behavior, and data layout. After exemplifying the key practical implications, a formula rewrite system is introduced that enables automated restructuring of permutations.

3.1 Mathematical Framework

A permutation is defined as a rearrangement of all the members of a set into an order. It is a bijective function from a set to itself. From a memory access optimization perspective, data layouts, memory access patterns and address mapping schemes all correspond to a particular mapping function that represents an order of entities. For example, the data layout corresponds to the mapping of a set of data to a physical memory space. The memory access pattern represents the order of locations that are accessed sequentially. Furthermore, the address mapping is a translation function which maps virtual addresses to physical addresses, or physical addresses to DRAM locations. Hence, data layout transformation, access pattern reordering or address remapping can all be captured by permutations.

3.1.1 Formula Representation of Permutations

Permutations can be expressed as matrix-vector multiplication such that dout = Pn.din, where Pn is the n-by-n permutation matrix, and din and dout are n-element input and output vectors. A permutation matrix is a binary matrix where each row or column has a single non-zero element. It permutes the input vector based on the locations of the non-zero elements. For example, a permutation matrix P8, which is a special permutation L8,2 (a stride permutation, as we will see later), is given as follows:

$$ P_8 = L_{8,2} = \begin{bmatrix}
1 & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\
\cdot & \cdot & 1 & \cdot & \cdot & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & \cdot & 1 & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & \cdot & \cdot & \cdot & 1 & \cdot \\
\cdot & 1 & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & 1 & \cdot & \cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot & \cdot & \cdot & 1 & \cdot & \cdot \\
\cdot & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & 1
\end{bmatrix} \qquad (3.1) $$

The Kronecker (or tensor) product based formalism developed in [109] captures the structures in the decomposition of the permutation matrices. This formalism is further extended in SPL (Signal Processing Language) to represent fast signal processing algorithms [116]. In this thesis, the Kronecker product formalism serves as the language for expressing and manipulating permutations, and it is the basis of the developed mathematical framework.

The basic elements of the framework are special structured matrices and matrix operators. Before going forward, we define these special matrices and operators.

Special Matrices. Special structured matrices are the building blocks of the formal framework. First, the n×n identity matrix In and the n×n reverse identity matrix (swap permutation) Jn are given:

$$ I_n = \begin{bmatrix} 1 & & \\ & \ddots & \\ & & 1 \end{bmatrix}, \qquad J_n = \begin{bmatrix} & & 1 \\ & \iddots & \\ 1 & & \end{bmatrix} $$

Permutation matrices are another family of important building blocks. For example, the stride permutation operation, denoted as $L_{nm,n}$ or $L^{nm}_{n}$, takes the elements from $d_{in}$ at stride $n$ in a modulus $nm$ fashion and puts them into consecutive locations in the $nm$-element $d_{out}$ vector:

$$ d_{in}[in + j] \to d_{out}[jm + i], \quad \text{for } 0 \le i < m,\ 0 \le j < n. $$

An example stride permutation matrix $L_{8,2}$ was previously given in (3.1). Another specific permutation matrix is the cyclic shift matrix $C_{m,n}$, which applies a circular shift of $n$ elements to the $m$-element input vector:

$$ C_{m,n} = \begin{bmatrix} & I_n \\ I_{m-n} & \end{bmatrix}. $$

For example, the cyclic shift matrix $C_{5,2}$ is given as:

$$ C_{5,2} = \begin{bmatrix}
\cdot & \cdot & \cdot & 1 & \cdot \\
\cdot & \cdot & \cdot & \cdot & 1 \\
1 & \cdot & \cdot & \cdot & \cdot \\
\cdot & 1 & \cdot & \cdot & \cdot \\
\cdot & \cdot & 1 & \cdot & \cdot
\end{bmatrix} $$
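To make these building blocks concrete, the following minimal NumPy sketch (an editorial illustration, not part of the thesis toolchain; the helper names stride_perm and cyclic_shift are ours) constructs $L_{nm,n}$ and $C_{m,n}$ directly from their definitions.

```python
import numpy as np

def stride_perm(nm, n):
    """Stride permutation L_{nm,n}: d_in[i*n + j] -> d_out[j*m + i], with m = nm // n."""
    m = nm // n
    P = np.zeros((nm, nm), dtype=int)
    for i in range(m):
        for j in range(n):
            P[j * m + i, i * n + j] = 1   # row = output index, column = input index
    return P

def cyclic_shift(m, n):
    """Cyclic shift C_{m,n}: the element at position i moves to position (i + n) mod m."""
    C = np.zeros((m, m), dtype=int)
    for i in range(m):
        C[(i + n) % m, i] = 1
    return C

print(stride_perm(8, 2) @ np.arange(8))    # [0 2 4 6 1 3 5 7] -- stride-2 gather
print(cyclic_shift(5, 2) @ np.arange(5))   # [3 4 0 1 2]
```

Applying $L_{8,2}$ to the vector $(0,1,\dots,7)$ gathers the stride-2 elements into consecutive positions, matching the matrix in (3.1).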

Finally, two basic vectors, the n-element column vector of 0's, $\mu_n$, and the n-element column vector of 1's, $\nu_n$, are given as follows:

$$ \mu_n = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}, \qquad \nu_n = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} $$

Matrix Operators. The product is a straightforward matrix operator, given as A ⋅ B = AB. We then introduce two further matrix operators, the direct sum (⊕) and the tensor (Kronecker) product (⊗). These operators are useful for structured combinations of the matrices.

The direct sum operator (⊕) composes two matrices into a block diagonal:

$$ A \oplus B = \begin{bmatrix} A & \\ & B \end{bmatrix}. $$

The tensor (Kronecker) product is a very important matrix operator in this framework, defined as:

$$ A \otimes B = [a_{i,j}B], \quad \text{where } A = [a_{i,j}]. $$

Two important special cases arise when either operand of the tensor product is the identity matrix. If the left operand is the identity matrix ($I_n \otimes A$), then it forms a simple block diagonal matrix:

$$ I_n \otimes A = \underbrace{A \oplus A \oplus \cdots \oplus A}_{n \text{ times}} = \begin{bmatrix} A & & \\ & \ddots & \\ & & A \end{bmatrix} $$

As we will see later, this operation corresponds to the repetition of a small kernel or permutation over a large dataset. On the other hand, if the right-hand operand is the identity matrix ($A \otimes I_n$), then it produces another interesting case where each element is replicated $n$ times:

$$ \begin{bmatrix} a & b \\ c & d \end{bmatrix} \otimes I_2 = \begin{bmatrix} a & & b & \\ & a & & b \\ c & & d & \\ & c & & d \end{bmatrix} $$
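Both special cases can be checked directly with NumPy's Kronecker product; the snippet below is a small editorial sketch, not part of the thesis.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])

# I_3 (x) A: block-diagonal repetition of A (repeat a kernel over a larger dataset).
print(np.kron(np.eye(3, dtype=int), A))

# A (x) I_3: every entry a_ij of A is expanded to a_ij * I_3, replicating the
# structure of A at a 3-element granularity.
print(np.kron(A, np.eye(3, dtype=int)))
```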

Interpretations. The permutations created by the provided matrices and matrix operators can be used to express a data layout, memory access pattern or an address mapping.

For example, a streaming access to $n$ elements corresponds to the identity matrix $I_n$. On the other hand, $L_{m,k}$ represents stride-$k$ accesses to a group of $m$ elements. These can be combined such that the overall access pattern is given as $L_{m,k} \otimes I_n$. This case represents a block-stride access pattern, where $n$-element burst accesses are separated by $k \cdot n$-element strides.

Similarly, a permutation can express a data layout. For example, given an $n \times n$ matrix and assuming row-major increasing indices for the matrix elements, $I_{n^2}$ maps the consecutive elements linearly into the memory space, which leads to a row-major data layout. However, $L_{n^2,n}$ maps the consecutive elements into stride-$n$ separated locations in the memory, which leads to a column-major data layout.
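As a quick sanity check of this interpretation, the following editorial sketch (the stride_perm helper from the earlier sketch is repeated here for self-containment) applies $L_{n^2,n}$ to a row-major flattened matrix and recovers the column-major layout.

```python
import numpy as np

def stride_perm(nm, n):
    # L_{nm,n}: d_in[i*n + j] -> d_out[j*(nm//n) + i]
    m = nm // n
    P = np.zeros((nm, nm), dtype=int)
    for i in range(m):
        for j in range(n):
            P[j * m + i, i * n + j] = 1
    return P

n = 3
M = np.arange(n * n).reshape(n, n)              # logical n x n matrix
row_major = M.reshape(-1)                       # layout corresponding to I_{n^2}
col_major = stride_perm(n * n, n) @ row_major   # layout corresponding to L_{n^2,n}
print(np.array_equal(col_major, M.T.reshape(-1)))   # True: column-major layout
```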

As we will see next, these matrix expressions can be restructured through formula identity rules.

3.1.2 Formula Identities

The matrix expressions discussed so far, which combine the provided special matrices through matrix operators, are called formulas. These formulas are the basic representations in SPL [116], so the terms formula and SPL expression are used interchangeably. Formulas essentially represent an algorithm (or a dataflow) of a matrix that transforms the input vector. The matrices can be any linear transform, but this thesis mainly focuses on permutations.

Different formulas can correspond to the same matrix, which means that they essentially perform the same computation but with a different algorithm. This is achieved through formula identities. A formula identity is a mathematical identity that restructures a matrix, which allows representing the same matrix, i.e., ultimately the same computation, with different formulas, i.e., different algorithms. A few example formula identities are given below.

An ⊗Bm → (An ⊗Im)(In ⊗Bm)

An ⊗Bm → Lmn,n(Bm ⊗An)Lmn,m
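Both example identities can be verified numerically; the sketch below (an editorial aside, with the stride_perm helper repeated for self-containment) checks them for small random matrices.

```python
import numpy as np

def stride_perm(nm, n):
    # L_{nm,n}: d_in[i*n + j] -> d_out[j*(nm//n) + i]
    m = nm // n
    P = np.zeros((nm, nm), dtype=int)
    for i in range(m):
        for j in range(n):
            P[j * m + i, i * n + j] = 1
    return P

rng = np.random.default_rng(0)
n, m = 2, 3
A = rng.integers(0, 5, (n, n))
B = rng.integers(0, 5, (m, m))

# A_n (x) B_m = (A_n (x) I_m)(I_n (x) B_m)
lhs = np.kron(A, B)
rhs1 = np.kron(A, np.eye(m, dtype=int)) @ np.kron(np.eye(n, dtype=int), B)
print(np.array_equal(lhs, rhs1))   # True

# A_n (x) B_m = L_{mn,n} (B_m (x) A_n) L_{mn,m}
rhs2 = stride_perm(m * n, n) @ np.kron(B, A) @ stride_perm(m * n, m)
print(np.array_equal(lhs, rhs2))   # True
```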

We will particularly develop formula identities to manipulate permutations expressed in SPL for various goals. These goals include, but are not limited to, optimizing memory accesses for better parallelism and locality, transforming the data layout for a particular application, changing the DRAM address mapping, and restructuring a permutation to fit the local permutations into the local memory.

Figure 3.1: An example data reorganization (L8,2) on 8 elements and the corresponding address remapping scheme: (a) L8,2 data permutation, (b) calculating the remapped address.

Next we particularly focus on the data reorganization where the data layout transformation, address remapping and memory access patterns are represented by permutations.

3.2 Permutation as a Data Reorganization Primitive

From a data reorganization perspective, a permutation takes an input dataset din and rearranges the elements to produce dout such that dout = Pn.din. Here, Pn is the n-by-n permutation matrix, din and dout are n-element input and output datasets represented as vectors.

Let's take L8,2 as an example data reorganization. L8,2 is an 8 × 8 permutation matrix, hence it permutes 8-element input vectors. Figure 3.1(a) shows the corresponding reorganization of an 8-element dataset according to L8,2. After the permutation reorganizes the dataset, the original addresses will hold stale data. For example, an access to location 001 will return the data c, whereas that location originally stored the data b.

As presented in Figure 3.1(b), an address remapping mechanism that forwards the input addresses (x) to their new locations (y) can solve this problem. Each permutation has a corresponding index transformation that represents the address remapping for a data reorganization. Figure 3.1(b) shows the index transformation of the permutation L8,2, given as y = Bx. This unit will forward every access x to its new location y via y = Bx. Following the previous example, an access to x = 001 will be forwarded to the location y = 100 via y = Bx such that the returned data will be b, as expected. This simple example demonstrates that the index transformation can be used to remap the addresses after a reorganization.
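The remapping for this example can be emulated in a few lines; the sketch below (editorial, with a hypothetical remap_address helper) applies the 3-bit matrix B from Figure 3.1(b), which simply rotates the address bits.

```python
def remap_address(x, n_bits=3):
    """Address remapping for L_{8,2}: y = B x on the address bits,
    where B is the bit matrix from Figure 3.1(b) (y2 = x0, y1 = x2, y0 = x1)."""
    bits = [(x >> k) & 1 for k in reversed(range(n_bits))]   # [x2, x1, x0]
    B = [[0, 0, 1],
         [1, 0, 0],
         [0, 1, 0]]
    y_bits = [sum(B[r][c] * bits[c] for c in range(n_bits)) % 2 for r in range(n_bits)]
    return sum(b << (n_bits - 1 - k) for k, b in enumerate(y_bits))

for x in range(8):
    print(f"{x:03b} -> {remap_address(x):03b}")
# e.g. 001 -> 100: a request for the data originally at 001 is forwarded to 100.
```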

Permutations also capture the memory access pattern information of the data reorganization. Until now, the discussed permutations focused only on spatial locations and did not consider the time ordering. While reorganizing a dataset through a permutation, the sequence of the memory accesses is generated based on the performed permutation. When performing the permutation L8,2, as shown in Figure 3.1(a), the corresponding memory accesses for reading and writing will be in sequential and permuted order, respectively. It is possible to rearrange the order of accesses, again through permutations.

Overall, permutations are an effective means of representing and manipulating data reorganization operations. In this section our main goals are:

• Use permutations as primitive operations to represent data reorganizations.

• Develop a systematic way of determining the index transformation for general permutation based reorganization operations to represent the address remappings.

• Develop a formula rewrite system and restructure the permutations to optimize for memory access patterns.

3.3 Index Transformation

Each permutation has an index transformation that transforms the indices of the input data and maps them to the corresponding locations in the output. In the context of data reorganizations, the index transformation corresponds to the address remapping.

Our goal is to define a mapping function fπ that determines the corresponding index transformation for a given permutation. Specifically, it will determine the address remapping from the original location x to the remapped location y such that:

y = Bx+c (3.2)

Here, assuming n-bit addresses, x and y are n-element bit-vectors that represent the input and remapped addresses, respectively. B is an n × n bit-matrix and c is an n-element bit-vector. In this form, the index transformation is an affine transformation on the address bits. This thesis only focuses on the class of permutations whose index transformations are affine functions and where B is a permutation matrix, i.e., with a single non-zero in each row or column. This class covers many important practical permutations including the stride permutation, bit reversal, swap permutation and their combinations through product, tensor product and direct sum. Hence, this class of permutations can express many important data reorganizations such as matrix transpose, pack/unpack, swap, shuffle, copy/move, multidimensional data array rotation, recursively blocked data layouts, etc.

The bit representation of the index transform allows us to calculate all the remapped addresses. A crucial implication of this property is that one can reorganize the elements within the memory and then use the index transformation for address remapping such that the logical abstraction is kept unchanged. As we will see later, this enables data layout transformation completely in hardware, transparent to the software stack.

Stride permutation. First, we focus on the class of permutations that are constructed by combinations of the stride permutations (L) with identity matrices (I) via tensor product (⊗) and matrix multiplication (.). This class of permutations can represent a variety of important data reorganization operations such as shuffle, pack/unpack, matrix transpose, multi-dimensional data array rotation, blocked data layouts, etc. For this class of permutations, the fundamental properties of fπ are given as follows:

$$ f_\pi(L_{2^{nm},2^n}) \to B = C_{nm,n},\quad c = \mu_{nm} \qquad (3.3) $$

$$ f_\pi(I_{2^n}) \to B = I_n,\quad c = \mu_n \qquad (3.4) $$

$$ f_\pi(P \cdot Q) \to B = B_P \cdot B_Q,\quad c = c_P + c_Q \qquad (3.5) $$

$$ f_\pi(P \otimes Q) \to B = B_P \oplus B_Q,\quad c = \begin{bmatrix} c_P \\ c_Q \end{bmatrix} \qquad (3.6) $$

For this class of permutations, the B matrix is constructed out of cyclic shifts, multiplication and composition (direct sum) operators. Here we emphasize that the combinations of these operators can represent "all" possible permutations on the address bits. Hence this class covers all of the data permutations whose index transformation is also a permutation on the address bits. We will later focus on the practical implications of these properties.
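The rules (3.3)-(3.6) lend themselves to a direct recursive implementation. The sketch below is an editorial illustration (the formula encoding and function names are ours); it derives the bit matrix B for a small formula in this class, where all c vectors are zero, and checks it against the element-level permutation.

```python
import numpy as np

def stride_perm(nm, n):
    # L_{nm,n}: d_in[i*n + j] -> d_out[j*(nm//n) + i]
    m = nm // n
    P = np.zeros((nm, nm), dtype=int)
    for i in range(m):
        for j in range(n):
            P[j * m + i, i * n + j] = 1
    return P

def cyclic_shift(m, n):
    C = np.zeros((m, m), dtype=int)
    for i in range(m):
        C[(i + n) % m, i] = 1
    return C

# Formulas as nested tuples: ('L', m, n) = L_{2^m,2^n}, ('I', n) = I_{2^n},
# ('*', f, g) = product, ('x', f, g) = tensor product.
def f_pi(formula):
    op = formula[0]
    if op == 'L':
        return cyclic_shift(formula[1], formula[2])       # rule (3.3)
    if op == 'I':
        return np.eye(formula[1], dtype=int)              # rule (3.4)
    if op == '*':
        return (f_pi(formula[1]) @ f_pi(formula[2])) % 2  # rule (3.5)
    if op == 'x':                                         # rule (3.6): direct sum
        Bp, Bq = f_pi(formula[1]), f_pi(formula[2])
        B = np.zeros((Bp.shape[0] + Bq.shape[0],) * 2, dtype=int)
        B[:Bp.shape[0], :Bp.shape[1]], B[Bp.shape[0]:, Bp.shape[1]:] = Bp, Bq
        return B

def perm_matrix(formula):                                 # element-level permutation
    op = formula[0]
    if op == 'L':
        return stride_perm(2 ** formula[1], 2 ** formula[2])
    if op == 'I':
        return np.eye(2 ** formula[1], dtype=int)
    if op == '*':
        return perm_matrix(formula[1]) @ perm_matrix(formula[2])
    if op == 'x':
        return np.kron(perm_matrix(formula[1]), perm_matrix(formula[2]))

f = ('x', ('L', 2, 1), ('I', 1))          # L_{4,2} (x) I_2 on 8 elements
B, P = f_pi(f), perm_matrix(f)
n_bits = B.shape[0]
for old in range(2 ** n_bits):
    new = int(np.argmax(P[:, old]))       # where the element at address `old` lands
    x = np.array([(old >> k) & 1 for k in reversed(range(n_bits))])
    y = (B @ x) % 2
    assert new == sum(int(b) << (n_bits - 1 - k) for k, b in enumerate(y))
print("f_pi agrees with the element-level permutation")
```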

Swap permutation. We extend the reorganization operations to include different classes of permutations. First, we define the swap permutation Jn as follows:

din[i] → dout [n−i−1], for 0 ≤ i < n.

Jn is simply the In matrix with the rows in reversed order, as previously shown in Section 3.1.1. Swap permutations are especially useful to represent out-of-place transformations and swap-type operations. For example, J2 ⊗ In can be interpreted as simply swapping two consecutive n-element regions in the memory. The address remapping for the swap permutation is given as follows:

fπ (J2n ) → B = In, c = νn (3.7)

Morton layout. Morton data layouts, or in general space filling curves, are based on recursive blocking [85]. Large datasets are divided into blocks recursively until the small leaf blocks reach the desired size. There are various forms of Morton layouts depending on the order of blocking, while the most commonly used one is the Z-Morton ($Z_{2^n,2^m}$). The Z-Morton layout is useful for various scientific, high-performance and database applications [33, 92]. The Z-Morton layout can be represented by stride permutations only, so the properties (3.3)-(3.6) are sufficient to represent it. However, we present it as a specific case since its address remapping corresponds to a stride permutation on the address bits:

fπ (Z2n,2m ) → B = Ln,2 ⊗Im, c = µnm (3.8)

Conditional permutations. Some applications require different permutations on separate regions of their dataset. Also, in some cases, data reorganization is applied only on a portion of the dataset, keeping the rest unchanged. The direct sum operator (⊕) is useful to express these cases. The direct sum operator calls for a conditional, address-dependent remapping function:

$$ f_\pi(P_k \oplus Q_l) = \begin{cases} f_\pi(P_k) \to B = B_P,\ c = c_P & \text{for } 0 \le i < k \\ f_\pi(Q_l) \to B = B_Q,\ c = c_Q & \text{for } k \le i < k + l \end{cases} \qquad (3.9) $$

Here both of the B matrices, BP and BQ, are padded with the identity matrix (I) towards the most significant bit to match the height of x (and y), i.e. address bit-width. Similarly, the c vectors, cP and cQ, are padded with the zero vector (µ) again towards the most significant bit to match the address bit-width.

We also define a reverse direct sum operator, $\bar{\oplus}$, as follows:

$$ P \,\bar{\oplus}\, Q = \begin{bmatrix} & P \\ Q & \end{bmatrix} \quad \text{and} \quad \underbrace{P \,\bar{\oplus}\, P \,\bar{\oplus}\, \cdots \,\bar{\oplus}\, P}_{n} = J_n \otimes P $$

The index transformation for the reverse direct sum operator is given as:

$$ f_\pi(P_k \,\bar{\oplus}\, Q_l) = \begin{cases} f_\pi(Q_l) \to B = B_Q,\ c = c_Q & \text{for } 0 \le i < l \\ f_\pi(P_k) \to B = B_P,\ c = c_P & \text{for } l \le i < k + l \end{cases} \qquad (3.10) $$

Here, similar to the direct sum, both of the B matrices, BP and BQ, are padded with the identity matrix (I) towards the most significant bit to match the height of x (and y), i.e., the address bit-width. However, the c vectors, cP and cQ, are in this case padded towards the most significant bit with the one vector (ν).

Example. Now let's take a simple example data layout transform, (L4,2 ⊗ I2) ⊕ (I2 ⊗ J4), on a 16-element dataset. The dataflow representation is given in Figure 3.2 (note that the dataflow representation in Figure 3.2 is broken into two parts for demonstration purposes).


Figure 3.2: Dataflow representation of (L4,2 ⊗I2)⊕(I2 ⊗J4).

By using the fundamental properties of the fπ we derive the bit matrix for address remapping as follows:

$$ f_\pi\big((L_{4,2} \otimes I_2) \oplus (I_2 \otimes J_4)\big) \to \begin{cases} B = I_1 \oplus C_{2,1} \oplus I_1, & c = [\mu_1 \,|\, \mu_3], & \text{for } 0 \le i < 8 \\ B = I_1 \oplus I_1 \oplus I_2, & c = [\mu_1 \,|\, \mu_1 \,|\, \nu_2], & \text{for } 8 \le i < 16 \end{cases} $$

Here we padded the B matrices with the identity matrix (I1) to match their width with the height of the address vector. Similarly, we padded the c vectors with the zero vector (i.e., µ1), again to match their height with the address vector.

$$ f_\pi\big((L_{4,2} \otimes I_2) \oplus (I_2 \otimes J_4)\big) \to \begin{cases} B = \begin{bmatrix} 1&0&0&0 \\ 0&0&1&0 \\ 0&1&0&0 \\ 0&0&0&1 \end{bmatrix}, & c = \begin{bmatrix} 0\\0\\0\\0 \end{bmatrix}, & \text{for } 0 \le i < 8 \\[3ex] B = \begin{bmatrix} 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \end{bmatrix}, & c = \begin{bmatrix} 0\\0\\1\\1 \end{bmatrix}, & \text{for } 8 \le i < 16 \end{cases} $$

Now let's test the address remapping via the index transformation for the examples of accessing the elements at address 3 and address 11. For an access to address 3 (i.e., x = 0011) the index transformation is given as:

$$ y = Bx + c = \begin{bmatrix} 1&0&0&0 \\ 0&0&1&0 \\ 0&1&0&0 \\ 0&0&0&1 \end{bmatrix} \begin{bmatrix} 0\\0\\1\\1 \end{bmatrix} + \begin{bmatrix} 0\\0\\0\\0 \end{bmatrix} = \begin{bmatrix} 0\\1\\0\\1 \end{bmatrix} = 5 $$

Similarly, for an access to address 11 (i.e., x = 1011) the index transformation is given as:

$$ y = Bx + c = \begin{bmatrix} 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \end{bmatrix} \begin{bmatrix} 1\\0\\1\\1 \end{bmatrix} + \begin{bmatrix} 0\\0\\1\\1 \end{bmatrix} = \begin{bmatrix} 1\\0\\0\\0 \end{bmatrix} = 8 $$

Hence, based on the index transformations, accesses to locations 3 and 11 are remapped to 5 and 8, respectively. An analysis of Figure 3.2 demonstrates that these remappings are indeed correct.

Table 3.1: Summary of the address remapping rules.

$$ f_\pi(L_{2^m,2^n}) \to B = C_{m,n},\quad c = \mu_m \qquad (3.11) $$

$$ f_\pi(I_{2^n}) \to B = I_n,\quad c = \mu_n \qquad (3.12) $$

$$ f_\pi(J_{2^n}) \to B = I_n,\quad c = \nu_n \qquad (3.13) $$

$$ f_\pi(Z_{2^n,2^m}) \to B = L_{n,2} \otimes I_m,\quad c = \mu_{nm} \qquad (3.14) $$

$$ f_\pi(P \cdot Q) \to B = B_P \cdot B_Q,\quad c = c_P + c_Q \qquad (3.15) $$

$$ f_\pi(P \otimes Q) \to B = B_P \oplus B_Q,\quad c = \begin{bmatrix} c_P \\ c_Q \end{bmatrix} \qquad (3.16) $$

$$ f_\pi(P_k \oplus Q_l) \to \begin{cases} f_\pi(P_k) & \text{for } 0 \le i < k \\ f_\pi(Q_l) & \text{for } k \le i < k + l \end{cases} \qquad (3.17) $$

$$ f_\pi(P_k \,\bar{\oplus}\, Q_l) \to \begin{cases} f_\pi(Q_l) & \text{for } 0 \le i < l \\ f_\pi(P_k) & \text{for } l \le i < k + l \end{cases} \qquad (3.18) $$

3.4 A Formula Rewrite System for Permutations

This section describes the key components of the SPL formula rewrite system for permutations. Chapter 3. A Formal Approach to Memory Access Optimization 41

3.4.1 Permutation Rewriting

Permutations described as formulas are restructured via formula identities, as defined in Section 3.1.2. The applied formula identities change the dataflow and create a restructured algorithm while keeping the ultimate core computation the same. For example, by restructuring a permutation formula, memory access patterns can be optimized for locality and parallelism. The proposed formula rewrite system uses the formula identities as rewrite rules.

Table 3.2: Basic formula identities.

$$ (AB)^T = B^T A^T \qquad (3.19) $$

$$ (A \otimes B)^T = A^T \otimes B^T \qquad (3.20) $$

$$ I_{mn} = I_m \otimes I_n \qquad (3.21) $$

$$ A_n \otimes B_m = (A_n \otimes I_m)(I_n \otimes B_m) \qquad (3.22) $$

$$ A \otimes (BC) = (A \otimes B)(A \otimes C) \qquad (3.23) $$

$$ A_n \otimes B_m = L_{mn,n}(B_m \otimes A_n)L_{mn,m} \qquad (3.24) $$

$$ (L_{mn,m})^{-1} = L_{mn,n} \qquad (3.25) $$

Table 3.3: Key permutation rewrite rules.

$$ L^{nm}_{n} = (L^{nm/k}_{n} \otimes I_k)(I_{nm/k^2} \otimes L^{k^2}_{k})(I_{m/k} \otimes L^{n}_{n/k} \otimes I_k) \qquad (3.26) $$

$$ L^{nm}_{m} = (I_{m/k} \otimes L^{n}_{k} \otimes I_k)(I_{nm/k^2} \otimes L^{k^2}_{k})(L^{nm/k}_{m/k} \otimes I_k) \qquad (3.27) $$

$$ L^{nk}_{n} = (I_{n/k} \otimes L^{k^2}_{k})(L^{n}_{n/k} \otimes I_k) \qquad (3.28) $$

$$ L^{nk}_{k} = (L^{n}_{k} \otimes I_k)(I_{n/k} \otimes L^{k^2}_{k}) \qquad (3.29) $$

$$ L^{kmn}_{km} = (I_k \otimes L^{mn}_{m})(L^{kn}_{k} \otimes I_m) \qquad (3.30) $$

$$ L^{kmn}_{n} = (L^{kn}_{n} \otimes I_m)(I_k \otimes L^{mn}_{n}) \qquad (3.31) $$

The set of basic formula identities, or rewrite rules, is given in Table 3.2. In addition to the basic rewrite rules, Table 3.3 provides the key formula identities that restructure the permutations. Hence, given a permutation expressed as an SPL formula, the rewrite system uses the set of rules given in Table 3.2 and Table 3.3 to rewrite the permutation and produce restructured algorithms. Moreover, it uses the set of rules given in Table 3.1 to derive the corresponding address remapping for the permutation.
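As a concrete illustration, one of the key rules can be checked numerically at a small size; the following editorial sketch verifies rule (3.28) for n = 4 and k = 2, using the stride_perm helper introduced earlier (repeated here for self-containment).

```python
import numpy as np

def stride_perm(nm, s):
    # L^{nm}_s (size nm, stride s): d_in[i*s + j] -> d_out[j*(nm//s) + i]
    m = nm // s
    P = np.zeros((nm, nm), dtype=int)
    for i in range(m):
        for j in range(s):
            P[j * m + i, i * s + j] = 1
    return P

def I(k):
    return np.eye(k, dtype=int)

# Rule (3.28): L^{nk}_n = (I_{n/k} (x) L^{k^2}_k)(L^{n}_{n/k} (x) I_k), for n = 4, k = 2.
n, k = 4, 2
lhs = stride_perm(n * k, n)
rhs = np.kron(I(n // k), stride_perm(k * k, k)) @ np.kron(stride_perm(n, n // k), I(k))
print(np.array_equal(lhs, rhs))   # True
```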

3.4.2 Labelled Formula

The rewriting system can be used to restructure the permutations expressed in SPL for a variety of goals in different contexts. After the rewrite operation, special labels are attached to formula constructs which represent the implied functionality in the implementation. In this thesis we will use the following labels: $\overrightarrow{(\cdot)}$, $\overleftarrow{(\cdot)}$, $\tilde{\otimes}$, $\overline{(\cdot)}$ and $\underline{(\cdot)}$.

The label $\overrightarrow{(\cdot)}$ represents a permutation corresponding to the data layout and $\overleftarrow{(\cdot)}$ represents the permutation corresponding to the address mapping. These constructs are explicitly labelled for the cases where the data layout is changed transparently as a part of an optimized algorithm (e.g., block layout FFTs [21]).

Furthermore, $\tilde{\otimes}$ corresponds to the iteration operator. It generally appears in the form $I_\ell \,\tilde{\otimes}\, \overline{(\cdot)}$, where the right-hand operand is applied $\ell$ times. Here, $\overline{(\cdot)}$ represents a kernel that will be performed locally, i.e., a local permutation.

Finally, $\underline{(\cdot)}$ represents a permutation that is performed through memory accesses. These constructs generally appear as the leftmost and rightmost operands, and determine the memory access patterns.

Summary. Permutations can represent memory access optimizations through transforming the data layout, access pattern or address mapping. The formal framework provides a set of formula identities to manipulate the permutations expressed as formulas. These formula identities are used as rules in a rewrite system. This enables restructuring permutations with various goals, including optimizing access patterns, transforming the data layout or changing the address mapping. As we will see, the formal approach helps to automate this process through integration into an existing rewrite system, Spiral. Next, we will demonstrate the rewrite system in action through an example case study.

3.4.3 Case Study

In this case study, we demonstrate permutation rewriting examples for a data layout transform and memory access pattern optimization. We assume a given application with a fixed, problematic memory access pattern on a given memory system. We assume that the memory system includes a custom hardware substrate that can perform high throughput local permutations. The goal is to demonstrate that, without changing the application, just by reorganizing the data in memory using the custom hardware substrate, the performance and the energy efficiency of the application can be improved. We will also show that the additional data reorganization operation itself can be performed very efficiently by optimizing its memory access patterns.

Memory system. We assume a 3D-stacked DRAM system with the following configuration:

• 4 layers, 8 vaults, 256 TSV/vault, 1 KB DRAM pages, 8 SerDes links.

• Internal stream bandwidth: 516.9 GB/s @ 22.8 Watts (5.51 pj/bit)

• External stream bandwidth: 320.0 GB/s @ 30.7 Watts (12.0 pj/bit)

We assume that the host processor is connected to this memory using external SerDes based links which provide 320 GB/s of maximum bandwidth. We also assume that a specialized hardware accelerator for data reorganization is integrated in the memory, which can perform local permutations of up to 1 million elements, where each element is a byte. This unit can sustain the maximum internal bandwidth of 516.9 GB/s.

Application. We are given an application with fixed memory access pattern, say P. Let’s also assume that P = L32M,4k (where M=million, k=thousand elements). When we simulate the memory performance of P, we get:

P → 27.2 GB/s @ 3.98 Watts (18.33 pj/bit) (external)

We observe that the application reaches very low bandwidth utilization and energy efficiency compared to the streaming case. This is mainly due to inefficient access patterns that do not utilize the locality and parallelism of the memory system. Since the access patterns are assumed to be fixed, we will perform a data reorganization such that the given access patterns are a good match for the reorganized data layout.

After reorganization. Let's assume it is determined that R will be the reorganization permutation, where R = L32M,8k. R will reorganize the data layout of the application such that, after the reorganization, the resulting access pattern will correspond to P′ = P × R where P × R = I32M. The memory performance after the reorganization will be:

P′ → 320.0 GB/s @ 30.73 Watts (12.01 pj/bit) (external)

With the reorganized data layout, the application can reach the bandwidth and energy efficiency of the streaming case.

Reorganization operation. The data reorganization is performed separately to optimize the data layout. The memory performance of the reorganization, R = L32M,8k, is simulated as:

R → 28.3 GB/s @ 2.67 Watts (11.83 pj/bit) (internal)

We observe that the reorganization operation itself has very poor bandwidth and energy efficiency when implemented by definition. Although the reorganized data layout provides significant improvements, the overhead of the data reorganization overshadows the improvements in this form.

Optimized data reorganization. Instead of implementing R directly, we restructure it by using rewrite rule (3.26) (see Table 3.3) such that

$$ R = L_{32M,8k} = \underline{(L_{32k,8k} \otimes I_{1k})}\,\big(I_{32} \,\tilde{\otimes}\, \overline{L_{1M,1k}}\big)\,\underline{(I_4 \otimes L_{8k,8} \otimes I_{1k})} $$

In this restructured form, $\overline{(\cdot)}$ corresponds to the local permutation that needs to be performed by the specialized hardware accelerator for data reorganization in the logic layer. We observe that the hardware accelerator can support the required size of the local permutation L1M,1k (i.e., 1 million elements). Furthermore, $\underline{(\cdot)}$ corresponds to the memory access permutations that are performed by transferring data from/to the DRAM layers. In this restructured form, the memory access permutations reach the maximum internal streaming bandwidth utilization:

(L32k,8k ⊗I1k) → 516.0 GB/s @ 22.7 Watts (5.51 pj/bit) (internal)

(I4 ⊗L8k,8 ⊗I1k) → 516.0 GB/s @ 22.7 Watts (5.51 pj/bit) (internal)

Hence in this optimized form, memory access permutations can be performed at the maximum internal bandwidth rate where the accesses are generated by the hardware accelerator in the logic layer. We also assumed that the hardware accelerator can sustain the streaming data rate at the maximum internal streaming bandwidth while performing the local permutation. Therefore, the data reorganization can be performed very efficiently, i.e. at the throughput that matches the maximum internal streaming bandwidth, minimizing the high overhead of the original version.

Summary. We demonstrated that, given an application with fixed, inefficient memory access patterns on a given memory system, data layout reorganization in the memory can improve the overall performance and energy efficiency. After the reorganization, previously inefficient memory access patterns are transformed into efficient accesses which yield high bandwidth and energy efficiency. Furthermore, the optimized reorganization operation itself can be performed efficiently, which amortizes the overhead.

3.5 Practical Implications

This section discusses the important features, interpretations and the practical implications of the developed mathematical framework.

Unified Framework. The developed mathematical framework serves as a unified language to express algorithms, capture the target architecture machine model, and represent optimizations through formula rewrite rules. In the data reorganization context, it allows expressing address mapping, data layout, memory access pattern, control flow as well as memory system parameters and behavior (e.g., bank/rank/vault/layer parallelism, row-buffer locality, etc.). Given these representations and constraints, it serves as an optimization substrate through formula rewriting.

Efficient Address Remapping. The properties of the fπ function given in Table 3.1 demonstrate a structured way of deriving the index transformation, or address remapping, for a given permutation. The index transformation captures the address remapping information for the entire dataset in a closed form expression (i.e., y = Bx + c). This allows calculating the remapped addresses on the fly, instead of keeping a large lookup table or updating the corresponding page table entries via OS calls. Furthermore, the B matrix is a permutation matrix and the c vector is a binary vector. Hence, given an input address x, B shuffles the bits and c inverts them if necessary to produce y. Bit shuffle and inversion can be implemented in hardware at a very low cost.
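The hardware cost argument is easy to see in software terms as well. The sketch below is an editorial illustration (the 6-bit configuration values are hypothetical): it computes y = Bx + c with nothing more than bit selects and an XOR mask, which is the kind of logic a bit shuffle unit implements.

```python
def make_remap(src_bit, xor_mask, n_bits):
    """y = Bx + c as pure bit operations: output bit r takes input bit src_bit[r]
    (the bit shuffle B), and xor_mask applies the conditional inversion c."""
    def remap(x):
        y = 0
        for r in range(n_bits):
            y |= ((x >> src_bit[r]) & 1) << r
        return y ^ xor_mask
    return remap

# Hypothetical 6-bit remapping: rotate the address bits left by one and flip bit 0.
remap = make_remap(src_bit=[5, 0, 1, 2, 3, 4], xor_mask=0b000001, n_bits=6)
print(f"{remap(0b000001):06b}")   # 000011
print(f"{remap(0b100000):06b}")   # 000000
```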

Inverse Problem. There exists an inverse fπ function that takes the index transform and derives the corresponding data reorganization as a permutation. Hence, to achieve a particular address remapping, it can derive the corresponding data permutation which can be optimized and performed efficiently.

Efficient Memory Access. The set of rewrite rules given in Table 3.2 and Table 3.3 demonstrates a structured way of restructuring the memory access patterns of permutations. Permutations can be restructured by targeting various goals, including memory parallelism and locality, which provides an effective means of optimizing the access patterns for bandwidth utilization, latency and energy efficiency.

Generalization. Permutations are the basic building blocks of the data reorganization operations. The developed framework demonstrates a generalized way of handling permutations. Furthermore, it includes a set of rules which allows a systematic way of combining various building blocks for generalization. Although the developed techniques are limited to permutations only, this thesis demonstrates that they can cover a wide range of reorganization operations.

Automation. The mathematical formalism enables using the SPL domain specific language to express the data reorganizations. Furthermore, it enables inclusion of the developed rewrite rules into the existing rewriting system of Spiral [98]. This provides an automated approach to rewrite, optimize and compile SPL expressions, which will be discussed next.

3.6 Spiral Based Toolchain

Spiral automatically generates high-performance hardware/software implementations for linear transforms. The generated implementations are automatically tuned to a given target platform through formula rewriting and auto-tuning. Target platforms are abstracted via so-called paradigms. These paradigms include multi-core parallelism [54], SIMD vectorization [79], streaming hardware generation [83], etc.

In this thesis, however, we view Spiral mainly as an SPL compiler and formula rewrite system. We integrate the developed mathematical framework into the rewrite system of Spiral and extend it with a customized backend. Our main goals include: (i) optimizing (rewriting) the reorganization operations to exploit memory parallelism and locality, and (ii) deriving the address remapping for a given reorganization.

A high level overview of the toolchain is given in Figure 3.3. Here, block 1 parses the high level function calls into the native format (SPL). Given the native representation of the permutations in SPL, block 2 derives the address remapping function (i.e., B and c). Finally, block 3 rewrites the permutations to optimize for memory locality and parallelism. The optimized final form of the permutation specifies the local permutation, the memory access permutations, as well as the control flow through labelled formula constructs, as discussed previously.

Figure 3.3: Overview of the mathematical framework toolchain.

Later in the thesis we will include another goal for utilizing the Spiral based toolchain: determining the accelerator configurations. Key parameters from the labelled constructs, such as the local permutation, memory access permutation and index transformation, are packed into configuration words and offloaded to program the HAMLeT unit in memory.

Chapter 4

HAMLeT Architecture for Data Reorganization

The previous chapter presented a mathematical foundation to represent and optimize the memory system performance through transforming the data layout, memory access pattern and address mapping. Performing these optimizations dynamically on conventional processing platforms is both difficult and costly. This chapter presents an architectural substrate to perform these optimizations, and their side effects, very efficiently. In particular, this chapter presents the HAMLeT (Hardware Accelerated Memory Layout Transform) architecture for highly-concurrent, energy-efficient and low-overhead data reorganization performed in memory. The HAMLeT architecture is targeted for integration into a 3D-stacked DRAM, in the logic layer, directly interfaced to the local controllers. The architecture design is driven by the fundamental requirements and the implications from the mathematical framework as well as the existing infrastructure, operation and organization of the 3D-stacked DRAM. A portion of the proposed HAMLeT architecture is presented in accompanying publications [22, 23].

The chapter starts with the microarchitectural design of HAMLeT, focusing on its fundamental components. First, Section 4.1 presents the data reorganization unit (DRU) and discusses its key components, including the DMA units and the reconfigurable permutation memory. Then, Section 4.2 presents the address remapping unit (ARU) that handles general address remapping, which is the essential requirement for completely hardware based data reorganization. Finally, Section 4.3 demonstrates how to leverage the formal framework developed in Chapter 3 as a toolchain for programming HAMLeT.

Figure 4.1: Data reorganization unit (DRU).

4.1 Data Reorganization Unit

The data reorganization unit (DRU) is the main component of the HAMLeT architecture. Briefly, the DRU streams data from DRAM, applies a local permutation, and streams data back to DRAM in chunks. This operation is repeated to reorganize big datasets. A high level overview of the DRU is shown in Figure 4.1. As discussed in Chapter 3, the data reorganization is captured by two types of permutations: (i) local permutations and (ii) memory access permutations. The local permutation is performed by the reconfigurable permutation memory (RPM) highlighted in Figure 4.1, whereas memory access permutations are performed by streaming data to/from DRAM with the help of specialized read/write DMA units. The DMA units and the RPM control units (i.e., connection and address generators) are programmed to perform various permutations.

4.1.1 Reconfigurable Permutation Memory

The main goal of the DRU is to permute the streaming data at a throughput that matches the maximum internal bandwidth of the 3D-stacked DRAM. Permuting streaming data that arrives as multiple elements per cycle is a non-trivial task. The reconfigurable permutation memory (RPM) in the DRU mainly adopts the solution from [99] and extends it to support runtime reconfigurability, as we will see later.

Exploiting both parallelism and locality within the memory requires permutations both in time and in space. The DRU locally buffers and permutes data in chunks. It features w parallel SRAM banks and w × w switch networks, where w is the number of vaults (see the highlighted parameters in Figure 4.1). It can stream w words of p bits every cycle in parallel, where p is the data width of the TSV bus. Independent SRAM banks per vault utilize the inter-vault parallelism. It also exploits the intra-vault parallelism via pipelined scheduling of the commands to the layers in the vault. In addition to the parallel data transfer, it also exploits the DRAM row buffer locality. Each SRAM bank can buffer multiple DRAM rows. In a reorganization operation, elements within a DRAM row can be scattered to multiple rows, the worst case being full scattering. Assuming that the DRAM row buffer holds k elements, each SRAM is sized to hold k² elements. This allows transferring k consecutive elements in both read and write permutations, exploiting the row buffer locality. We also employ double buffering to maximize the throughput. Hence the SRAM buffer size is given as d = k², and the total storage is 2wd.

Spiral Permutation Memory. Spiral supports the automatic hardware generation for the stride permutation family [99] as well as arbitrary permutations [82]. Specifically, [99] supports any local permutation whose index transformation is also a permutation on the bit representation. However, the number of stages in the switch network, the connections between switches, and the SRAM address generation control are configured and optimized specifically for the given permutation at hardware compilation time. Hence, when a permutation memory is generated, it is configured for the target permutation and is limited in supporting different permutations at runtime.

Assume that we want to perform a local permutation P, where BP is its bit representation (i.e., index transformation). The overall operation, y = BPx, can be performed in two stages, the write-stage z = Mx and the read-stage y = N⁻¹z, where

$$ B_P = N^{-1} M. $$

Here, M determines how the streaming data is stored into the SRAMs (the write-stage), and N⁻¹ determines how it is read out of the SRAMs into the resulting data stream. Püschel et al. demonstrated that the M and N matrices can be further decomposed [99]. The decomposition for the write stage is shown below:

$$ \begin{pmatrix} z_t \\ z_b \end{pmatrix} = \begin{bmatrix} M_4 & M_3 \\ M_2 & M_1 \end{bmatrix} \begin{pmatrix} x_t \\ x_b \end{pmatrix} $$

Here, xt and xb correspond to the streaming stage number and the position within the stage, respectively. In other words, the combined {xt, xb} represents the unique number of the particular element in the permutation order. Similarly, zt and zb represent the address within the RAM and the corresponding RAM number, respectively.

Thus, the address at which each element is written/read (i.e., zt) and the routing of each element to/from the RAM banks (i.e., zb) can be calculated as follows:

zt = M4xt +M3xb (4.1)

zb = M2xt +M1xb (4.2)

Analogous to the write stage, the read stage (i.e., y = N⁻¹z) can be expanded as well (refer to [99] for details).
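Because all of these matrices are bit matrices, the per-element address and bank computations of Equations (4.1) and (4.2) reduce to GF(2) matrix-vector products. The sketch below is an editorial illustration with a deliberately trivial, hypothetical configuration (M equal to the identity, w = 4 banks), chosen only to show the mechanics rather than any configuration used by HAMLeT.

```python
import numpy as np

def gf2_matvec(M, x):
    # Bit-matrix times bit-vector over GF(2); in hardware this is a shuffle plus XORs.
    return (M @ x) % 2

def bits(value, width):
    return np.array([(value >> k) & 1 for k in reversed(range(width))])

def to_int(bit_vec):
    return sum(int(b) << (len(bit_vec) - 1 - k) for k, b in enumerate(bit_vec))

# Hypothetical configuration: 4 banks, 4 lanes per streaming cycle; with M = I,
# M4 = M1 = I_2 and M3 = M2 = 0, the lane selects the bank and the cycle the address.
I2, Z2 = np.eye(2, dtype=int), np.zeros((2, 2), dtype=int)
M4, M3, M2, M1 = I2, Z2, Z2, I2

for element in range(8):                                   # first 8 stream elements
    xt, xb = bits(element // 4, 2), bits(element % 4, 2)   # cycle, lane
    z_t = (gf2_matvec(M4, xt) + gf2_matvec(M3, xb)) % 2    # SRAM address, Eq. (4.1)
    z_b = (gf2_matvec(M2, xt) + gf2_matvec(M1, xb)) % 2    # SRAM bank,    Eq. (4.2)
    print(f"element {element}: bank {to_int(z_b)}, address {to_int(z_t)}")
```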

The Spiral framework further processes and decomposes these sub-matrices to derive the connectivity and cost metrics. Ultimately, it derives an optimal (or partially optimal) configuration for the permutation memory and generates its RTL hardware implementation. Hence, the hardware implementation is specifically generated for the given permutation and it cannot perform any other permutation.

Runtime Reconfiguration. Our goal is to provide a substrate that can be reconfigured at runtime to perform any permutation, constrained only by the maximum supported size and streaming width.

Figure 4.2: Overview of the reconfigurable permutation memory.

To this end, we extend the Spiral permutation memory via (i) a rearrangeably non-blocking Clos network implemented as a multistage network, and (ii) programmable switch connection and address generators. A high-level overview of the reconfigurable permutation memory (RPM) is given in Figure 4.2. Moreover, by utilizing the formal framework, we automate the derivation of the RPM configuration parameters. Given a local permutation, the Spiral based toolchain derives the configuration for the RPM control units (i.e., connection and address generators). These configurations are offloaded to program the RPM.

First, the reconfigurable permutation memory employs (i) rearrangeably non-blocking Clos networks as the input and output switch networks. As discussed previously, these switch networks implement the routing between the input/output ports and the SRAM banks, which requires all-to-all communication to support any permutation (whose index transform is a bit permutation). For a rearrangeably non-blocking Clos network there exists a switch configuration such that any unused input can be routed to any unused output [31], but some of the existing routings may have to be rearranged. In our case, however, this is not a critical problem since the switch configuration is changed every cycle to meet the streaming data throughput.

In our solution, the rearrangeably non-blocking Clos network is implemented as a multi-stage network. Two examples of 8x8 rearrangeably non-blocking multi-stage switch networks are shown in Figure 4.3. Both Benes and omega-flip (i.e., omega and inverse-omega combined) networks provide all-to-all connection at a low hardware cost [31].

Figure 4.3: Two examples of 8x8 rearrangeably non-blocking multi-stage switch networks: (a) 8x8 Benes network, (b) 8x8 omega and inverse-omega network.

Such networks have 2 log2(N) − 1 stages where each stage contains N/2 switches, for a total of N log2(N) − N/2 switches. The asymptotic switch complexity is O(N log(N)). The examples given in Figure 4.3 have 2 log2(8) − 1 = 5 stages where each stage has 8/2 = 4 switches, adding up to 5 × 4 = 20 switches in total.

The main disadvantage of using a multi-stage network is the latency. However, in the streaming permutation problem, throughput is the main concern, not latency. A Benes (or equivalently omega-flip) network provides the throughput to permute the streaming data every cycle, where the switch stages are pipelined if necessary for large streaming widths. A crossbar switch, on the other hand, can provide a strict-sense non-blocking interconnection such that any unused input can connect to any unused output without any rearrangement of the existing connections. However, as the number of input/output nodes increases, the hardware complexity increases dramatically. The asymptotic switch cost is O(N²) for the crossbar switch. Assuming that the switch network is integrated in a 3D-stacked DRAM system with several vaults (e.g., HMC 2.0 has 32 vaults), a crossbar switch becomes too costly.
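The cost gap is easy to quantify; the short editorial sketch below tabulates the switch counts from the formulas above for a few network sizes.

```python
from math import log2

def benes_cost(N):
    # Rearrangeably non-blocking multi-stage network (Benes / omega-flip):
    # 2*log2(N) - 1 stages, each with N/2 two-by-two switches.
    stages = int(2 * log2(N)) - 1
    return stages, stages * (N // 2)

for N in (8, 16, 32):
    stages, switches = benes_cost(N)
    print(f"N={N}: {stages} stages, {switches} 2x2 switches; "
          f"crossbar needs {N * N} crosspoints")
```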

Secondly, the permutation memory datapath is controlled by (ii) the connection and SRAM address generator units, as shown in Figure 4.2. These units generate, every cycle, the required switch connection configuration and SRAM addresses that will route the data elements to and from the SRAM banks. As discussed previously, given a target local permutation, the Spiral infrastructure decomposes and processes the permutation matrix to derive an optimal (or partially optimal) structure (i.e., number of network stages, switch connections, address generation scheme, etc.) for the permutation memory.

Figure 4.4: Permutation memory compilation flow: (a) current flow in Spiral [99], (b) proposed flow for the reconfigurable permutation memory.

Then it generates an optimized but fixed hardware implementation. In our framework, we utilize the Spiral frontend only to derive the sub-matrices M1, M2, M3, M4, N1, N2, N3, N4 from the M and N matrices, where BP = N⁻¹M and BP is the bit representation of the given local permutation P. Then, as opposed to further decomposition for the optimal hardware structure compilation, the sub-matrices are packed and directly offloaded as-is to the RPM to configure the address/connection generators. These control units can be reconfigured to perform different permutations. The current permutation memory synthesis flow and the proposed flow are demonstrated in Figure 4.4(a) and Figure 4.4(b), respectively.

SRAM addressing and switch connection configuration are calculated using the submatrices (Ni and Mi), where Equations 4.1 and 4.2 demonstrate the case for the read network. Mathematically, these calculations require two matrix-vector multiplications and one matrix addition. These operations can be performed very efficiently considering that the matrices are actually bit matrices. The matrix-vector multiplications (i.e., Ni*x and Mi*x) correspond to a bit permutation of the input vector. This is achieved by a single-bit crossbar switch where the submatrices (Ni and Mi) determine the crossbar configuration (i.e., the set of connections) to permute the input vector ({xt, xb}). Then, an XOR stage handles the addition of the permuted bit vectors. When these structures are put together, they construct the address/connection generators shown in Figure 4.5. Here, the input vectors xt and xb, when combined (i.e., {xt, xb}), represent the index of the particular element in the permutation order, which is tracked by dedicated hardware counters. As the elements progress through the RPM, the counters are incremented and new routing/addressing is calculated for each element.

Figure 4.5: Detailed view of the SRAM address and connection generator units.
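The bit-matrix arithmetic behind these generators can be modeled in a few lines of Python. This is a behavioral sketch only; the submatrix values and vector widths below are made-up placeholders rather than ones Spiral would emit, and the point is merely that Ni*x and Mi*x reduce to bit permutations followed by an XOR stage.

# Behavioral model of one address/connection generator slice (GF(2) arithmetic).

def gf2_matvec(matrix, vec):
    # Multiply a bit matrix by a bit vector over GF(2).
    # For a permutation matrix this is just a bit permutation of vec.
    return [sum(m & v for m, v in zip(row, vec)) & 1 for row in matrix]

def xor_vec(a, b):
    return [x ^ y for x, y in zip(a, b)]

def sram_address(M2, M1, xt, xb):
    # Address for one side of the RPM: M2*xt XOR M1*xb
    # (the other side uses the N submatrices in the same way).
    return xor_vec(gf2_matvec(M2, xt), gf2_matvec(M1, xb))

# Hypothetical 2x2 permutation submatrices and 2-bit counter values.
M2 = [[0, 1], [1, 0]]      # permutation matrix: swaps the two bits of xt
M1 = [[1, 0], [0, 1]]      # identity on xb
xt, xb = [1, 0], [1, 0]
print(sram_address(M2, M1, xt, xb))   # -> [1, 1]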

4.1.2 DMA Units

DMA units are responsible for streaming the input data from DRAM to the local permutation unit and the locally permuted data back to DRAM (see Figure 4.1). They ensure a continuous flow of data through the DRU. Besides interfacing and communicating with the DRAM controllers, DMA units perform the memory access permutations.

The DRU features two distinct DMA units for read and write. The RPM continuously streams data using double buffering; hence the read and write DMA units must stream the input/output data continuously. Outputs from the formal framework contain separate memory access permutations for reading and writing. These permutations are mapped onto the corresponding DMA units.

DMA units generate the memory addresses based on the targeted memory access permutations. Assuming that a DMA unit performs a memory access permutation P, the sequence of addresses is determined by the index transformation of P. As shown in Figure 4.6, the DMA unit features counters coupled with bit shuffle units (BSUs) which implement the index transformation. Assuming the index transformation is given by y = Bx + c, the B matrix and c vector directly configure the BSUs in the DMA units.

A DMA unit contains multiple counters and BSUs to generate multiple addresses every cycle. The number of counter-BSU pairs is determined by the maximum number of requests the DRAM system can service. For a 3D-stacked DRAM, the number of parallel requests serviced is determined by the concurrency of the vaults/layers in the stack. Multiple requests are generated by stride counters with sequential offsets, as shown in Figure 4.6.

Figure 4.6: DRAM address generator in a w-wide DMA unit.
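A behavioral sketch of such a w-wide address generator is given below (Python, illustrative only). The bit permutation B is encoded as one source-bit index per output bit, which is just one possible encoding; each lane owns a stride-w counter with a distinct offset, and every counter value is pushed through the y = Bx + c index transformation.

# Behavioral model of a w-wide DMA address generator (y = B*x + c over bits).

def remap(x, perm, c, nbits):
    # perm[i] gives the source bit of output bit i (B as a bit permutation);
    # c is the XOR mask applied after the permutation.
    y = 0
    for i in range(nbits):
        y |= ((x >> perm[i]) & 1) << i
    return y ^ c

def dma_addresses(perm, c, nbits, w, beats):
    # w lanes: lane k issues indices k, k+w, k+2w, ... (stride-w counter, offset k).
    for t in range(beats):
        yield [remap(t * w + k, perm, c, nbits) for k in range(w)]

# Hypothetical 4-bit address space, identity permutation, no XOR mask, 2 lanes.
for beat in dma_addresses(perm=[0, 1, 2, 3], c=0, nbits=4, w=2, beats=4):
    print(beat)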

Eliminating redundant data movement. The proposed DMA unit supports transferring data for any permutation specified by its bit representation. Following directly from the permutation description, permuting a dataset requires reading and writing all of its elements. However, some of the elements are eventually written into their original locations after the reorganization. For example, in a matrix transpose operation the elements on the diagonal do not change. Although the elements that stay at their original locations are a tiny fraction of a large matrix under transposition, they can constitute a large fraction for some data reorganizations. To understand this in detail, let us focus on a data reorganization whose corresponding bit permutation is given in Figure 4.7.

Figure 4.7: Bit permutation of an example data reorganization (bits b0 and b1 are swapped; the remaining n − 2 bits are unchanged).

Figure 4.7 demonstrates a bit permutation where only a single bit pair is swapped. When the original and reorganized locations are the same, the corresponding elements stay at their original locations. Table 4.1 lists the cases for the bit permutation given in Figure 4.7 and analyzes whether the original and remapped locations are the same.

Table 4.1: Original and remapped locations for the bit permutation given in Figure 4.7.

    Original (b1 b0 rest) | Remapped (b1 b0 rest) | Same?
    0  0  x               | 0  0  x               | yes
    0  1  x               | 1  0  x               | no
    1  0  x               | 0  1  x               | no
    1  1  x               | 1  1  x               | yes

We observe that half of the dataset is remapped into its original locations. Furthermore, this can be generalized to data reorganizations where s bits are swapped in the index transformation. Without providing a formal proof, we state that when s bits are swapped in the index transformation, (1/2)^s of the elements are mapped into their original locations. Following the matrix transposition example, let us assume a 2^n x 2^n element square matrix. Transposition of this matrix is captured by the permutation L_{2^{2n}, 2^n}, whose bit permutation is given by the circular shift C_{2n,n} in which n bit pairs are swapped. Hence, (1/2)^n of the elements are mapped onto their original locations. In fact, the 2^n diagonal elements of the 2^n x 2^n matrix correspond to (1/2)^n of its total elements.

This allows further optimizing the data reorganizations by eliminating these redundant data movements. Especially when the reorganization is performed in-place, these redundant movements, where the original and remapped locations are the same, can be completely eliminated. On the other hand, for out-of-place reorganizations these redundant movements can bypass the DRU and be directly streamed into their remapped locations. The DMA units detect whether the bit permutation for the data reorganization has a limited number of bit swaps. In such cases, they eliminate (or optimize) a considerable portion of the data movements.
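The (1/2)^s claim is easy to check empirically. The Python sketch below is a brute-force check over a small, hypothetical address space: it counts how many addresses are mapped to themselves by a bit permutation with s swapped bit pairs.

# Fraction of addresses left in place by a bit permutation with s swapped pairs.

def apply_bit_perm(x, perm, nbits):
    # perm[i] is the source bit of output bit i.
    return sum(((x >> perm[i]) & 1) << i for i in range(nbits))

def fixed_fraction(perm, nbits):
    total = 1 << nbits
    fixed = sum(1 for x in range(total) if apply_bit_perm(x, perm, nbits) == x)
    return fixed / total

# 6-bit addresses: s = 1 swap (bits 0 and 1), then s = 3 swaps (a circular
# shift by 3, i.e., the transpose of an 8x8 matrix of single elements).
print(fixed_fraction([1, 0, 2, 3, 4, 5], 6))      # -> 0.5   = (1/2)^1
print(fixed_fraction([3, 4, 5, 0, 1, 2], 6))      # -> 0.125 = (1/2)^3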

4.2 Address Remapping Unit

This section presents the address remapping unit (ARU) which performs the index transformation of the permutation based data reorganizations. Chapter 3 demonstrates that the index transformation corresponds to the address remapping for a data reorganization. It also presents a set of rules to systematically determine the address remapping as a closed-form expression. One crucial implication of this property is that it allows calculating the remapped address on-the-fly. This eliminates the need for costly hardware look-up tables or OS page table updates to keep track of the data remappings. Moreover, if implemented in hardware, it allows a software-transparent data reorganization where the address remapping is completely handled in hardware. The goal is to present an efficient and scalable hardware substrate that handles the address remapping by implementing the closed-form index transformation expressions.

4.2.1 Bit Shuffle Unit

The address remapping operation (i.e., y = Bx + c) consists of two main parts. First, the B matrix permutes the input address bits x, then the c vector inverts the permuted bits. To support that functionality we propose the bit shuffle unit (BSU), a single-bit crossbar switch connected to an array of XOR gates, as shown in Figure 4.8. In Section 3 we saw that B is a permutation matrix, thus there is only a single non-zero element per row or column. The location of the non-zero element in each row of the matrix determines the closed switch location in the corresponding row of the crossbar. The crossbar configuration, i.e., the set of closed switch locations, is stored in a configuration register which can be reconfigured to change the bit mapping. After the input address x is permuted via the crossbar, the XOR array inverts the bits according to the c vector to generate the remapped address y. Hence, when configured according to a particular address remapping (B and c), the BSU will forward the input addresses to the remapped locations. Overall, this unit can implement any address remapping where B is a bit permutation and c is a bit vector, i.e., all combinations of the rules (3.11)–(3.16) from Table 3.1.

Figure 4.8: Bit shuffle unit (BSU).
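A minimal software model of the BSU is sketched below in Python (illustrative only). The B matrix is stored as the list of closed-switch positions, one per crossbar row, and the c vector is the XOR mask applied by the gate array; the 8-bit width and the particular remapping are arbitrary.

# Software model of the bit shuffle unit (BSU): y = B*x + c over GF(2).

class BSU:
    def __init__(self, closed_switch, c, nbits):
        # closed_switch[i]: column of the closed switch in crossbar row i,
        # i.e., which input address bit drives output bit i (the B matrix).
        self.closed_switch = closed_switch
        self.c = c                       # XOR mask (the c vector)
        self.nbits = nbits

    def remap(self, x):
        y = 0
        for i in range(self.nbits):
            y |= ((x >> self.closed_switch[i]) & 1) << i
        return y ^ self.c                # XOR array inverts selected bits

# Hypothetical 8-bit remapping: swap bits 2 and 5, invert bit 0.
bsu = BSU(closed_switch=[0, 1, 5, 3, 4, 2, 6, 7], c=0b00000001, nbits=8)
print(bin(bsu.remap(0b00000100)))        # -> 0b100001: bit 2 moved to bit 5, bit 0 flipped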

4.2.2 Supporting Multiple Remappings

The BSU can support any address remapping globally over the entire address space. However, there are cases where only a portion of the address space is remapped or different parts of the address space have different remappings. Furthermore, as discussed previously, there are conditional permutations that change based on the portion of the dataset that a particular element falls into. Our goal is to allow different address mappings, and thus different data reorganizations, for different regions. A region is a partition of the memory address space. An application can occupy multiple regions if needed. A physically partitioned address space also allows exploiting spatial differences in the memory access patterns.

The address remapping unit (ARU), which extends the BSU as shown in Figure 4.9, supports multiple address remapping schemes simultaneously, including the conditional permutations (3.17)–(3.18). The configuration store keeps multiple configurations for the BSU. A configuration includes the full state of the BSU (i.e., B and c). When an access is enqueued in the scheduling FIFOs of the memory controller, the region bits from the address are used to index the configuration store. Then the corresponding entry is placed in the configuration register that configures the ARU for that particular access.
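The region-indexed configuration store can be modeled in a few lines as well. The Python sketch below assumes, purely for illustration, that the region is identified by the top region_bits of the address; in the actual design the region bits are a configurable field of the physical address.

# Software model of the ARU: per-region BSU configurations selected by region bits.

class ARU:
    def __init__(self, nbits, region_bits, configs):
        # configs[r] = (closed_switch, c) for region r; identity if absent.
        self.nbits = nbits
        self.region_bits = region_bits
        self.configs = configs

    def remap(self, addr):
        region = addr >> (self.nbits - self.region_bits)   # index into config store
        if region not in self.configs:
            return addr                                     # identity mapping
        closed_switch, c = self.configs[region]             # load into config register
        y = 0
        for i in range(self.nbits):
            y |= ((addr >> closed_switch[i]) & 1) << i
        return y ^ c

# Hypothetical: 16-bit addresses, 2 region bits, region 1 swaps bits 3 and 7.
perm = list(range(16)); perm[3], perm[7] = 7, 3
aru = ARU(nbits=16, region_bits=2, configs={1: (perm, 0)})
print(hex(aru.remap(0x4008)))    # region 1: bit 3 moves to bit 7 -> 0x4080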

The ARU can support all the remapping schemes presented in Section 3.1 via very simple hardware. The configuration store indexing and the read latency depend on the number of different mappings supported simultaneously and the number of memory regions. As these numbers increase, the ARU reconfiguration latency increases as well. The ARU architecture can be restructured to have multiple BSUs, as shown in Figure 4.10. This architecture allows actively pre-configured BSUs, so that when consecutive accesses are mapped into different regions, the remapped addresses can be determined without reconfiguring the ARU. Furthermore, BSU reconfigurations can be scheduled ahead of time and the reconfiguration latency can be overlapped while active BSUs serve other requests. Nevertheless, as we will see in the evaluations in Chapter 7, for a typical implementation, the overall latency and energy consumption of the combined configuration store indexing, BSU reconfiguration, and bit permutation is very low. Furthermore, the timing of the ARU is not on the critical path of the memory access since an access typically spends several clock cycles in the memory controller FIFOs waiting to be scheduled.

Figure 4.9: Address remapping unit (ARU) supporting multiple address mappings for different memory regions.

Figure 4.10: ARU with multiple BSUs.

4.3 Configuring HAMLeT

The domain-specific architecture of HAMLeT trades programmability for energy efficiency. Despite being domain-specific, it serves as a generic data reorganization engine that can be reconfigured to perform various combinations of permutations as building blocks. As presented previously, both the DRU and ARU units in HAMLeT can be reconfigured according to the target data reorganization.

Figure 4.11: From permutation specification to HAMLeT configuration: the overall permutation Q is restructured as Q = W P R; the bit representations of the memory access permutations W and R configure the ARU (address remapping) and the DMA units (DRAM address generation), while the submatrices derived from the local permutation P configure the LPU.

Figure 4.11 summarizes the configuration flow, starting from a target data reorganization expressed as a permutation down to the low-level configuration parameters that program the main components of HAMLeT. Synthesis from the data reorganization to the configuration parameters is automated using the Spiral-based toolchain introduced in Chapter 3. The frontend is used as described previously to rewrite the permutations and derive the index transformation. First, the data reorganization permutation is expressed as a formula in SPL and restructured. Then, formula constructs are labelled with the implied functionality. We extend the toolchain via a customized backend. Memory access permutations are further processed to derive the address remapping and memory address generation parameters. These parameters are then offloaded to configure the ARU and DMA units.

Local permutations, on the other hand, are further decomposed to derive the connection and address generator parameters to program the reconfigurable permutation memory.

The Spiral-based toolchain enables automated synthesis of the configuration parameters from a given data reorganization permutation. However, in order to fully integrate HAMLeT into the existing hardware/software ecosystem, we further extend the proposed flow and include additional hardware/software solutions. System-level hardware and software integration concepts will be elaborated later in Chapter 6.

Chapter 5

Fundamental Use Cases

The combination of the formal framework and the HAMLeT architecture presented in the previous chapters serves as an infrastructure to perform efficient data reorganization and address remapping. This chapter introduces two fundamental modes of operation that use the developed infrastructure as an enabling technology.

First, Section 5.1 presents the automatic mode, which enables a software-transparent dynamic data layout reorganization. In this hardware-based mode, the data layout is transformed physically in memory, where the whole operation and its side effects (e.g., address remapping, coherence, access scheduling) are completely handled in hardware.

Then, Section 5.2 introduces the explicit mode where the data reorganization is explicitly offloaded from software and accelerated by the HAMLeT in memory. This technique requires a collaboration between the hardware, software and OS which is handled by a custom software stack.

5.1 Automatic Mode

Motivation. DRAM based memory systems are hierarchically divided into fundamental blocks.

A conventional DRAM based main memory system features multiple channels, ranks, banks, rows, and columns. This hierarchy is expanded in modern DRAM systems with bank groups and sub-arrays [26, 72]. Moreover, 3D-stacked DRAM systems orthogonally extend this hierarchy with multiple layers and vaults. Memory controllers implement an address mapping policy to determine the mapping from the physical addresses to the DRAM "coordinates". The most widely used address mapping policies include the row and cache line interleaving schemes. The row interleaving scheme interleaves consecutive cache lines first into the same row of the same bank, then interleaves the rows among banks. The cache line interleaving scheme, in contrast, places consecutive cache lines into different banks/ranks/vaults/layers. There are various hybrid approaches and sophisticated address mapping techniques, such as permutation based interleaving [118], that aim to better exploit the memory parallelism and locality. Moreover, application-specific efficient address mapping schemes can be determined via profiling.

For conventional systems, however, memory controllers implement a global address mapping scheme that is optimized for the common case. Yet, the best performing address mapping scheme depends on a particular application's data layout and memory access pattern. Moreover, even a single application may exhibit spatially and temporally different memory access patterns. First, it can access disjoint parts of its dataset with varying access patterns. Second, the memory access behavior can change over time as the application progresses through different phases. Hence, a global and static address mapping can lead to an underutilized memory system.

Our goal is to enable a memory system which can employ multiple different address mappings for different regions of the address space. Further, it can change these address mappings on-the-fly, thus leading to a non-global and dynamic address mapping. Changing the address mapping at runtime requires the corresponding data to be physically reorganized in memory to retain the original program semantics.

Approach. We approach this problem via combined data reorganization and address remapping in hardware utilizing the HAMLeT architecture. As presented in Chapter 4, the HAMLeT architecture features the address remapping unit (ARU) and the data reorganization unit (DRU). The ARU handles the address remapping for multiple memory regions. Further, the DRU is specialized for efficient permutation based data reorganization. In the automatic mode of operation, the HAMLeT architecture is driven by a hardware based memory access monitoring system that profiles the memory access patterns of applications and configures the ARU/DRU units for address remapping and data reorganization. This enables software-transparent dynamic data layout operation, where both the profiling and the reorganization are handled in hardware.

Figure 5.1: Virtual address to DRAM coordinates (i.e., channel, bank, row, etc.) translation stages: virtual-to-physical translation is handled by software (OS), while the address remapping to DRAM coordinates is transparently handled in hardware.

5.1.1 Changing Address Mapping

Changing the address mapping is essentially a bit permutation on the memory address bits. The bit shuffle unit (BSU) in the ARU can perform all bit permutations. When the address mapping of a memory region changes, the ARU stores the corresponding BSU configuration. Given the BSU configuration (i.e., B and c as described in Section 4.2), it can translate the original input addresses to the remapped locations in hardware. This operation is essentially an indirection implemented in hardware. It is completely transparent to the virtual address space; the virtual-to-physical mapping remains the same after the reorganization (see Figure 5.1). Hence, all the entries in the page table, cached translations in the TLB (translation lookaside buffer), and cached copies of the reorganized data remain valid after the reorganization.
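As a concrete (and entirely hypothetical) example of such a remapping, assume a 16-bit physical address split as [row:8 | bank:2 | column:6]. If monitoring reveals that column bit 5 flips frequently while bank bit 6 rarely does, exchanging the two bits maps the streaming behavior onto a parallelism bit. The Python sketch below shows the effect on a single address; in hardware this exchange is just one BSU configuration.

# Example: changing the DRAM address mapping by swapping two address bits.
# Assumed (hypothetical) 16-bit layout: [row:8 | bank:2 | column:6].

def swap_bits(addr, i, j):
    bi, bj = (addr >> i) & 1, (addr >> j) & 1
    if bi != bj:
        addr ^= (1 << i) | (1 << j)
    return addr

old_addr = 0b0000000100100000        # row 1, bank 00, column 100000
new_addr = swap_bits(old_addr, 5, 6)  # column bit 5 and bank bit 6 exchanged
print(bin(new_addr))                  # -> 0b101000000 (row 1, bank 01, column 0)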

5.1.2 Physical Data Reorganization

When the address remapping is a bit permutation, the corresponding data reorganization is also a permutation, as shown in Chapter 3. Moreover, it belongs to the class of permutations which are constructed out of stride permutations. The DRU executes permutation based data reorganization efficiently.

Hence, in the automatic mode, DRU is configured to perform the data reorganizations.

Data reorganization can be in-place or out-of-place. When the data is transformed in-place, only a constant amount of additional memory, much smaller than the original dataset, is used. On the other hand, an out-of-place reorganization uses additional storage that is as large as the original dataset.

In-place permutation is a complex operation where the elements are swapped following the permutation cycles. Naive algorithms either use O(n^2) operations or O(n) extra storage for n-element arrays. There are also algorithms that use O(n log(n)) operations and O(log(n)) extra storage, which require complex index calculations [51, 61]. In contrast, out-of-place permutation requires additional storage equal to the size of the dataset, i.e., O(n), but it can perform the permutation using O(n) operations.
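The contrast between the two approaches can be sketched in a few lines of Python (illustrative only; perm[i] denotes the destination index of element i, and the in-place variant shown spends O(n) extra bits on visited flags rather than using the constant-storage index calculations cited above).

# Out-of-place vs. in-place (cycle-following) permutation of an n-element array.

def permute_out_of_place(data, perm):
    # O(n) operations, O(n) extra storage.
    out = [None] * len(data)
    for i, x in enumerate(data):
        out[perm[i]] = x
    return out

def permute_in_place(data, perm):
    # Follows permutation cycles. Shown with O(n) visited bits for clarity;
    # constant-storage variants replace this bookkeeping with index computation.
    visited = [False] * len(data)
    for start in range(len(data)):
        if visited[start]:
            continue
        i, carry = start, data[start]
        while True:
            j = perm[i]
            data[j], carry = carry, data[j]
            visited[i] = True
            i = j
            if i == start:
                break
    return data

perm = [2, 0, 1, 3]                                # element i moves to position perm[i]
print(permute_out_of_place(list("abcd"), perm))    # ['b', 'c', 'a', 'd']
print(permute_in_place(list("abcd"), perm))        # ['b', 'c', 'a', 'd']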

In the automatic mode, the overhead of the data reorganization is a primary concern. Latency and energy overheads need to be minimized to amortize the data reorganization cost. In-place algorithms require additional read-write operations which increase the reorganization latency as well as the energy consumption. Furthermore, in-place algorithms require irregular memory accesses to individual elements according to the permutation cycles. This destroys the locality of the memory access patterns, which causes low bandwidth and high energy consumption in the memory. Hence, out-of-place operation is a better fit for the automatic mode: it minimizes the reorganization overhead by eliminating additional memory accesses and by keeping the locality of the memory accesses.

An out-of-place operation permutes the data into a destination memory space which is disjoint from the source space. We call this operation move-only. If the reorganization requires the dataset to stay at its original address span, then the reorganized data is streamed back into the source space. We call this operation move-back. Note that this scheme requires an extra pass over the dataset to stream it back into its original memory space. Finally, the move-remap technique provides the best of both worlds. In this scheme, the data is reorganized into the destination memory space. Then, instead of moving the reorganized data back into its original place, the original memory region and the destination region are remapped such that the pointer to the original memory region now points to the destination region and vice versa. This eliminates the need for an extra pass over the dataset while keeping the data at its original memory space. These schemes are summarized in Figure 5.2.

Figure 5.2: In-place and out-of-place data reorganization schemes: (a) in-place, (b) move-only, (c) move-back, (d) move-remap.

Partitioning the memory address space into smaller regions enables a practical implementation for these techniques. In the automatic mode, permutations across memory regions are not allowed.

Data can be permuted within each region, and the regions themselves can move and swap with other regions. In this way the ARU can easily keep track of the region movements. It would be much more costly to follow freely moving address pointers with no boundaries. The partitioning bits in the DRAM address space are selected carefully to allow the important data permutations across different banks, vaults, and layers. Partitioning bits are selected from the mostly stationary row or column bits in the DRAM address. Thus, permutation within a region includes data movements across banks, vaults, and layers; only the movements across certain DRAM rows are blocked. As we will see later, automatic reorganization requires permutations that change the address mapping between bank, vault, layer, and row bits. Hence, allowing permutations only within regions does not limit the useful data reorganizations.

5.1.3 Memory Access Monitoring

In the automatic mode, the memory access stream is monitored to determine a possibly inefficient address mapping where the memory access pattern does not match the underlying data layout, causing underutilized memory parallelism and locality. Memory access monitoring determines the changes in the DRAM address bits. A certain bit in the address is said to be flipped if it has changed compared to its previous value. The monitoring hardware records the bit-flip rates (BFR) of the address bits in the memory access stream. A bit with high BFR implies frequent changes in a short time, which can be used to exploit parallelism. Mapping the bits with high BFR onto DRAM address bits that correspond to bank, rank, vault, or layer (which we refer to as parallelism bits) can improve the parallelism.

On the other hand, address bits with low BFR imply stationary behavior. Such bits are better suited to be mapped onto DRAM row bits (i.e., locality bits) to minimize row changes, hence row buffer misses, improving locality. Bit-flip rate based memory access profiling is a rough approximation for finding a better memory address mapping to exploit the parallelism and locality. Making robust decisions about address mapping changes using the BFR metric requires several design decisions, which are further elaborated later in Chapter 6. At this stage, the most important aspect of the memory access monitoring is that it detects an underutilized memory system in hardware. It allows issuing an address remapping with data reorganization to improve the memory system performance.

5.1.4 Host Application and Reorganization in Parallel

When the memory access monitoring detects an inefficient address mapping and issues the data reorganization, the entire dataset (or a few memory regions) of the application is physically reorganized in memory. During the reorganization, the DRU utilizes the memory bandwidth efficiently. It achieves close to the peak internal bandwidth utilization of the 3D-stacked DRAM by using its high-throughput parallel architecture. However, data reorganization still incurs latency and energy overheads. As we will analyze in detail later, the overhead of reorganization is amortized via the improved memory access performance during the long runtime of the application.

While the reorganization is taking place, the application itself may need to access the data in memory. To handle the host application and the reorganization executing in parallel, we employ two main techniques: block on reorganize (BoR) and access on reorganize (AoR).

Briefly, block on reorganize (BoR) locks the memory regions under reorganization such that only DRU requests are serviced, though independent processes can still access other memory regions. The main advantage of this method is that it enables a very fast reorganization: the DRU streams the data almost at the peak internal bandwidth utilization. Moreover, it eliminates the coherence and scheduling issues since only the DRU is allowed to access the data until the reorganization is finished. On the other hand, the host-side application is halted until the reorganization is finished.

The access on reorganize (AoR) method allows both the DRU and the host processor to access the data in memory. In this way, the host-side application can make progress while the data is under reorganization. However, this brings up a few side issues such as coherence and memory access scheduling. The original data and the reorganized data have to be kept coherent for correctness. Moreover, accesses from the DRU and from the host processor need to be interleaved to provide the desired quality of service to the host while minimizing the data reorganization duration. Later, in Chapter 6, we further elaborate on the proposed mechanisms to tackle these issues that arise when the host application and the reorganization are performed in parallel.

5.2 Explicit Mode

Motivation. For an application, given an address mapping scheme, the data layout in memory is a fundamental property that determines the memory system performance (bandwidth utilization, energy efficiency, power consumption, etc.). Address mapping schemes aim to improve memory parallelism and locality assuming a generic behavior from the application. However, if the elements that are sequentially needed by an application are placed into different rows of the same DRAM bank, neither address mapping nor access scheduling mechanisms can avoid the resulting row buffer misses. To maximize the memory parallelism and locality, the data layout needs to be compatible with the application's memory access pattern.

However, several data-intensive applications fail to utilize the available locality and parallelism due to inefficient memory access patterns and disorganized data placement in the DRAM. This leads to excessive DRAM row buffer misses and an uneven distribution of the requests to the banks, ranks, or layers, which yields very low bandwidth utilization and incurs significant energy overhead.

Memory layout transformation via data reorganization in memory targets the inefficient memory access pattern and the disorganized data placement issues at their origin.

Therefore, for some applications it can be more efficient to explicitly transform the data layout into an optimized format before the actual computation. Several previous works report improved performance with optimized data layouts [21, 36, 58, 92, 107, 108].

Moreover, several high-performance computing applications such as linear algebra computations, spectral methods, signal processing, and molecular dynamics simulations require data reorganization operations as a critical building block (e.g., matrix transpose, multidimensional dataset rotation, pack/unpack, permute) [12, 30, 55, 58].

Although these operations simply relocate the data in the memory, they are costly on conventional systems mainly due to inefficient access patterns, limited data reuse and roundtrip data traversal throughout the memory hierarchy.

Approach. The near-data processing approach minimizes the data transfer between the DRAM and the processor. We follow a two-pronged approach for efficient data reorganization, which combines (i) the DRAM-aware HAMLeT architecture integrated within the 3D-stacked DRAM, and (ii) a mathematical framework that is used to represent and optimize the reorganization operations. The explicit mode of operation, as opposed to the automatic mode, exposes HAMLeT to the user for explicit acceleration of the data reorganization operations. In this mode, the data reorganization is first expressed in the native language (SPL) of the Spiral based mathematical toolchain. Accompanied by a custom software stack for offloading and memory management, the toolchain derives the DRU/ARU configurations and programs HAMLeT.

5.2.1 Spiral Based Toolchain

Explicit operation starts with the data reorganization expressed in SPL. SPL formula expressions for certain common operations are further wrapped into high-level function interfaces. One can express the data reorganization in the native SPL language or use the high-level wrapped function interfaces, which are later translated into SPL. Then, the SPL formulas are passed to the proposed Spiral based toolchain. Spiral is used as a compiler for SPL. It restructures the SPL formulas using the rewrite rules. The optimized final form of the SPL expression specifies the local permutation, the memory access permutations, as well as the control flow through labelled formula constructs. Key parameters from these labelled constructs are packed into configuration words and offloaded to program HAMLeT.

5.2.2 Explicit Memory Management

In the automatic mode, HAMLeT reorganizes data in memory and handles all the side effects in hardware. However, offloading the operation from software requires explicit communication, synchronization, and memory management. First, the memory regions allocated to the accelerator and their boundaries have to be defined. The memory management software has to make sure that the accelerator only accesses data within the allowed boundaries. A custom memory management software specifically allocates the source and the destination memory spaces for the accelerator.

Furthermore, a specific memory space is allocated for synchronization and communication between the host and the accelerator. The host offloads the configuration data, including the HAMLeT architecture configuration and the allocated memory pointers and sizes, to this region. HAMLeT and the host processor share the main memory where the actual application data sits; hence, the data itself is not offloaded. The implementation of the accelerator memory management software, along with the side issues, is further elaborated in the next chapter.

Chapter 6

System Architecture and Integration

The previous chapter demonstrated two fundamental processing models that use the developed mathematical framework and the proposed architecture as an enabling technology. However, as noted there, achieving system-level integration requires a few changes at the software and hardware levels. This chapter presents a series of hardware and software mechanisms that assist the HAMLeT architecture in achieving a more flexible system-level integration.

The chapter starts with the design choices regarding the near-data processing architecture in Section 6.1. It discusses accelerator placement in addition to the hardware mechanisms that assist the HAMLeT architecture for system-level integration. Section 6.2 focuses on the issues that arise when the host and HAMLeT operate in parallel. Then Section 6.3 discusses the host architecture/software and the near-data processing integration issues. Finally, Section 6.4 puts together all these components and gives an overview of the proposed data reorganization system.


Figure 6.1: Accelerator integration options to a 3D-stacked DRAM based system for near-data processing: (1) off-stack, (2) stacked as a separate layer, (3) integrated in the logic layer.

6.1 Memory Side Architecture

6.1.1 Interface and Design Space

The HAMLeT architecture is designed targeting a near-data processing system where the accelerator is integrated in the logic layer of a 3D-stacked DRAM behind the conventional interface. However, near-data processing can be achieved using several different integration options with different proximity levels between the memory and the accelerator. Taking the processor core and the DRAM modules as the two ends of the spectrum, accelerators can be integrated at various places in between. These options provide design tradeoffs related to latency, bandwidth, thermal issues, and manufacturing cost. Figure 6.1 demonstrates different integration options for near-data processing accelerators. Note that there are also several places in the processor die itself to integrate the accelerator, where it would share various levels of the cache hierarchy, interconnect, memory controllers, etc.

However, this thesis focuses on accelerators integrated near main memory—on-chip accelerators are not included in our taxonomy.

Off-stack. The first option is off-stack integration (option 1 in Figure 6.1). In this scheme, the accelerator is a separate die communicating with the 3D-stacked DRAM via off-chip connections. This minimizes the thermal interaction between the accelerator and the stacked DRAM. The accelerator dissipates its heat almost entirely through its own die surface. Moreover, the 3D-stacked DRAM design is kept unmodified. Eliminating customized hardware modifications minimizes the overall manufacturing cost. However, the accelerator uses the off-chip connections for communicating with the DRAM. Depending on the 3D-stacked DRAM technology, this connection could be high-speed short-reach links or an interposer substrate. Off-chip bandwidth is more limited compared to the internal bandwidth provided by TSV based connections. Moreover, the limited off-chip bandwidth, which also serves the host processor, is now shared with the accelerator. More critically, external communication requires significant energy compared to internal TSV based data transfers. For HMC, SerDes based off-chip communication accounts for more than half of the total stack energy.

Separate layer. Another integration option, stacking the accelerator as a separate layer onto the stacked DRAM, is shown as option 2 in Figure 6.1. The accelerator is implemented as a separate die which communicates with the rest of the stack using TSV based vertical connections. However, it still needs a load/store interface that handles the DRAM communication. To handle this issue, it can either use the base logic layer to communicate with the DRAM or implement separate memory/vault controllers. Additional memory/vault controller units will consume extra area and power in the accelerator layer. Furthermore, the DRAM layers will be driven by multiple controllers, so there needs to be a TSV bus arbitration scheme which allocates the TSVs to the commands coming from the accelerator layer and the base logic layer. If the accelerator layer uses the base logic layer as an interface to the DRAM, the need for separate memory/vault controllers in the accelerator layer disappears. In this case, the requests coming from the accelerator layer consume TSV cycles while sending the requests/data to the logic layer as well as receiving the data back from the logic layer. Though option 2 in Figure 6.1 shows the accelerator stacked at the top, it could also be placed directly on top of the base logic layer. In this case, there can be separate connections between the logic layer and the accelerator layer, possibly using micro-bumps between these two layers. Hence, the bandwidth pressure on the TSVs due to the communication between the accelerator and the logic layer can be relaxed. However, sandwiched between the DRAM layers and the base logic layer, the accelerator layer will potentially suffer from heat dissipation issues with this approach. Furthermore, integration as a separate layer increases the manufacturing cost and complexity of the memory stack.

In logic layer. As demonstrated with option 3 in Figure 6.1, the accelerator can be moved closer to the memory by integrating it in the base logic layer. In this scheme, the accelerator directly communicates with the logic layer memory controllers. Therefore, it does not require additional data and command transfers through the TSV bus to communicate with the logic layer. Note that both off-stack integration and stacking the accelerator on top of the logic layer with separate microbump connections also allow direct communication between the accelerator and the logic layer. However, these solutions require very high-cost hardware modifications such as separate dies, microbumps, and interposer connections. Integration in the logic layer is much more cost effective compared to the previous schemes, though it requires simple modifications in the logic layer. Moreover, eliminating multiple hops through the TSVs or external accesses through interposer/SerDes links greatly improves latency, bandwidth, and energy efficiency. The main shortcoming of this approach, however, is the limited area and power consumption headroom for a custom accelerator implementation. The logic layer already includes native control units such as an interconnection fabric, memory/vault controllers, SerDes units, etc. It is reported that these units leave real estate that can be taken up by custom logic [94]. However, compared to a separate die, the area and power consumption headroom is more limited. This scheme is a better fit for simple accelerators with low power consumption and area overheads.

Integrating HAMLeT. The main motivation of the HAMLeT architecture is to achieve high throughput and bandwidth utilization using the limited area and power budget very efficiently. The DRU and ARU units are very area and energy efficient, as we will see later in detail, yet they can effectively utilize the high bandwidth through a simple parallel architecture. This allows us to pursue option 3 for the 3D-stacked DRAM based HAMLeT implementation. Figure 6.2 demonstrates the logic layer architecture for this option. The logic layer resembles an HMC-like 3D-stacked DRAM which features the native control units and an interconnection network that connects these control units. Figure 6.2 also highlights the BIST (built-in self test) unit attached to the interconnect. This demonstrates that heterogeneous components which extend the fundamental operation of the stacked DRAM already exist in the logic layer. There are other proposals that follow a similar approach, where accelerator blocks are attached to the interconnect fabric in the logic layer [88]. When the accelerator is attached to the interconnect which connects all the memory/vault controllers and I/O links, it can communicate with any of the vaults via the corresponding memory controller. This creates a flexible infrastructure for accelerators to communicate with different vaults of the DRAM layers. However, it complicates the interconnect design significantly. For example, the HMC interconnect is a crossbar switch [94]; attaching more accelerators to the fabric incurs a quadratically increasing cost.

Figure 6.2: Logic layer architecture for integrating HAMLeT into the 3D-stacked DRAM.

The DRU unit in HAMLeT features switch networks which allow it to route any input/output to/from any SRAM bank. Furthermore, a rearrangeably non-blocking switch network, which is much less costly than a crossbar switch, is sufficient for the purpose of permutation based data reorganizations. Hence, HAMLeT is integrated even behind the interconnect, directly connected to the memory controllers as shown in Figure 6.2. Each port of HAMLeT has a single point-to-point connection to a memory controller. This enables consuming the high-bandwidth data transferred by each independent vault. Necessary data exchanges are handled within HAMLeT by the DRU.

The integration scheme given in Figure 6.2 also requires modifications to the memory controllers in the logic layer. Now the memory controller needs to serve the data reorganization requests in addition to the regular memory requests coming from the host processor and from the other accelerators. Handling the host requests and the reorganization requests in parallel efficiently will be discussed next in Section 6.2.

Figure 6.3: Physical address bit-flip rate (BFR) monitoring unit for multiple memory regions.

6.1.2 Implementing Memory Access Monitors

As discussed in Chapter 5, in the automatic mode of operation, the memory access stream of the host applications is monitored to determine possibly inefficient address mappings where the physical access patterns do not utilize the DRAM parallelism and locality. To determine such inefficient DRAM address mapping cases, we inspect the change rate, or bit-flip rate (BFR), of the DRAM address bits, as introduced before. Bits with high BFR in the physical address can be mapped onto the parallelism bits in the DRAM address (i.e., bank, vault, layer). On the other hand, bits with low BFR should be mapped onto the locality bits (i.e., the row bits in the DRAM address).

Figure 6.3 demonstrates the memory access monitoring hardware that determines the BFR in the physical address stream. The basic operation of the bit-flip monitor is to XOR the current address with the previous reference to determine which bits have flipped. Each flipped bit increments a counter that tracks the number of changes of that particular bit. The unit counts the total number of requests in epochs. At the end of every epoch, the flip count of every bit is normalized by the total number of accesses to generate the bit-flip histogram. An example bit-flip histogram is given in Figure 6.4.

Figure 6.4: An example bit-flip histogram (normalized BFR vs. bit position) of a memory access stream (facesim from PARSEC [32]).
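A software model of this monitor is straightforward (Python sketch; the epoch length and the number of tracked address bits below are arbitrary choices, not the values used in the hardware).

# Bit-flip rate (BFR) monitor: XOR consecutive addresses, count flips per bit,
# and normalize by the number of accesses at the end of each epoch.

class BitFlipMonitor:
    def __init__(self, nbits, epoch_len):
        self.nbits = nbits
        self.epoch_len = epoch_len
        self.prev = 0
        self.accesses = 0
        self.flips = [0] * nbits

    def observe(self, addr):
        diff = addr ^ self.prev            # which bits changed vs. the previous access
        self.prev = addr
        for b in range(self.nbits):
            self.flips[b] += (diff >> b) & 1
        self.accesses += 1
        if self.accesses == self.epoch_len:
            hist = [f / self.accesses for f in self.flips]   # normalized BFR histogram
            self.flips = [0] * self.nbits
            self.accesses = 0
            return hist                    # hand the histogram to the decision logic
        return None

mon = BitFlipMonitor(nbits=8, epoch_len=4)
for a in (0x10, 0x11, 0x12, 0x13):         # a small streaming access pattern
    hist = mon.observe(a)
print(hist)                                 # bit 0 flips often, high bits almost never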

There are various methods to use the BFR histograms for deciding a DRAM address remapping.

In interpreting the BFR values, our goal is to devise decision mechanisms that are robust, easily implementable in hardware, and capture the application behaviour realistically. For that purpose, we have a multi-step decision mechanism that uses the BFR histogram in various ways to find the most efficient address remapping.

BFR value based. First, the decision logic implements BFR value based stuck and streaming detectors. These are essentially per-address-bit comparators that check whether the BFR value is larger or smaller than certain threshold values. If a BFR value is larger than a pre-set streaming threshold, that bit is marked as streaming. Similarly, if the BFR value of a bit is smaller than the stuck threshold, it is marked as stuck. Ideally, a stuck bit in the physical address should be mapped onto one of the locality bits in the DRAM address. A stuck bit mapped onto the parallelism part results in underutilized parallelism. Similarly, a streaming bit in the locality part of the DRAM address leads to excessive row buffer misses. If the streaming or stuck behavior of such bits is consistent for a few epochs, then they are strong candidates for remapping.

BFR ratio based. For some memory access patterns, there are no extremely high or low BFR values in the physical address stream, in other words, no streaming or stuck bits. In such cases, a BFR ratio based decision mechanism is utilized. The ratio between the highest BFR value among the locality bits and the lowest BFR value among the parallelism bits is calculated and compared against a pre-set threshold. If this BFR ratio is greater than the threshold consistently for several epochs, the corresponding bit pair is considered a candidate for remapping. This operation is repeated to find other bit pairs whose BFR ratio is greater than the threshold.
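The multi-step decision logic can be sketched as follows (Python; the thresholds, the bit-field split, and the omission of the multi-epoch consistency check are simplifications for illustration).

# BFR-based remapping decision: value-based detectors first, ratio test second.

STREAM_THRESH = 0.6    # placeholder thresholds
STUCK_THRESH  = 0.05
RATIO_THRESH  = 4.0

def remap_candidates(hist, parallelism_bits, locality_bits):
    # Value-based: a stuck bit sitting in the parallelism field, or a
    # streaming bit sitting in the locality field, is an obvious candidate.
    stuck_par  = [b for b in parallelism_bits if hist[b] < STUCK_THRESH]
    stream_loc = [b for b in locality_bits if hist[b] > STREAM_THRESH]
    pairs = list(zip(stuck_par, stream_loc))
    if pairs:
        return pairs
    # Ratio-based fallback: compare the busiest locality bit against the
    # idlest parallelism bit.
    hi = max(locality_bits, key=lambda b: hist[b])
    lo = min(parallelism_bits, key=lambda b: hist[b])
    if hist[lo] > 0 and hist[hi] / hist[lo] > RATIO_THRESH:
        return [(lo, hi)]
    return []

# Hypothetical 8-bit DRAM address: bits 4-5 select the bank (parallelism),
# bits 6-7 select the row (locality).
hist = [0.7, 0.6, 0.3, 0.2, 0.02, 0.3, 0.65, 0.01]
print(remap_candidates(hist, parallelism_bits=[4, 5], locality_bits=[6, 7]))
# -> [(4, 6)]: swap the stuck bank bit 4 with the streaming row bit 6.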

Both the BFR value based and BFR ratio based thresholds can be reconfigured during the runtime.

They are initialized to empirically determined values. These values can be reconfigured explicitly by the user to a desired value when the configuration is offloaded. Furthermore, they can be adjusted dynamically based on the average maximum and minimum BFR rates observed over a period of time.

In both of these techniques, the BFR is observed over a period of time, for several epochs, to make a robust decision. This period can be lengthened to coalesce multiple possible reorganizations into a single one. Data reorganization requires a full pass over the dataset that is reorganized. Multiple separate data reorganizations require multiple passes. Yet, separate reorganizations can be combined into a single permutation (see rule 3.15) which can be implemented as a single pass over the dataset. Therefore, issuing a reorganization is delayed for possible coalescing if there is another bit remap candidate being monitored but not yet observed long enough.

6.2 Handling Parallel Host Access and Reorganization

Near-data processing creates a heterogeneous computing system where the DRAM is shared between multiple processing elements. These processing elements generate memory access request streams with very different characteristics. These characteristics include request rate (memory intensity), bandwidth and latency sensitivity, parallelism and locality in the memory accesses, etc.

The memory system needs to maximize the overall throughput and provide fair service to all of the request sources based on these characteristics. We analyze this concept in the context of parallel data reorganization and host memory access.

HAMLeT is integrated in the logic layer of the 3D-stacked DRAM behind the interconnect, directly interfaced to the memory/vault controllers as shown in Figure 6.2. Data reorganization requests are directly submitted to the memory controller, where they are buffered in a separate queue. The memory controller decides which request to schedule from the regular queue and the reorganization queue. To this end, the main goals are: (i) extract the highest overall throughput, (ii) devise efficient mechanisms that provide control over the bandwidth allocation to different sources, and (iii) provide a minimum service guarantee to the host application while the data is transparently reorganized.

6.2.1 Block on Reorganization (BoR)

The first mechanism provides the highest priority to the data reorganization requests. In the BoR technique, when a memory region is under reorganization, all the regular memory accesses to that region are blocked. Independent processes can still access other memory regions. This allows a very fast reorganization operation by minimizing the interference from the host memory accesses. Data reorganization requests are highly memory intensive and regular. Regular and intensive memory accesses efficiently exploit the parallelism and locality of the stacked DRAM layers. Hence, prioritizing the DRAM-friendly access patterns of the data reorganization requests greatly reduces the latency overhead of the data reorganization. Moreover, the memory regions under reorganization are only accessed by HAMLeT, which eliminates the coherence problem. This simplifies the scheduling algorithm and the memory controller design since it does not have to handle any coherence updates.

However, this operation stalls the host processor for a period of time. The host processor is not allowed to access the memory while the reorganization is taking place. As we will see later, the latency of the reorganization can be on the order of milliseconds. This overhead is comparable to a potential page fault in the program. Nevertheless, the overhead of the reorganization is amortized during the long runtime of the application. However, this can still be problematic for latency-sensitive critical applications.

6.2.2 Access on Reorganization (AoR)

The previous scheme, BoR, aims for simplicity and maximum reorganization throughput by blocking the host accesses when data reorganization is taking place. AoR, in contrast, allows host requests and reorganization requests to be serviced in parallel. The main goals of this technique are to devise efficient mechanisms that provide control over the bandwidth allocation to different sources and to extract maximum overall throughput while providing a minimum service guarantee to the host application.

Fine-grain interleaving. First, we consider a fine-grain interleaving scheme to maximize the request issue rate. The abundant parallelism available within the stacked DRAM system creates the opportunity to issue several requests to different banks, vaults, and layers. Low-latency and high-bandwidth vertical TSV connections create a substrate that supports very high request rates. Furthermore, multiple banks, vaults, and layers create parallelism that can service several requests concurrently. A fine-grain interleaving of multiple request streams can utilize this substrate to issue several requests both spatially and temporally.

However, the main disadvantage of this technique is that by interleaving the memory access streams at a fine granularity it does not preserve their locality. As discussed previously, data reorganization requests are highly memory intensive and regular. Furthermore, they rely on efficiently exploiting the parallelism and locality of the DRAM to achieve high bandwidth. Mostly irregular host accesses interfere with the DRAM-friendly access patterns of the data reorganization. This leads to excessive row buffer misses and wasted TSV cycles while waiting for the DRAM banks to precharge and activate new rows. Hence the increased request issue rate is not necessarily reflected as improved bandwidth utilization.

Request batching. In order to preserve the locality of the access streams, request batching is performed where multiple requests from each memory access stream are bundled together to form a batch. Then batches are interleaved instead of individual requests. Within the batch boundaries, the memory access locality and parallelism of the corresponding access stream are preserved. Interleaving the batches themselves, on the other hand, enables sharing the DRAM bandwidth among different sources. Furthermore, the size of the batches determines the bandwidth sharing ratio between the sources. Hence, adapting the size of the batches for different access streams provides an efficient control mechanism for allocating bandwidth to different sources.

Figure 6.5: Overview of request batching with three sources.

However, coarse-grain interleaving with batch scheduling does not utilize the request issue slots as efficiently as fine-grain interleaving. If there are no available requests that can be scheduled at a given time instant while serving the requests from a batch, no request is scheduled; the batches from other request streams are not searched. Nevertheless, this tradeoff is beneficial to preserve the locality and parallelism within batches.

An overview of the request batching technique is given in Figure 6.5. In our case there are two main memory access streams: host requests and data reorganization requests. As shown in Figure 6.2, HAMLeT is directly connected to the memory controllers in the logic layer. The memory controller queues the reorganization requests in a separate buffer and continuously forms batches from both of the request streams in order. Then a batch selection unit arbitrates between the sources and forwards the selected batch down to the memory access scheduler. The access scheduler is decoupled from the batching and implements out-of-order DRAM access scheduling. Target batch sizes are configurable. One can allocate more bandwidth to one of the sharers by increasing its target batch size. This provides a very effective knob to adjust the bandwidth sharing. By making sure that the batch size for the host stays above a certain minimum, the memory controller provides a minimum service guarantee to the host processor during data reorganization.

Note that the host request stream is not as memory intensive and regular as the reorganization request stream. Hence, forming a batch with the targeted number of memory requests for the host access stream can take a very long time since its memory intensity is low. A strict request-count-based batching can therefore lead to inefficient batches. Request batching is assisted with preemptive time-slicing to avoid these cases. When the batches are being formed, if the allocated time slice expires before the targeted number of requests has been bundled together, the batch closes with the existing number of requests. This leads to batches with varying numbers of requests within an access stream, which is also exemplified in Figure 6.5.
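The batching policy can be summarized with the Python sketch below (the batch sizes and time-slice length are placeholders, and the out-of-order access scheduler downstream of the batch selector is not modeled).

# Request batching with preemptive time slicing for two request streams.

def make_batch(queue, target_size, time_slice):
    # Bundle up to target_size requests; close the batch early if the
    # allocated time slice expires before enough requests arrive.
    batch = []
    for cycle in range(time_slice):
        if queue and len(batch) < target_size:
            batch.append(queue.pop(0))
        if len(batch) == target_size:
            break
    return batch

def interleave(host_q, reorg_q, host_target, reorg_target, time_slice):
    # Alternate batches from the two sources; batch sizes set the bandwidth
    # split, and a nonzero host_target gives the host a minimum service share.
    schedule = []
    while host_q or reorg_q:
        schedule += make_batch(host_q, host_target, time_slice)
        schedule += make_batch(reorg_q, reorg_target, time_slice)
    return schedule

host  = [f"H{i}" for i in range(4)]          # sparse host stream
reorg = [f"R{i}" for i in range(12)]         # dense reorganization stream
print(interleave(host, reorg, host_target=2, reorg_target=6, time_slice=8))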

Overall, request batching with time slicing is a robust technique that provides efficient control over allocating the shared DRAM bandwidth to different request streams. Compared to fine-grain interleaving, there can be underutilized request issue slots when there are not enough requests in a batch, especially for the host request batches. However, it preserves the locality and parallelism of each individual request stream. Furthermore, by sizing the batches properly, it can provide a minimum service guarantee to the host application while the data is transparently reorganized.

6.2.3 Handling Memory Coherence

Access on reorganization (AoR) allows the data reorganization to take place while the host processor is accessing the same data. This allows sharing the DRAM bandwidth for a parallel operation; however, it also creates side effects, including a memory coherence problem. For the BoR operation, when the reorganization is finished, the ARU handles the address remapping of the reorganized data, hence it does not create any coherence problem. For the AoR operation, however, the data is read from a source region and written into a destination region in the reorganized form while the reorganization is taking place, so an update from the host to the source region can create a coherence problem.

In order to handle the coherence issue, any update (i.e., write) to the source region needs to propagate to the target region before the reorganization is finished. The reorganized region is not visible to the host processor until the reorganization is completed; hence, the synchronization point is the end of the reorganization. Note that updates to other regions do not create any coherence issue and can be committed freely.

Coherence Merge Buffer (CMB). In addition to the regular scheduling queues, the memory controller is modified to implement a coherence merge buffer (CMB). Incoming writes to a source region that is under reorganization are copied into the CMB as coherence update requests. The CMB is drained to propagate the coherence update requests to the target region. A data reorganization is considered completed when both the data reorganization requests and the coherence update requests in the CMB are finished.

When a coherence update request is inserted into the CMB, the controller first checks whether there is any existing request in this time-ordered buffer that goes to the same target address. The final value of a memory location is determined by the latest written value. This corresponds to a write-after-write (WAW) situation where the initial value is overwritten by the later one. Therefore, in such cases, the new request is enqueued while the older request in the buffer is dropped. This reduces the number of redundant requests.

Furthermore, whenever a data reorganization request is scheduled, the CMB controller also checks the CMB for any coherence update to the same target address that is waiting to be serviced and is older than the currently scheduled reorganization request. Such coherence updates are also redundant, since the scheduled reorganization request reads the most recent copy from the source region, and they are dropped from the CMB.
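To make the two filtering rules above concrete, the following sketch models the CMB as a simple, unordered buffer in software. The data structure and function names are hypothetical; the real controller operates on hardware queues, but the merging and dropping logic follows the same idea.

/* Hypothetical software model of the coherence merge buffer (CMB). */
#include <stdint.h>
#include <stdbool.h>

#define CMB_ENTRIES 64

typedef struct {
    bool     valid;
    uint64_t target_addr;   /* remapped (destination) address of the update */
    uint64_t data;          /* payload carried by the host write            */
} cmb_entry_t;

static cmb_entry_t cmb[CMB_ENTRIES];

/* Rule 1: on insert, drop an older entry to the same target address (WAW). */
void cmb_insert(uint64_t target_addr, uint64_t data)
{
    int free_slot = -1;
    for (int i = 0; i < CMB_ENTRIES; i++) {
        if (cmb[i].valid && cmb[i].target_addr == target_addr)
            cmb[i].valid = false;           /* older value is overwritten */
        if (!cmb[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot >= 0) {
        cmb[free_slot].valid = true;
        cmb[free_slot].target_addr = target_addr;
        cmb[free_slot].data = data;
    }
    /* else: buffer full; the scheduler would drain the CMB first. */
}

/* Rule 2: when a reorganization request to target_addr is scheduled, the
   pending updates to the same address become redundant and are dropped.  */
void cmb_drop_on_reorg(uint64_t target_addr)
{
    for (int i = 0; i < CMB_ENTRIES; i++)
        if (cmb[i].valid && cmb[i].target_addr == target_addr)
            cmb[i].valid = false;
}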

CMB requests are scheduled first when there is a row hit to an open bank, and otherwise when the buffer is almost full. This scheduling mechanism tries to minimize the interference between the CMB requests and the existing host and reorganization requests. It also keeps the CMB generally occupied to maximize the chance of dropping redundant coherence requests via the previously described techniques.

Multi-stage Reorganization. Until now, we assumed that the data reorganization is an atomic operation: the outcome of the data reorganization is made visible via address remapping only when the entire operation is completed. The atomicity assumption simplifies handling the address remapping, scheduling the requests, and the hardware implementation. However, it has side effects on the coherence updates. During the data reorganization, any incoming write to the source memory region also generates a coherence update to the destination region. Due to atomicity, these coherence updates are required until the data reorganization is completely finished.

Figure 6.6: Handling the host memory accesses in the multi-stage reorganization scheme.

In the multi-stage reorganization, the goal is to divide the memory region under reorganization into coherence partitions. When a coherence partition is accessed by the DRU, it is marked as modified. Once the reorganization of a coherence partition is completed, it is marked as reorganized. The reorganized and modified status is kept in the coherence partition reorganized (CPR) and coherence partition modified (CPM) registers. Figure 6.6 demonstrates three different memory access scenarios in the multi-stage reorganization. Reorganized coherence partitions (i.e., CPR=1, CPM=1) are opened to host access even though the entire data reorganization is incomplete. In this case (case 1 in Figure 6.6), incoming requests from the host are forwarded via the ARU to their remapped locations. In this way, the host processor can access all the reorganized coherence partitions without generating additional coherence updates. Furthermore, if the host accesses a coherence partition that is unmodified (i.e., CPR=0, CPM=0), then the access is directly forwarded to the source region (case 3 in Figure 6.6). This access does not generate any coherence update since the coherence partition has never been accessed by the DRU. If a coherence partition is modified but not yet reorganized (i.e., CPR=0, CPM=1), then the access generates a coherence update which is inserted into the CMB (case 2 in Figure 6.6). The overall algorithm for determining the coherence updates and the address forwarding (i.e., the BSU configuration) is summarized in Table 6.1.

// get the configuration for the region of the access
oldConfig = configStore[memAccess.region];
// check if the accessed region is under reorganization
if (oldConfig.underReorganization == true) {
    // get the coherence partition for the access
    coh_part = getCohPart(memAccess);
    // check CPR to see if the reorganization is finished for the partition
    if (CPR[coh_part] == true) {            // coherence partition reorganized
        BSUconfig = newConfig.BSUconfig;
    } else if (CPM[coh_part] == false) {    // coherence partition unmodified
        BSUconfig = oldConfig.BSUconfig;
    } else if (CPM[coh_part] == true) {     // coherence partition modified
        BSUconfig = oldConfig.BSUconfig;
        CMB.insert(memAccess);              // insert a coherence update
    }
} else {
    // the accessed region is not under reorganization
    BSUconfig = oldConfig.BSUconfig;
}

Table 6.1: Algorithm for determining the BSU configuration and coherence update for the host memory accesses in the multi-stage reorganization scheme.

Multi-stage reorganization reduces the coherence updates significantly, as we will analyze later in detail. However, the CMB controller needs to keep track of the reorganization requests and the coherence partitions. Further, the ARU needs to support address mapping based on the coherence partitions.

6.3 Host Architecture/Software Support

For the physical data reorganization driven by the memory controller, the user software and OS are kept unchanged. In this use case, the memory controller monitors the memory accesses in the physical domain and issues a physical data reorganization. Hence, the data reorganization only requires a physical-to-physical address remapping, which is completely handled by the ARU.

Physical data migration has been studied in previous work [47, 100, 107]. These approaches use a lookup table implemented in the memory controller to store the address remappings for the data movements. Such an address remapping lookup table is not scalable to large-scale and fine-grain data reorganizations; hence, these approaches only focus on moving data in OS-page-sized chunks. Our technique can support a data granularity of a single access (e.g., 32 bytes) at much lower hardware cost, since it captures the entire remapping information in a single closed-form expression which is implemented by the generic ARU.
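As a rough illustration of why a closed-form remapping is cheaper than a lookup table, the sketch below applies a fixed bit permutation to a physical address. The perm[] array is a hypothetical example configuration; the actual ARU evaluates the permutation derived from the formal framework.

/* Hypothetical sketch: address remapping as a closed-form bit permutation.
   perm[i] gives the source bit position that drives output bit i, so no
   per-block lookup table is required.                                     */
#include <stdint.h>

#define ADDR_BITS 32

uint64_t remap_address(uint64_t phys_addr, const int perm[ADDR_BITS])
{
    uint64_t out = 0;
    for (int i = 0; i < ADDR_BITS; i++) {
        uint64_t bit = (phys_addr >> perm[i]) & 1ULL;
        out |= bit << i;
    }
    return out;
}

A permutation that, for example, swaps two row bits with a vault bit and a layer bit is expressed by exchanging the corresponding entries of perm[], independent of the size of the reorganized region.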

6.3.1 Memory Management for NDP

For the explicit offload, the host architecture should provide the memory management for both correct and efficient data reorganization in memory. From a system architecture perspective, a host processor is connected to multiple memory stacks where each stack contains HAMLeT units capable of data reorganization in memory. In such a configuration, each stack is responsible for reorganization of the local data in the stack. When a data reorganization operation is offloaded, the data to be processed should reside in the local stack. Moreover, the processed data should be the most recent copy to avoid coherence problems. Finally, it should stay in the local stack during the reorganization. The NDP memory management (NMM) system should provide these guarantees for both correctness and efficiency. In this thesis we utilize the NMM system proposed in [59].

First, the NMM should make sure that the data to be processed resides in the local stack. For that purpose, a physically contiguous memory region is allocated in the kernel. Then, the NMM maps the physically contiguous region into the virtual memory space via the mmap system call. Hence, the accelerator configuration and commands, as well as data, can be written from the user side to the allocated memory space. Furthermore, the status of the accelerator and the generated results can be read from this memory space, so both the host and the accelerator share the same virtual address space. During the mmap call, the allocated pages are pinned in memory so that they will not be swapped out when the operation is offloaded. Directly mapping contiguous virtual memory regions into contiguous physical memory regions for specific data structures has also been studied in different contexts [29].

To improve accessibility from the user software, library-based NMM functions are also provided. These functions include malloc, free, plan and execute. The memory allocation functions malloc and free only allocate memory space from the local stack corresponding to the accelerator. The accelerator descriptor, in our case the data structures for the DRU and ARU configuration, is represented by the plan. Finally, execute writes the configuration and commands to the specific command memory space via a memcpy call. By using these user-level memory management functions, the configuration data including the HAMLeT architecture configuration, allocated source/destination memory pointers, and sizes are transferred to the memory. The host does not offload the data itself since it already resides in the local stack.
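A minimal usage sketch of these library functions is shown below (see also Figure 6.7). The function names follow the NMM interface listed in Figure 6.7; the parameter list of nmm_plan is not spelled out in the text, so the arguments used here (an operation code, source/destination pointers, and geometry) are assumptions for illustration only.

/* Sketch: offloading an out-of-place reorganization via the NMM library.
   Declarations below stand in for the NMM interface of Figure 6.7; the
   nmm_plan argument list and NMM_OP_TRANSPOSE are assumed placeholders.   */
#include <stddef.h>

typedef void *plan_t;                       /* opaque accelerator descriptor */
enum { NMM_OP_TRANSPOSE = 1 };              /* assumed operation code        */

void  *nmm_malloc(long size);
void   nmm_free(void *addr);
plan_t nmm_plan(int op, void *src, void *dst,
                size_t rows, size_t cols, size_t elem_size);  /* assumed */
void   nmm_execute(plan_t plan);

int offload_transpose(size_t rows, size_t cols, size_t elem_size)
{
    size_t bytes = rows * cols * elem_size;

    /* Allocate source and destination in the local memory stack. */
    void *src = nmm_malloc((long)bytes);
    void *dst = nmm_malloc((long)bytes);
    if (!src || !dst)
        return -1;

    /* ... host fills src in the native layout ... */

    /* Build the accelerator descriptor (ARU/DRU configuration) and offload. */
    plan_t plan = nmm_plan(NMM_OP_TRANSPOSE, src, dst, rows, cols, elem_size);
    nmm_execute(plan);

    /* ... host continues; reorganized data is read through the ARU ... */

    nmm_free(src);
    nmm_free(dst);
    return 0;
}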

Avoiding Cache Flush. In [59], the host also issues a cache flush to ensure that the accelerator accesses the most recent copy in the memory stack, avoiding coherence problems. This is a necessary operation for correctness in general NDP-based hardware acceleration. A data reorganization, on the other hand, only rearranges the data without using the actual values; therefore it does not require the most recent copy of the data. This implies that while the data is being reorganized in main memory, the local caches can hold dirty data. Hence, the costly cache flush operation can be avoided for the in-memory data reorganization. HAMLeT features the ARU, which automatically forwards accesses to the remapped locations. Hence, any read access that is not filtered by the caches is forwarded to the corresponding remapped location, which holds the valid value of the data. Dirty cache line write-back requests, however, can change the contents of reorganized data. These requests are properly propagated to the original and/or remapped location(s) by the coherence merge buffer (CMB), as discussed in Section 6.2.

Explicit offload for near-data processing is known to be costly, and the cache flush is the longest operation. It writes any dirty cache line back to the main memory and invalidates the valid cache lines so that future references go to the main memory. Modern ISAs such as x86 provide instructions for flushing or invalidating specific cache lines (CLFLUSH) or the entire cache hierarchy (WBINVD, INVD) [17]. Nonetheless, searching or keeping track of the cache lines from the source and target regions of the NDP requires additional complexity. Overall, flush and invalidate operations result in a long latency that is on the critical path of the NDP. Furthermore, they consume bandwidth and energy. To handle the coherence easily by avoiding the invalidate and flush operations, [50] makes the NDP memory region non-cacheable. However, in this case, after the in-memory accelerator finishes processing, the resulting data needs to be moved to a cacheable region, otherwise exploiting the locality is sacrificed. In our case, on the other hand, the costly cache flush operation is eliminated while keeping the NDP memory regions cacheable.

Finally, for the automatic mode of operation, all the side effects of the data reorganization are handled completely in memory via HAMLeT. However, fast out-of-place data reorganization requires a free memory region, as discussed previously. The reorganized data is first mapped to the free region. Then, after the data is reorganized, the source region is labeled as the free region. Hence, the free region is allocated once in the beginning and is passed around transparently by HAMLeT. The NMM system allocates the free memory region for the automatic mode of operation and passes it to HAMLeT before it is enabled for automatic reorganization.

6.4 Putting It Together

Figure 6.7 summarizes the fundamental use cases where the discussed components are put together.

Automatic Mode. In the automatic mode of operation, the HAMLeT architecture is driven completely by hardware monitors. Memory access monitors determine a possibly inefficient address mapping where the physical access patterns do not utilize the DRAM parallelism and locality. Once the address remapping is determined by the hardware monitors, HAMLeT is configured directly, bypassing the Spiral rewrite engine. Along with the target address mapping scheme, the hardware monitors transfer the number of contiguous memory regions and their IDs to HAMLeT. Then the DRU performs the corresponding permutation to achieve the target address mapping scheme. This mechanism also collaborates with the NDP memory management system to ensure that it operates within allowed memory boundaries. Furthermore, the proposed batch scheduling allows both host access and reorganization to take place in parallel, while multi-stage reorganization with coherence partitioning reduces the coherence traffic in the memory.

Figure 6.7: Overall view of the fundamental use cases: from specification to in-memory reorganization. The figure shows the flow from a hardware specification in native SPL format, library routines (MKL, FFTW, ...), or hardware monitoring, through the SPIRAL rewrite engine and the NDP memory management interface (nmm_malloc, nmm_free, nmm_plan, nmm_execute), down to HAMLeT in memory.

Explicit Mode. Explicit mode, on the other hand, is driven by user-level function calls, either using the native interface of SPL or through high-level function wrappers. These operations are then passed to the Spiral rewrite engine. At this stage, the Spiral rewrite engine restructures SPL expressions to derive local permutations, memory access patterns and index transformations. Furthermore, it determines the corresponding hardware configuration parameters to program HAMLeT. Then the NDP memory management handles the explicit memory allocation and the offloading of the ARU/DRU configurations. With the CMB handling in-memory coherence and the ARU implementing hardware-based indirection, the explicit cache flush is removed from the offload operation.

Chapter 7

Evaluation

This chapter provides an experimental evaluation for the fundamental use cases of the 3D-stacked DRAM based implementation of the HAMLeT architecture. The chapter starts with the details regarding modeling and simulation of the 3D-stacked DRAM based hardware acceleration. Then, it focuses on the automatic and explicit use cases individually. For the automatic use case, software-transparent dynamic data reorganization, address remapping efficiency, design space details, multi-programmed workloads, as well as parallel data reorganization and host processor execution are discussed. For the explicit use case, it focuses on accelerating commonly used data reorganization routines selected from a widely adopted HPC software library, the Intel Math Kernel Library (MKL). The explicit use case evaluations also include important design choices such as the placement of the DRU unit and the overhead of offloading from software. The evaluations are concluded by hardware synthesis results to analyze the power and area cost of the fundamental components of the HAMLeT architecture.


Table 7.1: 3D-stacked DRAM low-level energy breakdown.

Parameter       Value (pJ/bit)   Reference
DRAM access     2 - 6            [38, 65, 111]
TSV transfer    0.02 - 0.11      [13, 38, 117]
Control units   1.6 - 2.2        [6, 65]
SERDES + link   0.54 - 5.3       [44, 65, 69, 78, 96]

7.1 Experimental Procedure

7.1.1 3D-stacked DRAM Modeling

We use a custom trace-driven, command-level simulator for the 3D-stacked DRAM. It features a simple CPU front-end similar to the USIMM DRAM simulator [35].

Unless noted otherwise, it models per-vault request queues and FR-FCFS scheduling. Low-level timing and energy parameters for the DRAM and TSVs are modeled using CACTI-3DD [38] and published numbers from the literature [13, 65]. Logic layer memory controller performance and power are estimated using McPAT [6]. Finally, the SerDes units and off-chip I/O links are modeled assuming a high-speed short link connection [44, 78, 96]. We assume 4 pJ/bit for the total energy consumption of the combined SerDes units and off-chip links at the 32 nm technology node. To demonstrate the relative cost of each operation in the 3D-stacked DRAM, we summarize a typical energy breakdown of various operations in Table 7.1. Note that these values change depending on the particular configuration of the memory.

Table 7.2 provides four memory configurations, namely high (HI), medium-high (MH), medium-low (ML) and low (LO), that will be used in our later simulations. For each configuration, it specifies the 3D-DRAM architecture parameters, i.e., the number of vaults (banks), layers, TSVs, links, etc. It also provides aggressive and conservative model estimations for the internal bandwidth and power consumption. These are the maximum bandwidth that the stack can reach internally when all the vaults/banks/layers are saturated, and the power consumption at that bandwidth, respectively.

Table 7.2: 3D-stacked DRAM configurations.

Parameter                          HI     MH     ML     LO
Vaults (#)                         16     8      4      2
Layers (#)                         8      4      4      2
Total TSVs (#)                     2048   2048   1024   512
Agsv. internal BW (GB/s)           860    710    360    90
Agsv. internal power (W)           37     30     15     4
Agsv. internal en. eff. (pJ/bit)   5.38   5.28   5.21   5.55
Cons. internal BW (GB/s)           790    520    280    65
Cons. internal power (W)           34     23     13     3
Cons. internal en. eff. (pJ/bit)   5.38   5.53   5.80   5.78
Links (#)                          8      8      7      1
Link BW (GB/s)                     60     40     40     40
External BW (GB/s)                 480    320    280    40
Power (W)                          45     30     25     4
External en. eff. (pJ/bit)         11.7   11.7   11.2   12.5
(Agsv. = aggressive, Cons. = conservative, en. eff. = energy efficiency, BW = bandwidth)

The 3D-stacked DRAM simulator is driven by the low-level timing and energy estimations from CACTI-3DD [38] and published numbers [13, 65]. The aggressive and conservative estimations are based on these low-level parameters selected from the available ranges in those numbers. Furthermore, Table 7.2 provides the bandwidth available off-chip through the high-speed links and the overall power consumption when these links are fully saturated. The DRAM page size is 1 KB for all configurations; this unusual page size is a typical value for 3D-DRAM (values from 256 bytes [65] to 2 KB [114] are reported). Table 7.2 shows that these configurations can reach an overall energy efficiency of 11-12 pJ/bit. Internal operation, on the other hand, bypasses the costly SerDes I/O, which both reduces the energy cost and increases the maximum reachable bandwidth. This leads to 5.2-5.8 pJ/bit energy efficiency for internal accesses based on the aggressive and conservative estimations. These evaluations agree with the published numbers for HMC, which achieves 10.48 pJ/bit overall energy efficiency where the DRAM layers spend only 3.7 pJ/bit [65]. Throughout the thesis, the conservative estimations are used unless noted otherwise. Note that the conservative estimations are in favor of the baseline systems and disadvantageous for the improvement estimations of our solutions.
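For a rough sanity check of these figures, the small calculation below adds up the per-bit costs from Table 7.1 for an internal access (DRAM array + TSV + control) and for an external access (adding the SerDes/link cost). The midpoint values chosen here are illustrative assumptions only; the simulator uses the full low-level models.

/* Back-of-the-envelope energy-per-bit estimate from Table 7.1 (midpoints
   picked for illustration; actual values depend on the configuration).   */
#include <stdio.h>

int main(void)
{
    double dram_access = 4.0;    /* pJ/bit, within the 2-6 range        */
    double tsv         = 0.06;   /* pJ/bit, within the 0.02-0.11 range  */
    double control     = 1.9;    /* pJ/bit, within the 1.6-2.2 range    */
    double serdes_link = 4.0;    /* pJ/bit, as assumed in Section 7.1.1 */

    double internal = dram_access + tsv + control;
    double external = internal + serdes_link;

    printf("internal access: ~%.1f pJ/bit\n", internal);
    printf("external access: ~%.1f pJ/bit\n", external);
    return 0;
}

The internal sum lands in the 5-6 pJ/bit ballpark reported above, and the SerDes/link term accounts for most of the gap between the internal and external energy efficiency in Table 7.2.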

7.2 Automatic Mode

7.2.1 Dynamic Data Reorganization Overview

First, we focus on the software-transparent dynamic data reorganization technique. As described in Section 5.1, this mechanism combines data reorganization and address remapping in hardware by utilizing the HAMLeT architecture integrated in the 3D-stacked DRAM. The HAMLeT architecture features the address remapping unit (ARU) and the data reorganization unit (DRU). The DRU is specialized for efficient permutation-based data reorganization and the ARU handles the address remapping for multiple memory regions. In the automatic mode of operation, the HAMLeT architecture is driven by a hardware-based memory access monitoring system that profiles the memory access patterns of applications. The monitoring determines the changes in consecutive DRAM addresses, or the bit-flip rate (BFR), as introduced before. Bits with a high BFR in the physical address should be mapped onto the parallelism bits in the DRAM address (i.e., bank, vault, layer) to improve parallelism. Low-BFR bits, on the other hand, should be mapped onto the locality bits (i.e., the row bits in the DRAM address) to minimize row buffer misses. Once the monitoring unit determines such an address remapping, it configures the ARU/DRU units to achieve the data reorganization and address remapping. This enables a software-transparent dynamic data layout operation, where both the profiling and the reorganization are handled in hardware.
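The bit-flip rate itself can be illustrated with a simple software model: XOR each DRAM address with the previous one and count, per bit position, how often that bit toggles within an epoch. The code below is a hypothetical model of this bookkeeping, not the monitoring hardware; the hardware version uses scaled fixed-point counters as discussed in Section 7.2.3.

/* Hypothetical software model of bit-flip rate (BFR) profiling. */
#include <stdint.h>

#define ADDR_BITS 32

typedef struct {
    uint64_t flips[ADDR_BITS];   /* per-bit toggle counts within the epoch */
    uint64_t accesses;           /* number of addresses observed           */
    uint64_t prev_addr;
} bfr_monitor_t;

void bfr_observe(bfr_monitor_t *m, uint64_t dram_addr)
{
    if (m->accesses > 0) {
        uint64_t toggled = dram_addr ^ m->prev_addr;
        for (int b = 0; b < ADDR_BITS; b++)
            if ((toggled >> b) & 1ULL)
                m->flips[b]++;
    }
    m->prev_addr = dram_addr;
    m->accesses++;
}

/* Normalized BFR of bit b at the end of an epoch (1.0 = flips every access). */
double bfr_value(const bfr_monitor_t *m, int b)
{
    return m->accesses > 1 ? (double)m->flips[b] / (double)(m->accesses - 1)
                           : 0.0;
}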

We evaluate the memory traces publicly available from the JILP Memory Scheduling Championship [14]. This trace repository includes memory traces from the PARSEC [32], SPEC CPU 2006 [63], and BioBench [25] trace suites, and some undisclosed commercial workloads. For the dynamic data reorganization experiments, we use the MH (medium-high) configuration for the 3D-stacked DRAM (see Table 7.2). The processor configuration is given in Table 7.3.

Table 7.3: Processor configuration.

Parameter               Value
Cores                   1-4 cores @ 4 GHz
ROB size                160
Issue width             4-32
Pipeline depth          10
Baseline address map    row:col:layer:vault:byte
Memory bus frequency    1.0 GHz
Memory scheduling       FR-FCFS, 96-entry queue

First, we focus on the effect of address remapping on the DRAM row buffer miss rate. The row buffer hit rate evaluation does not include the statistics of the accesses issued by the DRU to reorganize the dataset; it is based only on the host DRAM accesses. In this evaluation, we allow address remappings with up to 3 bit pair swaps to keep the data reorganization simple. First, the BFR value based stuck and stream detectors are checked. If no BFR value based anomaly is detected, then the BFR ratios are calculated: the highest BFR in the locality region is divided by the lowest BFR in the parallelism region to determine the highest BFR ratio. If the highest BFR ratio is smaller than the BFR ratio threshold, the BFR ratio calculation is stopped since the highest BFR ratio is already below the threshold. If the BFR ratio is larger than the threshold, the BFR ratio calculation continues with the second largest (smallest) BFR values from the locality (parallelism) bits until the BFR ratio falls below the threshold or the target bit pair swap goal is reached. The DRAM address stream monitoring is carried out in 50k-cycle epochs. If the detected inefficient BFR behavior is consistent for 4 epochs, a data reorganization is issued to change the address remapping. Before issuing the reorganization, the monitoring unit checks whether there are any other remapping candidates that are not yet validated for 4 epochs; in that case it delays the reorganization, aiming to include those candidates in the address remapping by consolidating them into a single data reorganization. Figure 7.1 shows the improvement in the DRAM row buffer hit-to-miss ratio with the dynamic data reorganization.

Figure 7.1: Row buffer miss rate reduction via physical address remapping with data reorganization.
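The bit pair selection step described above can be sketched as follows, reusing the hypothetical BFR model introduced earlier. The threshold and swap-limit values are parameters; the hardware performs the same comparisons on quantized counters rather than doubles.

/* Hypothetical sketch: pick bit pair swaps based on BFR ratios.            */
#include <stddef.h>

typedef struct { int bit; double bfr; } bit_bfr_t;
typedef struct { int locality_bit; int parallelism_bit; } bit_swap_t;

/* loc_desc[]: locality (row) bits sorted by descending BFR.
   par_asc[]:  parallelism (vault/layer) bits sorted by ascending BFR.      */
size_t select_swaps(const bit_bfr_t *loc_desc, size_t n_loc,
                    const bit_bfr_t *par_asc,  size_t n_par,
                    double ratio_threshold, size_t max_swaps,
                    bit_swap_t *out)
{
    size_t n = 0;
    while (n < max_swaps && n < n_loc && n < n_par) {
        double denom = par_asc[n].bfr > 0.0 ? par_asc[n].bfr : 1e-9;
        double ratio = loc_desc[n].bfr / denom;
        if (ratio < ratio_threshold)
            break;                             /* remaining pairs are balanced */
        out[n].locality_bit    = loc_desc[n].bit;
        out[n].parallelism_bit = par_asc[n].bit;
        n++;
    }
    return n;   /* number of bit pair swaps to include in the remapping */
}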

Remapping frequently flipping bits from the row address bits to the vault and layer address bits significantly reduces the miss rates. It also improves parallelism by scattering the accesses across vaults and layers more efficiently. This translates into improved performance and energy efficiency, as shown in Figure 7.2. This evaluation uses the BoR (block on reorganization) mechanism, where the host accesses are blocked while the data is under reorganization. We observe that the more memory intensive applications (e.g., leslie, libquantum) gain higher improvements since their performance is more sensitive to the memory access performance. To analyze the sensitivity of the performance improvement to the memory intensity, we also evaluate wider issue width processor configurations (issue widths 4-32) in Figure 7.2. A wider issue width improves the instruction throughput and memory intensity. With the increased memory intensity, we observe that the workloads achieve even higher improvements, both in terms of energy and throughput.

Figure 7.2: Performance and energy improvements via physical address remapping with data reorganization.

Efficient transparent data reorganization requires several design choices, including the BFR threshold, epoch length, number of memory regions, handling of host accesses during reorganization, etc. The results provided for the dynamic data reorganization give an overview of the possible improvements with this technique. Next, we focus on the details of this mechanism and evaluate the critical design choices.

7.2.2 Address Remapping: Manual Search vs. Hardware Monitoring

There are as many address mappings as there are bit permutations. However, finding the optimal address mapping can be very difficult. To give an overview of this problem and also of the potential, Figure 7.3 demonstrates the improvement achieved by address remappings where only a single bit pair is swapped. Note that the processor is 32-wide in this experiment to observe the memory effects more dramatically. We observe that there exist several schemes, reachable by changing the locations of only two bits in the address mapping, that lead to performance improvements. However, there are many more address remappings that perform much worse than the baseline row interleaving.

Figure 7.3: Improvements with address remappings where only a single bit pair is swapped.

BFR based address stream profiling provides an estimate of a more efficient address mapping. This subsection evaluates the effectiveness of the BFR based monitoring. In this evaluation, first a subset of candidate address mappings is determined. These candidates are only a filtered subset of all bit permutations, as the entire design space is very large (assuming 32-bit addresses and 64-byte cache lines, there are 26! different possible address mappings). To determine the candidates, first the whole trace is simulated and the groups of bits with the highest/lowest BFR values are determined. Then, bits are paired from the high-BFR and low-BFR groups to create bit pairs with a high BFR ratio, where the high-BFR bit is among the locality bits (DRAM row address) and the low-BFR bit is among the parallelism bits (vault, layer). The bit pairs include not only the highest BFR ratio pairs but also relatively lower BFR ratio pairs and randomly selected pairs from these groups. Each address remapping is formed out of up to 6 such bit pair swaps. For each benchmark, we simulate around 50 different address remappings following this procedure. Figure 7.4(a) demonstrates the speedup values achieved with these address remappings.

Figure 7.4: Comparing the best address remapping schemes found by manual search and determined by hardware monitoring. (a) Manually searching filtered remapping candidates; (b) manually found best remapping vs. the remapping automatically found in hardware.

There is a lot of variation in the achieved improvement across the different address remapping candidates, as seen in Figure 7.4(a). We pick the best performing address remapping found by the manual search procedure for each benchmark and compare it to the address remapping automatically found by the BFR based monitoring in Figure 7.4(b). The BFR based monitoring reaches the same improvement for some of the benchmarks (e.g., leslie, libquantum, facesim). For some of the benchmarks, however, it cannot find remappings that perform as well as the manually found ones (blackscholes, ferret, comm2). And for some benchmarks neither approach can find any useful remapping (e.g., canneal, comm1). BFR based monitoring usually performs well if there is a stride pattern that increases the BFR of certain bits; for example, leslie, libquantum, and facesim have relatively regular access patterns. As long as the problematic bit mappings are identified, the potential performance improvement can be mined efficiently. Interestingly, for some of the benchmarks the best performing address remappings include the randomly formed bit pairs as well; BFR based monitoring cannot capture such cases. Nevertheless, hardware monitoring that uses the BFR metric performs on average within 5% of the manually found best remapping, where the maximum deviation is 20%.

Figure 7.5: Selecting an efficient BFR ratio threshold.

7.2.3 Tuning the DRU Design Parameters

BFR Threshold. There are three main threshold values used to determine an inefficient address remapping. The BFR value based stuck and stream detectors compare the BFR of individual bits to a threshold value. The stuck and stream thresholds are initially set to very small and very large values, respectively. Since the BFR values are normalized every epoch, they take values between 0 and 1.0. Note that, to avoid floating point arithmetic and to minimize the hardware bookkeeping when calculating the statistics, they are scaled and quantized into 16-bit registers. Initially, the stream and stuck thresholds are set within 10% of the maximum and minimum values (0.9 and 0.1). Later, these values are adjusted based on the highest and lowest observed BFR. A bit is marked as a stuck candidate when its BFR is smaller than the smallest BFR observed so far or smaller than the initial hard threshold (and the opposite for a stream candidate).
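A possible fixed-point formulation of this bookkeeping is sketched below: the per-bit flip counts are scaled into a 16-bit range at the end of each epoch and compared against quantized stuck/stream thresholds. The scaling scheme and constants are assumptions chosen only to illustrate how floating point arithmetic can be avoided.

/* Hypothetical fixed-point BFR quantization (1.0 is represented as 0xFFFF). */
#include <stdint.h>

static inline uint16_t bfr_quantize(uint64_t flips, uint64_t accesses)
{
    if (accesses <= 1)
        return 0;
    /* scale flips/(accesses-1) into [0, 0xFFFF] without floating point */
    return (uint16_t)((flips * 0xFFFFULL) / (accesses - 1));
}

/* Initial hard thresholds: within 10% of the min/max representable values. */
static const uint16_t STUCK_THRESH  = 0x1999;   /* ~0.1 */
static const uint16_t STREAM_THRESH = 0xE666;   /* ~0.9 */

static inline int is_stuck_candidate(uint16_t bfr_q, uint16_t min_seen)
{
    return bfr_q < STUCK_THRESH || bfr_q < min_seen;
}

static inline int is_stream_candidate(uint16_t bfr_q, uint16_t max_seen)
{
    return bfr_q > STREAM_THRESH || bfr_q > max_seen;
}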

To improve robustness, the BFR ratio based profiling compares the BFR of individual bits from the parallelism and locality parts of the DRAM address. As described previously, the highest BFR in the locality region is divided by the lowest BFR in the parallelism region to determine the highest BFR ratio. If the highest BFR ratio is smaller than the BFR ratio threshold, the BFR ratio calculation is stopped. If the BFR ratio is larger than the threshold, the calculation continues with the second largest (smallest) BFR values from the locality (parallelism) bits until the BFR ratio falls below the threshold or the target bit pair swap goal is reached. Figure 7.5 demonstrates the effect of the BFR ratio threshold on the application speedup. A small BFR ratio threshold can detect even a slight imbalance between the parallelism and locality regions of the DRAM address. If an inefficient mapping is detected and fixed early, performance improves greatly; however, a small threshold can also select address remappings too early that may not be beneficial in the long run. Towards the left side of Figure 7.5, we observe that some applications suffer performance degradation with small threshold values. A large threshold, on the other hand, improves robustness by allowing remappings only when the BFR ratio of a bit pair is very large. This reduces the chance of inefficient remappings, but a very large threshold can miss some useful remappings. Threshold values between 1.5 and 3 seem to be the sweet spot for the evaluated benchmarks. We pick a threshold of 2, aiming to minimize early inefficient remappings while maximizing the average speedup.

Epoch Length. Hardware monitoring determines the statistics during every epoch, and at the end of each epoch it makes the decisions regarding changing the address mapping and data layout. The epoch based approach simplifies the hardware and makes the statistics calculation more tractable. The epoch length needs to be long enough to capture sufficient statistics and to avoid frequent calculation of the histograms. However, if it is extremely long, then the reorganization may not be issued in a timely manner.

Figure 7.6 demonstrates the effect of the epoch length on the overall application speedup for a set of workloads. We observe that with very small and very large epochs, the overall speedup decreases. However, there is flexibility in picking an efficient epoch length: epochs from 20k to 100k cycles seem to provide the best performance. For our analysis, we chose 50k-cycle epochs. Furthermore, the statistics are kept for 4 epochs in a moving-window fashion.

Figure 7.6: Selecting an efficient epoch length.

Figure 7.7: Effect of the number of memory partitions (1 to 4096) on the speedup for single program workloads (normalized to no partitioning).

Number of Memory Partitions. As described in Chapter 4, the memory address space is partitioned into memory regions. The ARU supports implementing a different address mapping for each partition. Having multiple memory partitions allows capturing spatial differences in the memory access patterns and data layout requirements of different applications that are mapped into different regions of the memory. Moreover, even a single application exhibits varying memory access behavior when accessing data structures that are stored in different parts of the memory. Hence, with multiple independent memory partitions, the whole memory space can have different address mappings optimized for the corresponding access patterns to the individual partitions. However, as the size of a memory partition shrinks, it becomes difficult to identify the data reorganizations per partition: with smaller regions there are fewer overall requests to a partition. Furthermore, since data reorganization across partitions is constrained, the improvements with data reorganization become limited.

First, Figure 7.7 demonstrates the effect of an increasing number of memory partitions on the speedup for different workloads. Note that the improvements are normalized to the no-partitioning case. For some of the workloads, such as facesim, increasing the number of partitions does not provide a significant performance improvement. In general, increasing the number of partitions improves the speedup. However, a very large number of partitions decreases the improvement due to the limitations discussed above. Yet, some workloads such as blackscholes and swaptions still gain improved performance with a large number of partitions. Figure 7.8 demonstrates the overall average trend for this set of workloads. Overall, partitioning with 4 to 64 memory regions provides a flexible range to implement efficient data reorganizations.

Figure 7.8: Selecting an efficient partitioning scheme.

7.2.4 Multi-program Workloads

Until now we have focused on single-program workloads. In a multi-programmed execution, the memory access streams of multiple workloads are interleaved as they share the memory and memory controllers. When the memory access streams are interleaved, the individual access patterns are disturbed, which makes tracking the individual workloads difficult. Though the core/thread/workload information for each memory access can be propagated to the monitoring hardware, this increases the hardware complexity and ties the data reorganization hardware to a specific host architecture.

Instead, our hardware monitoring does not distinguish whether the memory access stream is coming from a single-program workload or a multi-program workload. In order to perform efficient data reorganization for multi-programmed workloads, it leverages two important observations. First, we observe that when two workloads that require different address remapping schemes for the best performance are combined, the resulting address stream generally requires a third, different address remapping scheme for the best performance. Hence, instead of trying to isolate the different workloads, optimizing for the new access stream formed by the interleaving can lead to improvement in the overall performance. Second, when multiple applications are executed together, their accesses are interleaved in time at a fine granularity, but their spatial characteristics are mainly preserved. Each workload works on a separate memory space assuming there is no data sharing between different programs. We also assume a virtual memory system similar to the one proposed in [29], where virtually contiguous large memory spaces are directly mapped to contiguous physical space. Hence, the memory partitioning implemented in the hardware monitors can capture the spatial differences in the memory access patterns of different workloads. This enables multiple address mappings for different memory regions that correspond to different workloads, further improving the memory performance for the multi-programmed workloads. Figure 7.9 shows the overall speedup with memory partitioning for (a) 2-programmed and (b) 4-programmed workloads. The full list of workload combinations is given in Table 7.4. Overall, we observe increased performance with multiple partitions. In particular, 32 partitions provide 8% and 9% performance improvement on average for the 2-programmed and 4-programmed workloads, respectively.

Table 7.4: Multi-programmed workloads.

Name    2-program        Name    4-program
w2.01   black face       w4.01   black face libq fluid
w2.02   fluid leslie     w4.02   face leslie stream ferret
w2.03   freq libq        w4.03   freq libq leslie stream
w2.04   stream ferret    w4.04   stream ferret swapt leslie
w2.05   swapt leslie     w4.05   swapt leslie swapt leslie
w2.06   libq leslie      w4.06   libq leslie libq face
w2.07   libq face        w4.07   libq fluid libq face
w2.08   black freq       w4.08   black freq ferret swapt
w2.09   ferret swapt     w4.09   ferret fluid black face
w2.10   libq fluid       w4.10   libq swapt libq fluid
w2.11   leslie stream    w4.11   leslie stream ferret freq

Figure 7.9: Multi-programmed workloads with partitioned memory space. (a) Speedup for multi-program workloads, 2 programs, 1-64 partitions (w2.01-w2.11); (b) speedup for multi-program workloads, 4 programs, 1-64 partitions (w4.01-w4.11).

7.2.5 Host Access and Data Reorganization in Parallel

Until now we assumed the BoR (block on reorganization) mechanism for performing data reorganizations (see Section 6.2). In BoR, memory accesses coming from the host processor are blocked to minimize the interference and to avoid side issues such as coherence and memory access scheduling. BoR is a simple mechanism; however, during the data reorganization the host processor can observe a significant slowdown. Although this slowdown is amortized over the long runtime by the optimized memory access after the reorganization, for some latency-critical applications it can be intolerable. To tackle this issue, AoR (access on reorganization) allows host accesses and data reorganization accesses to be served in parallel. Two main issues arise in the AoR mechanism: (i) coherence between the host and DRU accesses, and (ii) efficient scheduling of the accesses coming from the DRU and the host processor. Section 6.2 provides several mechanisms to address these issues. In this section, we evaluate the tradeoffs and effectiveness of these mechanisms.

To support efficient AoR, the memory controllers are extended with the batch scheduler as discussed previously. The batch scheduler forms batches of accesses coming from the regular host queues and the data reorganization request queues. The sizes of these batches determine the interleaving granularity, and hence the allocated priority for the sources. In our evaluations we use three batching configurations that assign low, medium and high priority to the host requests. These three main configurations are labelled AoR (L), AoR (M) and AoR (H). As we will see later, changing the batch configuration provides a very flexible knob to allocate memory bandwidth to the different sources (i.e., host and DRU).

Figure 7.10: Improvements in the number of in-memory coherence updates via merging and multi-stage reorganization schemes for different host priority levels.

First we evaluate the coherence issue for the AoR mechanism. During AoR, where a source region is reorganized into a target region in the memory, the source region serves the host accesses until the entire region is reorganized. Then the source addresses are remapped by the ARU such that they point to the target region. Data reorganization is an atomic operation in the sense that the reorganized data becomes available to the host only when the entire reorganization is finished. Hence, in the baseline scenario, any write access to the region under reorganization generates a coherence update in the memory to update the data in both the source and the target partitions until the reorganization is fully complete. These updates are handled by the coherence merge buffer (CMB) as discussed previously. The CMB merges multiple requests going to the same address. It also drops an outstanding coherence update if a data reorganization request with the same target address propagates from source to target carrying the latest value. Finally, the multi-stage reorganization mechanism breaks the atomicity into smaller blocks to further alleviate the coherence updates during AoR.

Details of these mechanisms are provided in Section 6.2. Figure 7.10 provides the analysis of these techniques, namely merge and multi-stage, over the baseline AoR case. First we observe that, as the priority of the host accesses increases (i.e., larger batches for host requests), the total number of coherence updates increases. For AoR (H), the total number of coherence updates is significantly larger than for AoR (L). This is mainly due to the fact that when the host accesses are prioritized, the reorganization takes longer to finish. Until the reorganization is completely finished, any host write access generates a coherence update to the target region, which increases the reorganization duration. Nevertheless, the CMB provides a modest reduction in the number of coherence updates by merging and dropping coherence updates when possible. Furthermore, multi-stage reorganization breaks the atomicity into smaller blocks and handles the coherence updates very efficiently, which leads to a substantial drop in the total number of coherence updates (see Figure 7.10).

Figure 7.11: Reorganization duration comparison for BoR and AoR. Both the baseline AoR and multi-stage reorganization mechanisms are shown.

Handling the memory access scheduling and the interference is also a major component of achieving efficient parallel host and DRU access. Figure 7.11 provides the overall duration of the data reorganization for the BoR and AoR schemes. BoR gives the entire DRAM resources to the DRU by blocking the host accesses; hence, BoR achieves the fastest data reorganization. When the host accesses have low priority, in the AoR (L) mode, AoR can achieve a performance very close to BoR. However, when the host has high priority, the AoR scheme takes much longer than BoR due to both the coherence updates and the memory interference. The coherence updates are significantly reduced with the multi-stage reorganization, which also improves the performance of the AoR data reorganization, especially in AoR (H). Furthermore, batch scheduling minimizes the interference between the host and the DRU by preserving their memory access patterns within batches. However, since the DRAM bandwidth is shared between the host and the DRU, the reorganization takes longer compared to BoR. Although it takes longer to reorganize, the host accesses are serviced during the reorganization.

In order to provide more insights on BoR and AoR operation with various interleaving granularities, we focus on the details of the host and DRU bandwidth utilization during the data reorganization.

Figure 7.12 and Figure 7.13 provide the bandwidth utilization for the benchmarks blackscholes and facesim during the baseline execution and during BoR based and AoR based data reorganization. Both the BoR and AoR schemes start with a bandwidth utilization equal to the baseline execution. During the beginning of the execution, the memory access stream is monitored for a possible address remapping and data reorganization; this portion can be considered a training phase. Then the hardware monitor issues a data reorganization. For the BoR scheme, the entire DRAM is allocated for data reorganization, which immediately drops the host bandwidth utilization to zero, as shown in the instantaneous bandwidth utilization in Figure 7.12 and Figure 7.13. Then, when the data reorganization is finished, the memory controller resumes servicing the host accesses. After the data reorganization, the host utilizes the bandwidth much more efficiently, which quickly amortizes the cost of the data reorganization.

As seen in the average bandwidth utilizations in both Figure 7.12 and Figure 7.13, the BoR bandwidth utilization drops significantly during reorganization but crosses the baseline bandwidth utilization soon after the reorganization.

One shortcoming of BoR is the substantial drop in the host bandwidth during data reorganization. AoR allows host accesses to be serviced during the data reorganization. Figure 7.12 and Figure 7.13 show three different priority levels (low, medium and high) for the host accesses during the data reorganization. When the host priority is high, i.e., AoR (H), the host does not experience a significant drop in bandwidth utilization. However, in this case, since the host accesses are prioritized, the data reorganization takes much longer. During the AoR operation, both the DRU and the host utilize lower bandwidth compared to when they are running alone, which delays and limits the bandwidth utilization boost. Different priorities for the host/reorganization accesses allow a flexible control to trade off between the reorganization duration, the host bandwidth drop, and the overall speedup. Next, we provide a detailed analysis of the effect of AoR on the maximum observed slowdown during the reorganization and on the overall speedup.

Figure 7.12: Instantaneous and average bandwidth utilization during baseline, BoR and AoR (low, medium and high host priority) for blackscholes.

Figure 7.13: Instantaneous and average bandwidth utilization during baseline, BoR and AoR (low, medium and high host priority) for facesim.

Figure 7.14: Maximum slowdown during reorganization, and the overall speedup for the BoR and AoR mechanisms. (a) Maximum observed slowdown during reorganization (lower is better); (b) overall speedup with reorganization for BoR and AoR (higher is better).

Figure 7.14(a) provides the maximum slowdowns observed during the data reorganization phase for various workloads. The maximum slowdown is calculated as the drop in the average throughput during the data reorganization. BoR operation leads to the largest slowdown during data reorganization since the host accesses are blocked. On the other hand, AoR operation allows the host accesses to be serviced during the data reorganization, which mitigates the host slowdown. As shown in Figure 7.14(a), BoR leads to a 45% slowdown on average, whereas AoR reduces this to as little as a 10% slowdown during the reorganization. AoR provides a minimum service guarantee (low, medium and high) to the host processor during the data reorganization, and a reduced maximum slowdown during the reorganization implies an improved quality of service.

The overall speedup with the BoR and AoR approaches for these workloads is given in Figure 7.14(b). We observe that on average BoR and AoR (L) perform very close to each other, both around 41%. Moreover, AoR (L) incurs only a 33% maximum slowdown during reorganization whereas BoR has a 44% maximum slowdown. We also observe that AoR (H) improves the overall performance by 34% on average, which is only 7% less than the BoR operation. However, AoR (H) causes only an 11% maximum slowdown during data reorganization, as opposed to the 44% slowdown for BoR.

7.3 Explicit Mode

As evaluated previously, the automatic mode enables a software-transparent technique to assist the host processor by reorganizing the data at runtime, resulting in more efficient memory access. The 3D-stacked integration of the HAMLeT architecture allows fast, energy-efficient and low-overhead data reorganization completely in memory. The HAMLeT architecture integrated in the 3D-stacked DRAM can also be utilized to reorganize data in memory via an explicit offload from software. This section evaluates the performance of the explicitly accelerated data reorganization routines. It first focuses on common reorganization routines selected from the Intel Math Kernel Library. Then it evaluates the overhead of the software offload operation. Finally, it discusses the effectiveness of the HAMLeT unit placed in memory versus on chip.

7.3.1 Accelerating Common Data Reorganization Routines

First we focus on the explicitly accelerated reorganization routines. The list of the reorganization benchmarks is given in Table 7.5. These are commonly used reorganization routines selected from the Intel Math Kernel Library (MKL) [5]. The benchmarks include in-place and out-of-place matrix transpose, vector pack/unpack via increment, scatter/gather indexing, and vector swap with varying stride. The dataset size for these benchmarks ranges from 4 MB to 1 GB, the stride of the accesses ranges from 1 to 8192 elements, and the data precision is 32-bit float.

As a reference, we also report high-performance implementations on CPU and GPU systems. Multi-threaded implementations of the MKL routines are compiled using Intel ICC version 14.0.3 and run on an Intel i7-4770K (Haswell) machine using all of the cores/threads. GPUs provide substantial memory bandwidth, which is crucial for high-performance data reorganization. As a GPU reference we use a modified version of the implementation from Nvidia [103] using CUDA 5.5 on a GTX 780 (Kepler) platform. However, due to the limited memory size of the GPU, we could not run a few of the benchmarks with large dataset sizes (≥ 1 GB).

Table 7.5: Benchmark summary.

Number    MKL function   Description
101-108   simatcopy      In-place matrix transpose
201-210   somatcopy      Out-of-place matrix transpose
301-308   vs(un)packi    Vector (un)pack via increment
309-316   vs(un)packv    Vector (un)pack via gather
401-404   cblas_sswap    Vector swap via stride
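For context, the sketch below shows how one of these routines, an out-of-place single-precision transpose (benchmark group 201-210), is typically invoked on the CPU through MKL. The argument order shown for mkl_somatcopy is our assumption of the interface and should be checked against the MKL documentation when reproducing the setup.

/* Sketch: out-of-place 32-bit float transpose via MKL.
   The mkl_somatcopy argument order below is an assumption for illustration. */
#include <stdlib.h>
#include <mkl.h>    /* umbrella MKL header */

void transpose_example(size_t rows, size_t cols)
{
    float *A = (float *)malloc(rows * cols * sizeof(float));
    float *B = (float *)malloc(rows * cols * sizeof(float));

    /* ... fill A in row-major order ... */

    /* 'R' = row-major ordering, 'T' = transpose, alpha = 1.0f */
    mkl_somatcopy('R', 'T', rows, cols, 1.0f,
                  A, cols,    /* source and its leading dimension      */
                  B, rows);   /* destination and its leading dimension */

    free(A);
    free(B);
}

On HAMLeT, the same operation is instead expressed as an SPL permutation and offloaded to the DRU, so no such per-element traversal runs on the host.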

The benchmarks in Table 7.5 are expressed in SPL and implemented on HAMLeT, where the DRU performs the data reorganization. The energy and runtime of the operations implemented on the DRU are simulated using the 3D-stacked DRAM simulator. The 3D-stacked DRAM models are based on the conservative estimations from Table 7.2. The reported numbers also include the energy overhead of the HAMLeT implementation in the logic layer, which will be analyzed later in detail. For the CPU, we use PAPI to measure the performance and power consumption of the processor as well as the DRAM via the Running Average Power Limit (RAPL) interface [9, 17]. Finally, the GPU power consumption is measured on the actual board via inductive current probes using a PCI riser card. The results comparing the CPU, GPU and DRU performance in terms of throughput and energy efficiency are given in Figure 7.15 and Figure 7.16, respectively. Note that these results do not include the host offload overhead for either the GPU or the DRU; here we report the results only for the individual platforms. We evaluate the host offload overhead for the DRU later in detail.

It is observed that the DRU integrated in the 3D-stacked DRAM can provide substantial perfor- mance and energy efficiency improvements when compared to the optimized implementations on the state-of-the-art CPUs and GPUs. DRU integrated in 3D-stacked DRAM benefits not only from the eliminated off-chip roundtrip data movement and the parallel streaming data reorganization ar- chitecture, but also from the high available bandwidth. Next, we provide further analysis to under- Chapter 7. Evaluation 117

Figure 7.15: Performance (reshape throughput [GB/s]) of the 3D-stacked DRAM based DRU (HAMLeT, MH) compared to optimized implementations on the CPU (i7-4770K) and GPU (GTX 780).

Figure 7.16: Energy efficiency [GB/J] of the 3D-stacked DRAM based DRU (HAMLeT, MH) compared to optimized implementations on the CPU (i7-4770K) and GPU (GTX 780).

Evaluations in Figure 7.15 and Figure 7.16 use the medium-high (MH) 3D-stacked DRAM configuration. The bandwidth provided by the MH configuration (i.e., 320 GB/s external, 520 GB/s internal) is much higher than what is available to the CPU and the GPU. We would therefore also like to evaluate the case where the DRU is given the same level of memory bandwidth as the CPU and the GPU.

We scale the DRU design down to the level of the CPU and the GPU individually, and provide individual comparisons where the bandwidth of the 3D-stacked DRAM is very close to each individual platform's bandwidth.

Figure 7.17: DRU (HAMLeT) in the LO configuration compared to the CPU (i7-4770K): (a) reshape throughput [GB/s], (b) energy efficiency [GB/J].

Figure 7.18: DRU (HAMLeT) in the ML configuration compared to the GPU (GTX 780): (a) reshape throughput [GB/s], (b) energy efficiency [GB/J].

Figure 7.17 demonstrates the case where the DRU integrated in the LO configuration (single link, 40 GB/s external, 65 GB/s internal) is compared against the CPU (two channels, 25.6 GB/s). Figure 7.18 compares the DRU integrated in the ML configuration (7 links, 280 GB/s both internal and external) with the GPU (288 GB/s). Given the same level of bandwidth, in other words the same achievable peak throughput, the DRU provides 6.5x/2x better throughput than the CPU/GPU. Furthermore, it provides 95x/25x better energy efficiency than the CPU/GPU.

Note that we compare the three platforms when executing only the data reorganizations. While the data is being reorganized, other programs running on the CPU/GPU will observe substantial slowdowns. The DRU integrated in 3D-stacked DRAM, in contrast, enables a parallel data reorganization completely in memory to assist the host processor; hence, it has a minimal slowdown effect on the other programs running on the host processor. We emphasize that the 3D-stacked DRAM based DRU is not an alternative to the CPU and GPU. Instead, it can be integrated into their memory subsystem to unlock a substantially higher-throughput and more energy-efficient data reorganization capability.

3D-stacked DRAM technology provides substantial bandwidth at a low power consumption. Naturally, the DRU depends on these advantageous characteristics of the 3D-stacked DRAM; what is unique is how they are exploited. Pursuing a simple processing mechanism, data reorganization, with a specialized parallel architecture allows us to achieve high throughput within a very low area and power budget. This enables integration within the logic layer of a 3D-stacked DRAM, which unlocks the substantial internal bandwidth and energy-efficient data access capabilities. Furthermore, the dedicated parallel DRU architecture can utilize these capabilities very efficiently, operating at 94-99% of the streaming bandwidth with very low hardware overhead, as we will analyze later.

7.3.2 Near-memory vs. On-chip

Integration within the stack, behind the conventional interface, opens up internal resources such as abundant bandwidth and parallelism. We also evaluate the effect of moving the accelerator within the stack to exploit these resources.

Figure 7.19: Performance and energy improvement of the in-memory accelerator over an on-chip, memory controller based DMA, for the HI, MH, ML, and LO configurations.

For this purpose, we compare the in-memory accelerator to a CPU-side accelerator that is integrated on-chip as a DMA engine. When the accelerator is integrated on-chip at the memory controller as a DMA engine, it uses the limited off-chip bus and faces large latency and energy overheads. For the in-memory integration, on the other hand, the DRU accesses the memory directly from the logic layer, bypassing the external interface. Figure 7.19 shows up to 2.2x throughput and 2.3x energy improvements for the in-memory accelerator compared to the on-chip DMA accelerator.

7.3.3 Offload and Reconfiguration Overhead

As discussed in Section 6.3, a custom software stack for memory management handles offloading the processing to HAMLeT, where the DRU is responsible for the data reorganization and the ARU for address remapping. A baseline offload mechanism, such as the one proposed in [59], requires the transfer of the DRU/ARU configurations to the corresponding memory region and a cache flush operation to ensure coherence. This software mechanism is proposed for generic near-data processing, where the accelerator requires the most recent copy of the data for correct operation. As discussed previously in Section 6.3, for data reorganization in memory, HAMLeT does not require the most recent copy to exist in the memory. It handles coherence in the memory by using a series of mechanisms such as multi-stage reorganization and the coherence merge buffer (CMB). Furthermore, it allows incoherence between the host caches and the DRAM until the dirty cache lines from the reorganized region are written back to the DRAM. Both during and after the reorganization, the CMB controller propagates the written-back dirty cache lines to the correct addresses.

Figure 7.20: Offload latency and energy fraction in the overall operation (host offload vs. DRU runtime/energy) for the baseline (base) and coherence-optimized (cmb) cases, for dataset sizes from 4 MB to 1 GB.

Once all the dirty lines are written back, the host caches and the DRAM become coherent again. For write-through caches, on the other hand, this issue never arises.

To give better insight into the overheads of the baseline offload mechanism and the coherence-optimized offload mechanism that uses the CMB, we focus on the time and energy spent on the host during the offload operation. For this experiment we focus on a subset of the benchmarks (101–210) with varying dataset sizes (4 MB to 1 GB), targeting the MH configuration from Table 7.2.

Figure 7.20 provides the fraction of the latency and energy of the offload operation within the entire data reorganization both for baseline (base) and coherence optimized (cmb) mechanisms.

For the baseline mechanism, we observe that for small datasets (< 64 MB) the offload operation takes much longer than the accelerated operation itself. For these small sizes, a large portion of the dataset is actually stored in the host-side caches, which makes them more suitable for host-side execution. For large datasets, in contrast, the offload overhead is amortized over the long data reorganization; hence these large datasets can mostly benefit from the high-throughput and energy-efficient processing. Nevertheless, flushing the caches constitutes a large portion of the overall offload operation for the baseline mechanism. As discussed previously, for the coherence-optimized mechanism this overhead is eliminated. Hence, even for small datasets the offload operation is very lightweight, as shown in Figure 7.20.

7.4 Hardware Synthesis

The HAMLeT architecture introduces the ARU and DRU units in the logic layer of the 3D-stacked DRAM. These units increase the hardware complexity in the logic layer. In this section we provide a hardware synthesis based evaluation to analyze the area and power consumption cost of the introduced components. First, we focus on the reconfigurable permutation memory that is at the center of the DRU, performing the local permutations. Then, we analyze the overall area and power consumption overhead in the logic layer. This section demonstrates that, with a small hardware overhead, HAMLeT can unlock substantial throughput and energy efficiency for data reorganization in memory.

7.4.1 Reconfigurable Permutation Memory

The reconfigurable permutation memory (RPM) is a generalized version of the permutation memories proposed in [99], which are used in Spiral. Spiral-generated permutation memories are specifically optimized for the target permutation. During the hardware compilation, Spiral determines the optimal number of switch stages and the interconnection between them, and then connects the optimized address generators to the switch network. This creates highly efficient and compact hardware optimized for the specified problem; however, it generates a separate permutation memory per problem. With the RPM, instead, the goal is to generate permutation memory configurations and program a fixed hardware to perform the specified problem.

In order to support generalized streaming permutations, whose bit representations are affine transformations on the bits, the RPM extends the streaming permutation memories proposed in [99] via two main approaches. First, it extends the switch stages to a rearrangeably non-blocking Clos network implemented as a multistage Benes network. Second, it introduces programmable switch connection and address generator units. Put together, the RPM provides a reconfigurable substrate that can perform any permutation whose bit representation is an affine transformation. The maximum throughput and the maximum permutation size are limited by the target frequency, number of ports (streaming width), and total SRAM buffer size.

Figure 7.21: Switch network and connection control hardware complexity for Spiral-generated and reconfigurable permutation memories, for 8, 16, and 32 ports and permutation complexities from basic to any (area normalized to an 8-port basic permutation; lower is better).

Figure 7.21 evaluates the hardware cost of the local permutation unit implemented as a Spiral-generated permutation memory and as the reconfigurable permutation memory (RPM). Note that the SRAM buffer sizes are the same for both the Spiral-generated fixed permutation memory and the RPM; hence, the hardware cost shown in Figure 7.21 only includes the switch network and connection control units. Figure 7.21 covers different degrees of parallelism (i.e., number of ports) and permutation complexities. Spiral optimizes the switch networks during the hardware compilation, so simple permutations lead to simple hardware, as shown in Figure 7.21. As the permutation complexity increases, the switch network and the connection control become more complicated, which increases the hardware complexity. However, with a modest increase in switch network and connection control complexity, the RPM can perform permutations of any complexity with a fixed hardware.

7.4.2 Overall System

We also present a hardware cost analysis for HAMLeT implemented in a 32 nm technology node. We synthesize the HDL implementation targeting a commercial 32 nm standard cell library (typical corner, 0.9 V) using Synopsys Design Compiler, following a standard ASIC synthesis flow. We use CACTI [2] to model the SRAM blocks.

Table 7.6: HAMLeT power consumption overhead. HDL synthesis at 32nm.

                          HI          MH          ML          LO
DMA                       16.0 mW     11.9 mW     9.9 mW      8.9 mW
ARU                       6.3 mW      3.5 mW      2.1 mW      1.4 mW
DRU (Switch + control)    22.4 mW     16.6 mW     11.2 mW     8.5 mW
DRU (SRAM)                291.9 mW    179.9 mW    89.9 mW     45.0 mW
HAMLeT TOTAL              336.5 mW    211.9 mW    113.1 mW    63.7 mW
3D-DRAM power envelope    45 W        30 W        25 W        4 W

Detailed ARU and DRU synthesis results for the example HAMLeT unit integrated in the MH configuration show that they reach 238 ps and 206 ps critical path delays, respectively (i.e., > 4 GHz operation). However, we chose a 2 GHz clock frequency to reduce the power consumption. The overall power consumption analysis results are given in Table 7.6.

The overall power consumption overhead ranges from 64 to 337 mW, including leakage and dynamic power. Table 7.6 also gives the maximum power consumption of the 3D-stacked DRAM configurations at their peak bandwidth utilization. The overhead coming from HAMLeT corresponds to only a small fraction of the power envelope of these 3D-stacked DRAM configurations.

Chapter 8

Co-optimizing Compute and Memory Access

The discussion so far has focused mainly on efficient memory access. As described in Chapter 5, the automatic mode of operation focuses on transparent data reorganization agnostic to the actual computation, whereas the explicit mode transforms the data layout into an efficient format for improved memory access. Both of these techniques exploit a separation of concerns, where memory accesses and the actual computation are handled independently. However, for certain types of problems this separation is very difficult due to the interdependent nature of the two. Achieving high performance and energy efficiency then requires co-optimizing the computation and the memory access. In this chapter, we analyze an example of such a problem: the fast Fourier transform (FFT).

Due to the intrinsic properties of the FFT, computation and memory accesses are tightly coupled to each other. There exist several alternative algorithms that restructure the compute flow to exploit the parallelism, locality, regularity, or symmetry of the computation stages, where different alternatives exhibit varying memory access behavior [53]. Achieving optimal performance requires an algorithmic co-optimization of the memory access and computation dataflow targeting an architectural machine model. This chapter demonstrates an optimization framework for large-size FFTs focusing on the whole algorithm, including memory access and computation. The tensor algebra based mathematical framework, which is utilized in this thesis for expressing data reorganization, address mapping, memory access patterns and data layout, can also represent FFT algorithms.

Hence, we will use the proposed mathematical language as a common abstraction to capture the compute flow and memory access optimizations focusing on FFT algorithms. The key optimizations are centered around memory access and data layout transformation. In particular, adopting a block data layout and optimizing the dataflow of FFT algorithms for the block data layout enables reshaping inefficient strided memory accesses.

This chapter starts by describing single/multi-dimensional FFT algorithms and an abstract machine model with a two-level memory hierarchy requiring block data transfers. Then 1D, 2D, and 3D FFT algorithms optimized for block data layouts are derived mathematically. The chapter then focuses on the automation capabilities for algorithm optimization and design space exploration. Finally, it analyzes the HAMLeT architecture for explicitly performing the required data layout transformation that enables the blocked data layout FFT algorithms.

8.1 Fast Fourier Transform

First we describe the fast Fourier transform (FFT) using the Kronecker product based formalism introduced in Chapter 3. Mathematically speaking, the discrete Fourier transform (DFT) of an n-element input vector corresponds to the matrix-vector multiplication y = DFT_n x, where x and y are the n-point input and output vectors respectively, and

DFT_n = [ω_n^{kℓ}]_{0 ≤ k, ℓ < n},   where ω_n = e^{−2πi/n}.

Computation of the DFT by direct matrix-vector multiplication requires O(n^2) arithmetic operations. Well-known fast Fourier transform (FFT) algorithms reduce the operation count to O(n log n). Using the Kronecker product formalism in [109], an FFT algorithm can be expressed as a factorization of the dense DFT into a product of structured sparse matrices. For example, the well-known Cooley-Tukey FFT [41] can be expressed as

DFT_{nm} = (DFT_n ⊗ I_m) D_m^{nm} (I_n ⊗ DFT_m) L_n^{nm}.    (8.1)

In (8.1), L_n^{nm} represents a stride permutation, I_n is the n×n identity matrix, and ⊗ is the Kronecker or tensor product, as we saw in Chapter 3. Finally, D_m^{nm} is a diagonal matrix of twiddle factors.
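As a small numerical illustration of (8.1), the sketch below builds the factors as explicit matrices in numpy and checks the factorization for one size. The index conventions chosen here for the stride permutation L_n^{nm} and the twiddle diagonal D_m^{nm} are one common choice and an assumption of this sketch.

    import numpy as np

    def dft(n):
        # Dense DFT_n with omega_n = exp(-2*pi*i/n), as defined above.
        k, l = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
        return np.exp(-2j * np.pi * k * l / n)

    def stride_perm(n, m):
        # One convention for L_n^{nm}: (L x)[a*m + c] = x[c*n + a].
        P = np.zeros((n * m, n * m))
        for a in range(n):
            for c in range(m):
                P[a * m + c, c * n + a] = 1.0
        return P

    def twiddle(n, m):
        # D_m^{nm}: diagonal of twiddle factors omega_{nm}^{a*b}.
        N = n * m
        return np.diag([np.exp(-2j * np.pi * a * b / N) for a in range(n) for b in range(m)])

    n, m = 4, 6
    rhs = (np.kron(dft(n), np.eye(m)) @ twiddle(n, m)
           @ np.kron(np.eye(n), dft(m)) @ stride_perm(n, m))
    assert np.allclose(rhs, dft(n * m))   # (8.1) holds under this convention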

Multidimensional DFTs can also be considered as simple matrix-vector multiplications, such that y = DFT_{n×n×⋯×n} x, where

DFT_{n×n×⋯×n} = DFT_n ⊗ DFT_n ⊗ ⋯ ⊗ DFT_n.    (8.2)

Similar to the one-dimensional DFT, multidimensional DFTs can be computed efficiently using multidimensional FFT algorithms. For example, the well-known row-column algorithm for the 2D-DFT can be expressed in tensor notation, by using (8.2) and the tensor identities in [109], as

DFT_{n×n} = (L_n^{n^2} (I_n ⊗ DFT_n) L_n^{n^2}) (I_n ⊗ DFT_n).    (8.3)

The constructs in the formulas are performed from right to left on the input data set. Abstracting the input and output vectors as n × n arrays, first n-point 1D-FFTs are applied to each of the n rows. Then, taking the results generated by the first stage as inputs, n-point 1D-FFTs are applied to each of the n columns. The overall operation of (8.3) is demonstrated in Figure 8.1(a). Here, assuming a standard row-major data layout, the first stage (row FFTs) leads to sequential accesses in main memory, whereas the stride permutations (L_n^{n^2}) in the second stage (column FFTs) correspond to stride-n accesses, as demonstrated in Figure 8.1(b).
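In numpy terms, the two stages of (8.3) are simply 1D-FFTs along the two axes; the second stage is the one that turns into stride-n accesses on a row-major layout.

    import numpy as np

    n = 8
    X = np.random.rand(n, n) + 1j * np.random.rand(n, n)
    stage1 = np.fft.fft(X, axis=1)       # row FFTs: unit-stride accesses in a row-major layout
    stage2 = np.fft.fft(stage1, axis=0)  # column FFTs: stride-n accesses in a row-major layout
    assert np.allclose(stage2, np.fft.fft2(X))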

Similarly, by using (8.2) and the tensor identities in [109], the well-known 3D decomposition algorithm for the 3D-DFT can be represented (where A^M = M^T A M) as

DFT_{n×n×n} = (I_{n^2} ⊗ DFT_n)^{L_{n^2}^{n^3}} (I_{n^2} ⊗ DFT_n)^{I_n ⊗ L_n^{n^2}} (I_{n^2} ⊗ DFT_n)^{I_{n^3}}.    (8.4)

Figure 8.1: Overview of the row-column 2D-FFT computation in (8.3): (a) logical view of the dataset (n-point 1D-FFTs on rows, then on columns), (b) address space (streaming access in the first stage, strided access in the second stage).

The overall operation of (8.4) is demonstrated in Figure 8.2(a). If we assume a sequential data layout of the cube in the x-y-z direction, the first stage (FFTs in x) corresponds to sequential accesses to main memory; however, due to the permutation matrices I_n ⊗ L_n^{n^2} and L_{n^2}^{n^3}, the second and third stages (FFTs in y and z, respectively) require stride-n and stride-n^2 accesses, respectively, as shown in Figure 8.2(b).

Figure 8.2: Overview of the 3D-decomposed 3D-FFT computation in (8.4): (a) logical view of the dataset (1D-FFTs in x, y, and z), (b) address space (streaming access in the first stage, strided access in the second and third stages).
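The 3D decomposition of (8.4) has the same flavor; in numpy terms it is three passes of 1D-FFTs over the cube, and on an x-y-z sequential layout only the first pass is unit-stride.

    import numpy as np

    n = 8
    X = np.random.rand(n, n, n) + 1j * np.random.rand(n, n, n)
    Y = np.fft.fft(X, axis=2)   # FFTs in x: unit-stride accesses
    Y = np.fft.fft(Y, axis=1)   # FFTs in y: stride-n accesses
    Y = np.fft.fft(Y, axis=0)   # FFTs in z: stride-n^2 accesses
    assert np.allclose(Y, np.fft.fftn(X))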

Until now we discussed decomposing large 2D and 3D-FFTs into small 1D-FFT computation stages that fit in the local memory of the described machine model. A large 1D-FFT whose data set does not fit in local memory requires a similar decomposition into smaller 1D-FFT kernels (e.g., the Cooley-Tukey decomposition in (8.1)). In (8.1), we observe the same stride permutations as in the row-column 2D-FFT algorithm. Hence, from a memory access pattern point of view, large-size 1D-FFTs are handled very similarly to the 2D-FFT computation (see Figure 8.1).

In summary, conventional large size FFT algorithms that use standard data layouts require strided accesses. These strided access patterns correspond to accessing different data blocks in the main memory. Continuously striding over data blocks does not allow amortizing the high latency cost of the main memory accesses, which yields very low data transfer bandwidth and high energy consumption. While there are FFT algorithms like the vector recursion [52] that ensure block transfers, they require impractically large local storage for data block sizes dictated by the main memory.

8.2 Machine Model

For optimizing the FFTs we target a high-level abstract machine model that has three main components: (1) main memory and data transfer, (2) local memory, and (3) compute. Main memory represents the S_M-size large but slow storage medium (e.g., DRAM, disk, distributed memory, etc.), which is constructed from smaller S_B-size data blocks (e.g., DRAM rows, disk pages, MPI messages, etc.). Accessing an element from a data block within the main memory is generally associated with an initial high-latency cost (A_M^{miss}). After the initial access, accessing consecutive elements from the same data block has substantially lower latency (A_M^{hit}), where A_M^{miss} = A_M^{hit} + C and C is an overhead cost whose value depends on the particular platform. Hence, the initial high-latency cost of accessing a data block in the main memory is best amortized by transferring the whole contiguous chunk of elements of the accessed data block. Algorithm design thus needs to make sure data is transferred in chunks of the native block size to minimize the overhead. In contrast, local memory is an S_L-size small buffer used for fast access to the local data (e.g., cache, scratchpad, local cluster node, etc.). We assume that the local memory can hold multiple data blocks of the main memory, i.e., S_M > S_L > S_B. Finally, compute represents the functional units that actually process the local data (e.g., vector units, ALUs, etc.). Various platforms ranging from embedded processors up to distributed supercomputers fit into this high-level machine model.
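A minimal sketch of this cost model, with placeholder latency values rather than measured ones, illustrates why whole-block transfers matter:

    # Main-memory cost sketch under the model above, assuming A_miss = A_hit + C.
    def transfer_cycles(num_elems, run_length, a_hit=1, c=20):
        # run_length = contiguous elements touched per visit to a data block
        runs = num_elems // run_length
        return runs * (a_hit + c) + (num_elems - runs) * a_hit

    n = 1 << 20
    print(transfer_cycles(n, 2048))  # whole-block transfers: the miss cost is amortized
    print(transfer_cycles(n, 1))     # element-wise strided access: a miss on every element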

FFT algorithms should be carefully fitted to these architectures to achieve high performance and power/energy efficiency.

Figure 8.3: Logical view of the dataset for the tiled and cubic representations: (a) tiled representation (an n × n matrix divided into k × k tiles), (b) cubic representation (an n × n × n cube divided into k × k × k cubes).

Figure 8.4: Block accesses demonstrated in the address space: (a) tiled access (blocks of k^2 elements), (b) cubic access (blocks of k^3 elements).

8.3 Block Data Layout FFTs

Strided accesses required by the conventional FFT algorithms lead to precisely the worst-case behavior in main memory (Figure 8.1, Figure 8.2). Changing the spatial locality of the memory accesses by adopting a customized block data layout in the main memory makes it possible to avoid inefficient strided access patterns. By avoiding strided accesses and transferring large contiguous blocks of data, one can amortize the latency of the main memory accesses and often also a large energy overhead. However, the overall dataflow of the FFT algorithm has to be restructured to map onto the block data layout. We focus on tiled and cubic data layout schemes. The resulting block data layout FFT algorithms transfer tiles or cubes of contiguous data blocks to and from the main memory in all stages of the computation.

Tiling is basically a block data layout where n^2-element vectors are considered as n×n-element matrices, which are divided into small k×k-element tiles (Figure 8.3(a)). The elements within tiles are then mapped to physically contiguous locations in the main memory. When the tile size is selected to match the data block size in the main memory, transferring a tile corresponds to transferring a whole contiguous data block from main memory. The access pattern is demonstrated in Figure 8.4.
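A minimal address-mapping sketch for this tiled layout, assuming row-major ordering of the tiles and row-major elements inside each tile, looks as follows:

    def tiled_address(row, col, n, k):
        # Map a logical (row, col) of an n x n matrix to its tiled-layout address.
        tile_id = (row // k) * (n // k) + (col // k)   # which k x k tile
        offset = (row % k) * k + (col % k)             # position inside the tile
        return tile_id * k * k + offset                # tiles occupy contiguous blocks

    # Every k x k tile then lands in one contiguous block of k*k elements:
    n, k = 8, 4
    assert sorted(tiled_address(r, c, n, k)
                  for r in range(k) for c in range(k)) == list(range(k * k))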

As first introduced in [19], given the tiled data layout, one can compute the 2D-FFT while avoiding strided accesses. The main idea is, instead of transferring stripes of elements in the row and column directions as shown in Figure 8.1, to transfer "tiles" in the row and column directions (see 1 and 2 in Figure 8.3(a)).

From a memory access pattern perspective, the 2-stage Cooley-Tukey algorithm for the 1D-FFT has the same behavior as the row-column 2D-FFT algorithm, as mentioned in Section 8.1. Hence the tiled data layout can be used in computing the 1D-FFT to avoid strided accesses.

Similarly, in the cubic data layout, an n^3-element data set is abstracted as an n × n × n-element three-dimensional cube, which is divided into smaller k × k × k-element cubes (see Figure 8.3(b)). Each k×k×k-element small cube is then physically mapped into a contiguous data block in main memory. Hence, transferring a cube corresponds to transferring a whole data block from the main memory. The access pattern is demonstrated in Figure 8.4.

The cubic data layout enables 3D-FFT computation without strided accesses. Similar to the 2D-FFT scheme, the main idea is, instead of transferring stripes of elements in the x, y and z directions as demonstrated in Figure 8.2, to transfer "cubes" in the x, y and z directions (see 1, 2 and 3 in Figure 8.3(b)).

A custom data layout comes along with an address translation scheme that maps the logical addresses to the physical locations in the main memory. As discussed in Chapter 3, the formal representation of a custom data layout scheme is given as DFT = ((DFT) \overleftarrow{Q}) \overrightarrow{P}, where Q = P^{-1}. Here \overleftarrow{(\cdot)} and \overrightarrow{(\cdot)} represent the address mapping and the data layout, respectively. This representation only makes the data layout and the address mapping constructs explicitly labelled; the overall operation is still a natural DFT computation. Block data layouts and their address mapping schemes can be expressed in the proposed formal framework, as we will see.

In summary, large-size single and multidimensional FFTs, when decomposed into smaller stages, require strided memory accesses. These FFTs can be performed while avoiding strided accesses by using block data layouts. (i) When each tile (cube) is mapped to a data block in main memory, transferring tiles (cubes) in any order (i.e., 1, 2 and 3 in Figure 8.3) makes use of whole data blocks. (ii) Once tiles (cubes) are transferred to on-chip memory, they can be shuffled freely, since on-chip memory does not incur any extra penalty depending on the access patterns. Hence, by combining the two properties (i) and (ii), one can efficiently realize the memory access patterns required by multi-stage FFT algorithms. Note that these algorithms require a full row or column of tiles (cubes) to be held simultaneously in the fast on-chip memory so that the 1D-FFTs can be done locally.

Although conceptually straightforward, capturing the details of the overall operation algorithmically is non-trivial. The discussion above omits the complexity of the data permutations and the details of the local computation. Formal representation in tensor notation allows capturing all the non-trivial details of the algorithms and the machine model in the same framework. Abstracting the algorithm and the machine model in the same framework allows detailed formula manipulations targeting the machine model, which is crucial to achieve high performance implementations.

8.4 Formally Restructured Algorithms

In our approach, we identify a set of formula identities that restructure the given FFT formula so that it can be mapped to the data layout scheme and target machine model efficiently. The goals for restructuring the FFT are (i) to make sure that all the permutations that correspond to main memory accesses are restructured to transfer tiles/cubes, and (ii) to break down the formula constructs such that the local permutations and local 1D-FFT computations fit in the local memory.

Rewrite Rules. Rewrite rules are a set of formula identities that capture the restructuring of the FFT algorithms. Rewrite rules only restructure the dataflow; the overall computation stays the same. We list the necessary formula identities used in restructuring the algorithms for the tiled data layout in Table 8.1.

Table 8.1: Tiled mapping rewrite rules.

A → (A \overleftarrow{Q}) \overrightarrow{P}, where Q = P^{-1}    (8.5)
A B → A S B    (8.6)
I_n ⊗ A_n → I_{n^2} (I_{n/k} ⊗˜ (I_k ⊗ A_n)) I_{n^2}    (8.7)
A_n ⊗ I_n → L_n^{n^2} (I_{n/k} ⊗˜ (I_k ⊗ A_n)) L_n^{n^2}    (8.8)
I_{n^2} → (I_{n/k} ⊗˜ (L_k^{n} ⊗ I_k)) (I_{n/k} ⊗ L_{n/k}^{n} ⊗ I_k)    (8.9)
L_n^{n^2} → (I_{n/k} ⊗˜ L_k^{nk}) (L_{n/k}^{n^2/k} ⊗ I_k)    (8.10)

In (8.9) the left and right factors are labelled R_{8.9b} and R_{8.9a}, respectively; in (8.10) they are labelled R_{8.10b} and R_{8.10a}.

The labels on the restructured formula constructs represent the implied functionality in the implementation: \overleftarrow{(\cdot)} and \overrightarrow{(\cdot)} represent the address translation and the data layout, respectively; S represents a memory fence; I_ℓ ⊗˜ corresponds to the iteration operator; and the remaining two decorations mark the local kernel (permutation or computation) and the main memory access permutation, respectively. These labelled formula constructs are restructured base cases; hence an FFT algorithm that consists only of these constructs is considered a final restructured algorithm. Next we analyze the rewrite rules in more detail through an example.

Tiled 2D-FFT. We now apply the rewrite rules to a given 2D-DFT problem to obtain the restructured FFT. For DFT_{n×n}, we assume the n^2-element data set size (S_D) does not fit in the local memory, i.e., S_M > S_D > S_L; hence the large DFT_{n×n} should be decomposed into smaller DFT_n stages as discussed. Further, we assume that k × k tiles match the data block size S_B and that the local memory can hold a whole stripe of n/k tiles, i.e., S_L ≥ S_B × n/k. We now derive a tiled 2D-FFT targeting this machine model. We first demonstrate the algorithm derivation steps, then focus on the structure of the final derived algorithm.

As shown in Table 8.3, the starting point is DFT_{n×n}. First, as shown in (8.17), DFT_{n×n} is expanded into smaller DFT stages by using (8.3). Then the DFT stages are separated via a memory fence, as shown in (8.18), by using rule (8.6). Next, rules (8.7)-(8.8) make the data permutations explicit and label the kernel computation, as shown in (8.19).

Table 8.2: Cubic mapping rules. R_j refers to the right-hand side of (j).

I_n ⊗ I_n ⊗ A_n → I_{n^3} (I_{n^2/k^2} ⊗˜ (I_{k^2} ⊗ A_n)) I_{n^3}    (8.11)
I_n ⊗ A_n ⊗ I_n → (I_n ⊗ L_n^{n^2}) (I_{n^2/k^2} ⊗˜ (I_{k^2} ⊗ A_n)) (I_n ⊗ L_n^{n^2})    (8.12)
A_n ⊗ I_n ⊗ I_n → L_n^{n^3} (I_{n^2/k^2} ⊗˜ (I_{k^2} ⊗ A_n)) L_{n^2}^{n^3}    (8.13)
I_{n^3} → (I_{n/k} ⊗ L_k^{n} ⊗ I_{nk}) (I_{n^2/k^2} ⊗˜ (L_{k^2}^{nk} ⊗ I_k)) (I_{n^2/k^2} ⊗ L_{n/k}^{n} ⊗ I_{k^2}) (I_{n/k} ⊗ L_{n/k}^{n} ⊗ L_{n/k}^{n} ⊗ I_k) = R_{8.14}    (8.14)
I_n ⊗ L_n^{n^2} → (I_{n/k} ⊗ L_k^{n} ⊗ I_{nk}) (I_{n^2/k^2} ⊗˜ (L_{k^2}^{nk^2} (I_{n/k} ⊗ L_k^{k^2} ⊗ I_k))) (I_{n/k} ⊗ L_{n/k}^{n^2} ⊗ I_k) (I_{n/k} ⊗ L_{n/k}^{n} ⊗ I_{nk}) = R_{8.15}    (8.15)
L_{n^2}^{n^3} → (I_{n/k} ⊗ L_k^{n} ⊗ I_{nk}) (I_{n^2/k^2} ⊗˜ L_{k^2}^{nk^2}) (L_{n^2/k^2}^{n^3/k} ⊗ I_k) (I_n ⊗ L_k^{n} ⊗ I_n) = R_{8.16}    (8.16)

Table 8.3: Tiled 2D-FFT algorithm derivation steps. (R_{8.9} and R_{8.10} are given in (8.9)-(8.10) in Table 8.1.)

DFT_{n×n} = (DFT_n ⊗ I_n) (I_n ⊗ DFT_n)    (8.17)
          = (DFT_n ⊗ I_n) S (I_n ⊗ DFT_n)    (8.18)
          = L_n^{n^2} (I_{n/k} ⊗˜ (I_k ⊗ DFT_n)) L_n^{n^2} S I_{n^2} (I_{n/k} ⊗˜ (I_k ⊗ DFT_n)) I_{n^2}    (8.19)
          = (L_n^{n^2/k} ⊗ I_k) (I_{n/k} ⊗˜ L_n^{nk}) (I_{n/k} ⊗˜ (I_k ⊗ DFT_n)) (I_{n/k} ⊗˜ L_k^{nk}) (L_{n/k}^{n^2/k} ⊗ I_k) S
            (I_{n/k} ⊗ L_k^{n} ⊗ I_k) (I_{n/k} ⊗˜ (L_{n/k}^{n} ⊗ I_k)) (I_{n/k} ⊗˜ (I_k ⊗ DFT_n)) (I_{n/k} ⊗˜ (L_k^{n} ⊗ I_k)) (I_{n/k} ⊗ L_{n/k}^{n} ⊗ I_k)    (8.20)
          = ((R_{8.10a}^T R_{8.10b}^T (I_{n/k} ⊗˜ (I_k ⊗ DFT_n)) R_{8.10b} R_{8.10a} S R_{8.9a}^T R_{8.9b}^T (I_{n/k} ⊗˜ (I_k ⊗ DFT_n)) R_{8.9b} R_{8.9a}) \overleftarrow{Q}) \overrightarrow{P},
            where P = Q^{-1} = I_{n/k} ⊗ L_{n/k}^{n} ⊗ I_k    (8.21)

Table 8.4: Tiled 1D-FFT algorithm derivation steps. (R_{8.9} and R_{8.10} are given in (8.9)-(8.10) in Table 8.1.)

DFT_{n^2} = (DFT_n ⊗ I_n) D_n^{n^2} (I_n ⊗ DFT_n) L_n^{n^2}    (8.22)
          = (DFT_n ⊗ I_n) S D_n^{n^2} (I_n ⊗ DFT_n) L_n^{n^2}    (8.23)
          = L_n^{n^2} (I_{n/k} ⊗˜ (I_k ⊗ DFT_n)) L_n^{n^2} S I_{n^2} D_n^{n^2} (I_{n/k} ⊗˜ (I_k ⊗ DFT_n)) L_n^{n^2}    (8.24)
          = ((R_{8.10a}^T R_{8.10b}^T (I_{n/k} ⊗˜ (I_k ⊗ DFT_n)) R_{8.10b} R_{8.10a} S R_{8.9a}^T R_{8.9b}^T D_n^{n^2} (I_{n/k} ⊗˜ (I_k ⊗ DFT_n)) R_{8.10b} R_{8.10a}) \overleftarrow{Q}) \overrightarrow{P},
            where P = Q^{-1} = I_{n/k} ⊗ L_{n/k}^{n} ⊗ I_k    (8.25)

Table 8.5: Cubic 3D-FFT algorithm derivation steps. (R_{8.14}-R_{8.16} are given in (8.14)-(8.16) in Table 8.2.)

DFT_{n×n×n} = (DFT_n ⊗ I_n ⊗ I_n) (I_n ⊗ DFT_n ⊗ I_n) (I_n ⊗ I_n ⊗ DFT_n)    (8.26)
            = (DFT_n ⊗ I_n ⊗ I_n) S (I_n ⊗ DFT_n ⊗ I_n) S (I_n ⊗ I_n ⊗ DFT_n)    (8.27)
            = L_n^{n^3} (I_{n^2/k^2} ⊗˜ (I_{k^2} ⊗ DFT_n)) L_{n^2}^{n^3} S (I_n ⊗ L_n^{n^2}) (I_{n^2/k^2} ⊗˜ (I_{k^2} ⊗ DFT_n)) (I_n ⊗ L_n^{n^2}) S I_{n^3} (I_{n^2/k^2} ⊗˜ (I_{k^2} ⊗ DFT_n)) I_{n^3}
            = (((I_{n^2/k^2} ⊗˜ (I_{k^2} ⊗ DFT_n))^{R_{8.16}} S (I_{n^2/k^2} ⊗˜ (I_{k^2} ⊗ DFT_n))^{R_{8.15}} S (I_{n^2/k^2} ⊗˜ (I_{k^2} ⊗ DFT_n))^{R_{8.14}}) \overleftarrow{Q}) \overrightarrow{P},
              where P = (I_{n^2/k^2} ⊗ L_{n/k}^{n} ⊗ I_{k^2}) (I_{n/k} ⊗ L_{n/k}^{n} ⊗ L_{n/k}^{n} ⊗ I_k)    (8.28)

Rules (8.9)-(8.10) restructure the data permutations so that they correspond to tile transfer operations (see (8.20)). Finally, in (8.21), rule (8.5) defines the data layout and the corresponding address mapping. The result is given in (8.21). Note that the rewrite rules only apply formula identities; hence, the overall operation is still the same DFT_{n×n} computation with a different dataflow. In (8.21), we observe that the local permutation and computation kernels, of size k × n elements, fit in the local memory. However, the derived algorithm is restricted to problem sizes for which a whole stripe of tiles can be held simultaneously in the local memory so that the kernels can be processed locally (remember S_L ≥ S_B × n/k, where S_B = k^2 elements). Inspection of (8.21) shows that all of the formula constructs are labelled base cases of the tiling rewrite rules; hence this formula corresponds to a final restructured algorithm.

We now analyze the final restructured algorithm by interpreting Equation (8.21). Considering that the DFT computation is a matrix-vector multiplication, the constructs in the resulting algorithm (8.21) are performed from right to left on the input data set. First, R_{8.9a} reads tiles, i.e., whole contiguous data blocks, from the main memory. Then, R_{8.9b} shuffles the local data to natural order, and 1D-FFTs are applied to the local data. Finally, R_{8.9b}^T re-shuffles the local data after FFT processing and R_{8.9a}^T writes the local data back into the main memory as tiles. These operations are executed in a pipelined, parallel way for the entire dataset, which completes the first stage. The algorithm consists of two stages separated by a memory fence (S). The overall operation in the second stage is very similar to the first stage, except for the permutations: the second stage has the permutations R_{8.10a-b} instead of R_{8.9a-b}, where R_{8.9}-R_{8.10} are given in (8.9)-(8.10).
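To make this structure concrete, the following numpy sketch mimics the two-stage organization of (8.21): every main-memory access moves whole k × k tiles (contiguous blocks), while the shuffles and 1D-FFTs happen on a local stripe. The row-major ordering of tiles is an assumption of the sketch, not prescribed by the formula.

    import numpy as np

    def to_tiled(X, k):
        # Tile (tr, tc) becomes one contiguous k*k block, row-major inside the tile.
        n = X.shape[0]; t = n // k
        return np.array([X[tr*k:(tr+1)*k, tc*k:(tc+1)*k].ravel()
                         for tr in range(t) for tc in range(t)])

    def from_tiled(T, n, k):
        t = n // k
        X = np.empty((n, n), dtype=T.dtype)
        for tr in range(t):
            for tc in range(t):
                X[tr*k:(tr+1)*k, tc*k:(tc+1)*k] = T[tr*t + tc].reshape(k, k)
        return X

    def tiled_2dfft(T, n, k):
        t = n // k
        # Stage 1: read one row-stripe of tiles, run k row-FFTs of length n locally, write tiles back.
        for tr in range(t):
            stripe = np.hstack([T[tr*t + tc].reshape(k, k) for tc in range(t)])   # k x n
            stripe = np.fft.fft(stripe, axis=1)
            for tc in range(t):
                T[tr*t + tc] = stripe[:, tc*k:(tc+1)*k].ravel()
        # (memory fence S) Stage 2: the same per column-stripe, with column FFTs.
        for tc in range(t):
            stripe = np.vstack([T[tr*t + tc].reshape(k, k) for tr in range(t)])   # n x k
            stripe = np.fft.fft(stripe, axis=0)
            for tr in range(t):
                T[tr*t + tc] = stripe[tr*k:(tr+1)*k, :].ravel()
        return T

    n, k = 16, 4
    X = np.random.rand(n, n) + 1j * np.random.rand(n, n)
    assert np.allclose(from_tiled(tiled_2dfft(to_tiled(X, k), n, k), n, k), np.fft.fft2(X))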

In addition to the tiled 2D-FFT algorithm, we use the set of formula identities (8.5)-(8.10) given in Table 8.1 to derive optimized tiled algorithms for the large 1D-FFT (see Table 8.4). Further, we use the formula identities (8.11)-(8.16) shown in Table 8.2 to derive 3D-FFT algorithms using the cubic data layout (see Table 8.5). Similar to the 2D-FFT case, rewrite rules are applied to restructure the dataflow, and the derivation steps are shown in Tables 8.4 and 8.5. The final optimized algorithms are given in (8.21), (8.28) and (8.25) for 2D, 3D and 1D, respectively.

Until here we focused only on 1D, 2D and 3D FFTs using tiled and cubic data layouts, yet the mathematical framework can easily be extended to higher-dimensional FFTs using higher-dimensional hypercube data layouts.

8.5 Algorithm and Architecture Design Space

Formal representation in tensor notation allows capturing all the non-trivial details of the algorithms and the machine model in the same framework. Abstracting the algorithm and the machine model in the same framework allows detailed formula manipulations targeting the machine model. In this section we demonstrate a design space exploration methodology for an efficient FFT hardware accelerator implementation exploiting the formal framework.

8.5.1 Automated Design Generation via Spiral

We use the Spiral [98] formula generation and optimization framework to automatically derive the details of the block data layout FFT algorithms and generate hardware implementations. Spiral uses the formal representation of FFT algorithms in tensor notation and restructures the algorithms using rewrite rules. We included the formula identities shown in Table 8.1 and Table 8.2 in Spiral's formula rewriting system so that it derives the optimized algorithms automatically (e.g., (8.21), (8.28) and (8.25)). Lastly, we built a custom backend that translates the hardware datapath formula into the full system described in Verilog.

Figure 8.5: Overview of the design generator tool: the FFT problem and hardware parameters enter formula generation; the algorithm formula is restructured by formula rewriting using general Spiral rules and the custom (e.g., DRAM) rewrite rules; and the resulting hardware formula is translated by RTL generation into the full system in Verilog.

The overall view of our compilation flow is shown in Figure 8.5. First, the input FFT problem is expanded as a formula tagged with hardware parameters. At this point the formula is a high-level generic representation that does not have explicit implications for the hardware. Then this formula is restructured using our block data layout rewrite rules as well as selected Spiral default rules. After this step we reach a final structured formula where each formula construct is labelled with its targeted hardware module. Lastly, the custom backend generates the hardware components for the labelled formula constructs in Verilog, targeting the parameterized architecture shown in Figure 8.6. In addition to the inferred hardware structures, the backend generates the necessary wrappers, glue logic and configuration files that interconnect all the modules and configure the full system.

8.5.2 Formula to Hardware

The architecture that we target (shown in Figure 8.6) is highly parameterized and scalable, and can be configured for the given problem/platform parameters.

DRAM Controllers. In Figure 8.6, without loss of generality, we provide a dual-channel architecture featuring two DRAM controllers. Throughout the computation, one of the DRAM controllers is used for reading the inputs and the other one for writing the outputs to DRAM in parallel. When a stage is completed, the two DRAM controllers switch their read/write roles for the next stage, repeatedly, until the computation is finished. Note that the dual-channel architecture is an abstraction of the actual underlying hardware: multiple DRAM channels can be bundled together, or a single DRAM channel can be split into two, to fit this abstraction.

Figure 8.6: Overall view of the targeted architecture: two DRAM controllers (A and B, each with an address mapping unit), two local memories built from RAM banks with switch and connection networks and permutation memories (w words per cycle), a streaming FFT core, and a twiddle unit with twiddle ROM; an additional permutation memory is used only for the 1D-FFT.

Depending on the optimizations that are used, a memory mapping scheme is represented in the generated hardware formula. The RTL generator translates this representation and configures the address mapping module in the DRAM controller (see Figure 8.6).

Local Memories. The DRAM-optimized algorithms used in this work require holding multiple tiles (cubes) on-chip at once to apply the data permutation and 1D-FFTs on the local data. Local memories are fast local buffers constructed from SRAM and connection networks serving that requirement. Considering that multiple elements arrive each cycle, these data shuffle operations become non-trivial to achieve with minimal storage and without bank conflicts. Our tool first labels the formula constructs that correspond to these local data shuffle permutations. Then the formal representation of the permutation, along with the hardware directives, is used to automatically generate permutation blocks based on the techniques described in [82].

Local memories also form the interface between the FFT core and the DRAM controller and decouple the two. We employ a double-buffering technique in the local memories; hence, the computation and data transfer operations are overlapped to maximize the system throughput.

FFT Core. We generate the streaming 1D-FFT core automatically using [83]. Control over the FFT core parameters (radix r, streaming width w) allows us to adjust the computation throughput according to the data transfer bandwidth to create balanced designs.
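A back-of-the-envelope balance check, with assumed example numbers, shows how w can be chosen: the core must consume complex words at least as fast as the memory system streams them.

    # Streaming width needed to keep up with a given bandwidth (example numbers assumed).
    def min_streaming_width(bandwidth_gb_s, freq_ghz, bytes_per_word=8):
        words_per_sec = bandwidth_gb_s * 1e9 / bytes_per_word    # single-precision complex words
        return words_per_sec / (freq_ghz * 1e9)                  # words per cycle

    print(min_streaming_width(25.6, 0.8))    # ~4 for a 25.6 GB/s dual-channel DDR3 system at 0.8 GHz
    print(min_streaming_width(305.3, 1.2))   # ~32 for conf-E (305.3 GB/s) at 1.2 GHz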

The 1D-FFT algorithm for large problem sizes requires a separate twiddle multiplication step when decomposed, as discussed in Section 8.1. The FFT core is augmented with a separate twiddle factor multiplication unit for that case, as shown in Figure 8.6.

8.5.3 Design Space Parameters

There are several ways of architecting a system to achieve a given performance or power efficiency goal for a given FFT problem and hardware resources. Control over the adjustable design parameters allows exploring the space of design possibilities. Some of the important design parameters can be categorized as follows:

• Throughput: FFT radix (r), streaming width (w), frequency ( f ).

• Bandwidth: Type of tiling (2D, 3D), tile size (T)

• Algorithm: Algorithmic choices provided by Spiral (A)

In addition to the adjustable design parameters, there are also given problem and platform constraints:

• Problem: FFT type (1D, 2D, 3D), size (n), precision (p)

• Platform: DRAM banks (b), row buffer size (R), max bandwidth (B), max power (P), max

on-chip SRAM (S)

As we will see, exploring such a design space, constructed by the given problem/platform constraints and design parameters, is extremely difficult considering the relations between them and the costs associated with adjusting every parameter. An automated design generation and exploration flow is essential to cover this wide design space and decide on the most efficient parameters.

8.6 Experimental Results

8.6.1 Machine Model Based Evaluation

The algorithms derived in this work can be implemented in hardware or software on various platforms since we target a generic machine model. However, for the machine model based evaluations we use hardware implementations on an Altera DE4 FPGA platform. The DE4 FPGA platform comes with two channels of in total 2 GB of DDR2-800 DRAM, which corresponds to the main memory in the machine model described previously, so S_M = 2 GB. DRAM rows are the data blocks in the main memory, and the DRAM row buffer size is 8 KB, so S_B = 8 KB. Strided accesses to different data blocks (i.e., DRAM rows) yield 1.16 GB/s DRAM data transfer bandwidth, whereas transferring contiguous DRAM-row-buffer-size data chunks results in 11.87 GB/s bandwidth out of the 12.8 GB/s theoretical peak. We have measured the penalty of a non-contiguous access to a different data block as approximately A_M^{miss} − A_M^{hit} = C = 20 clock cycles at 200 MHz. The DE4 further provides 2.53 MB of on-chip SRAM, which we consider as the local memory, hence S_L = 2.53 MB. Finally, the floating point units correspond to the compute element in the machine model. For example, a single-precision 4K×4K 2D-FFT has a total data set of S_D = 128 MB where n = 4096. Further, we need 32×32 single-precision complex-valued element tiles to match S_B = 8 KB, so k = 32. This overall configuration fits the described machine model assumptions, i.e., S_M > S_D > S_L ≥ S_B × n/k.

Our evaluation results are summarized in Table 8.6. We report performance in Gflop/s, calculated as 5 n log2(n)/t for DFT_n where t is the total runtime, thus higher is better. In Table 8.6, we provide (i) the bandwidth-bounded theoretical peak performance for the Altera DE4, where we assume zero-latency but bandwidth-limited DRAM and infinitely fast on-chip processing (hence it is calculated as the total flops divided by the time it takes to transfer the entire data at the peak DRAM bandwidth), (ii) actual results from Spiral-generated block data layout implementations on the Altera DE4, and finally (iii) a realistic peak performance estimated by our model, where we include the DRAM latency cost and bounded on-chip processing.

Table 8.6: Performance results from the Altera DE4. (GF = GFLOPS, TP = theoretical peak.)

FFT              Prec. [bits]   TP [GF]   Perf [GF] (% of TP)   Model [GF] (error)
256×256          32             32        23.2 (72.6%)          23.1 (-0.4%)
512×512          32             36        29.2 (81.2%)          29.3 (+0.2%)
1k×1k            32             40        34.4 (86.0%)          34.7 (+0.9%)
2k×2k            32             44        38.3 (87.2%)          39.5 (+3.1%)
4k×4k            32             48        42.1 (87.7%)          43.8 (+4.2%)
256×256          64             16        12.4 (77.9%)          12.5 (+0.4%)
512×512          64             18        14.9 (83.2%)          15.1 (+0.8%)
1k×1k            64             20        17.1 (86.0%)          17.4 (+1.1%)
2k×2k            64             22        19.2 (87.4%)          19.5 (+1.4%)
128×128×128      32             28        23.4 (83.6%)          23.6 (+1.2%)
256×256×256      32             32        26.8 (84.0%)          27.2 (+1.3%)
512×512×512      32             36        30.3 (84.2%)          30.7 (+1.3%)
64k              32             32        23.3 (72.7%)          23.7 (+2.1%)
256k             32             36        29.2 (81.2%)          29.7 (+1.6%)
1M               32             40        34.4 (86.1%)          34.9 (+1.4%)
4M               32             44        38.4 (87.2%)          39.6 (+3.2%)
16M              32             48        42.1 (87.7%)          43.8 (+4.1%)

Implementations generated by Spiral always transfer contiguous data blocks from main memory (i.e., whole DRAM rows). Optimized use of the DRAM row buffer leads to efficient DRAM bandwidth utilization; hence, the Spiral-generated implementations reach on average 83% of the theoretical peak performance and 97.5% of the realistic peak performance.
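For reference, the performance metric used in Table 8.6 can be evaluated directly; the runtime in the example below is a made-up illustration, not a measured value.

    import math

    def gflops(n_points, runtime_s):
        # Performance metric used above: 5*n*log2(n)/t for DFT_n, reported in Gflop/s.
        return 5 * n_points * math.log2(n_points) / runtime_s / 1e9

    print(gflops(4096 * 4096, 0.05))   # a 4k x 4k 2D-FFT finishing in 50 ms gives ~40 Gflop/s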

Performance Model Verification. We use the realistic peak performance estimations as our performance model in the design space explorations. Given the platform and FFT problem parameters, the model gives performance estimations. Taking the Altera DE4 as the target platform, the estimations for 1D, 2D and 3D FFTs are given in Table 8.6. When the performance estimations and the actual hardware results are compared, the error is found to be less than 4.25% for the designs that are generated and implemented on the DE4.

8.6.2 Design Space Exploration

Experimental Setup. Before going into the details of the design space exploration, we first explain our experimental methodology. We use Spiral to generate designs for a given problem. Given the FFT problem and hardware parameters, our tool generates the HDL as well as simulation and synthesis scripts. Then, we synthesize the generated HDL targeting a commercial 32 nm standard cell library using Synopsys Design Compiler, following the standard ASIC synthesis flow. In addition to the standard ASIC synthesis flow, for non-HDL components we use tools such as CACTI 6.5 for on-chip RAMs and ROMs [2], McPAT for the DRAM memory controllers [6], and DesignWare for the single and double precision floating point units [3]. We target both off-chip DDR3 DRAM and 3D-stacked DRAM as main memory. For off-chip DRAM we use DRAMSim2 [102] and the Micron Power Calculator [8] to estimate DRAM performance and power consumption. For the 3D-stacked DRAM model we use CACTI-3DD [38]. Lastly, for the overall performance estimation, we use a custom performance model backed up by cycle-accurate simulation. All of the tools are integrated, resulting in an automatic, push-button, end-to-end design generation and exploration tool.

Exploration. First, we present an example design space for a selected 2D-FFT hardware implementation using 3D-stacked DRAM. In this scheme, the 2D-FFT accelerator hardware is stacked as a separate layer in the 3D-stacked DRAM. The given problem parameters are an 8192 × 8192 point complex single-precision (2 × 32 bits per complex word) 2D-FFT, and the platform parameters are a 4-layer, 8-banks-per-layer, 512-TSV, 8192-bit row buffer (N_stack = 4, N_bank = 8, N_TSV = 512, R = 1 KB) fine-grain rank-level 3D-stacked DRAM. Further, we set S = 8 MB and P = 40 W.

Problem and platform parameters are the basic inputs to the generator. Spiral handles the algorithmic design choices (A) and prunes the algorithms that are determined to be suboptimal at the formula level. Then, for the given input configuration, it generates several hardware instances varying the design space parameters. A subset of the design space is shown in Figure 8.7 in terms of performance (GFLOPS) and power consumption (Watt) for various streaming width (w = 2, 4, 8, 16), frequency (f = 0.4 GHz to 2 GHz), radix (r = 2) and tile size (T = 0.125x to 2x the row buffer size) parameters.

Figure 8.7: Design space exploration for the 8192×8192 2D-FFT with 3D-stacked DRAM, plotting performance [GFLOPS] against power [Watt] for design points at 0.4-2.0 GHz. Isolines represent constant power efficiency in Gflop/s/W (25, 35, 50, and 75 GFLOPS/W).

Figure 8.7 also provides isolines of constant power efficiency, in other words achieved performance per consumed power (GFLOPS/Watt). We observe that there are multiple design points on the constant power efficiency lines, which suggests that the same power efficiency can be achieved with different parameter combinations. There are also several suboptimal designs that lie behind the Pareto frontier. Hence, it is not obvious at design time which parameter combinations will yield the most efficient system. This highlights the complexity of the design space.
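A hedged sketch of the exploration loop behind such a plot is shown below: enumerate the design parameters, score each point, and keep the Pareto frontier. The perf_model and power_model functions here are arbitrary placeholders standing in for the thesis' performance model and simulators.

    from itertools import product

    def pareto(points):
        # Keep points that no other point strictly dominates in (higher perf, lower power).
        return [p for p in points
                if not any(q["perf"] >= p["perf"] and q["power"] <= p["power"]
                           and (q["perf"] > p["perf"] or q["power"] < p["power"])
                           for q in points)]

    def perf_model(w, f, tile):        # placeholder GFLOPS estimate
        return min(20.0 * w * f, 400.0 * tile / (tile + 8))

    def power_model(w, f, tile):       # placeholder Watt estimate
        return 0.5 * w * f + 0.05 * tile

    designs = [{"w": w, "f": f, "tile": t,
                "perf": perf_model(w, f, t), "power": power_model(w, f, t)}
               for w, f, t in product([2, 4, 8, 16],
                                      [0.4, 0.8, 1.2, 1.6, 2.0],
                                      [4, 8, 16, 32, 64])]
    best = max(pareto(designs), key=lambda d: d["perf"] / d["power"])
    print(best)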

To further elaborate on the design space tradeoffs, we first illustrate the effects of the tile size in the tiled FFT algorithms on the performance and power efficiency for different system configurations in Figure 8.8 (see Table 8.7 for the memory configurations). (i) DRAM and (ii) on-chip resources are the two main components of the overall system. (i) Increasing the tile size improves the spatial locality of DRAM accesses through efficient use of the row buffer. Efficient use of the row buffer leads to minimal activate/precharge operations, which improves the bandwidth utilization and energy efficiency in DRAM, and consequently the overall system performance and energy efficiency. (ii) On the other hand, from the on-chip resources viewpoint, the local memory needs to be large enough to hold larger tiles, which increases the power consumption. Additionally, a larger local memory needs to be filled and emptied at the beginning and at the end of the overall computation pipeline, which decreases the performance. These conflicting tradeoffs in determining the tile size constitute an optimization problem.

Figure 8.8: Effect of the tile size on performance and power efficiency (DRAM power, on-chip power, performance, and power efficiency) for (a) off-chip DRAM (conf-A) and (b) 3D-stacked DRAM (conf-E) systems, for 512×512, 2k×2k, 8k×8k and 32k×32k 2D-FFTs and tile widths from 4 to 64 (the rest of the parameters are fixed).

Moreover, different platform and problem configurations have different optimization curves. For example, the power consumption of smaller problem-size configurations is heavily dominated by the DRAM power consumption; hence it is more desirable to improve the DRAM power consumption with larger tiles. Larger problem-size configurations, in contrast, prefer smaller tile sizes to save on-chip power consumption (see Figure 8.8). Different platform configurations can also have different tradeoff relations. For example, BLP (bank-level parallelism) allows overlapping row buffer miss delays with data transfers on different banks. Therefore, one can maximize the performance even if the RBL is not fully utilized, at the extra energy cost of activate/precharge operations in DRAM (see Figure 8.8(a)).

Figure 8.9: Frequency (f) and streaming width (w) effects on power and performance for various problem/platform configurations at a fixed tile size: (a) 8k×8k FFT using conf-E, (b) 2k×2k FFT using conf-A, (c) 8k×8k FFT using conf-D, (d) 2k×2k FFT using conf-C. The parameter combinations of the best designs (GFLOPS/W) are labelled.

However, in 3D-stacked DRAM, the banks within a rank contribute additively to the aggregate bandwidth by exploiting the high-bandwidth TSV bus, as discussed previously. Therefore, unlike off-chip DRAMs, low RBL utilization (i.e., small tiles) reduces the performance significantly, as shown in Figure 8.8(b).

Determining the memory subsystem configuration, particularly the tile size in our case, is a key component in achieving high performance and energy efficiency, but the computation configuration is of great importance as well to achieve a balanced design. In a balanced design, the computation throughput needs to match the estimated bandwidth utilization. One can pick different parameter combinations that achieve the same throughput matching the given DRAM bandwidth. In Figure 8.9, given a fixed tile size and fixed platform/problem parameters (i.e., fixed DRAM bandwidth), we demonstrate design instances with various frequency and streaming width parameters. As highlighted in Figure 8.9, the most efficient parameter combinations vary for different problem/platform configurations.

Table 8.7: Main memory configurations.

Name     Configuration (off-chip DRAM)                  Max BW [GB/s]
         Chan / Bank / R [Kb] / width / Type
conf-A   8 / 8 / 64 / x8 / DDR3-1600                    102.4
conf-B   8 / 8 / 64 / x8 / DDR3-1333                    85.36
conf-C   8 / 8 / 64 / x8 / DDR3-800                     51.2

Name     Configuration (3D-stacked DRAM)                Max BW [GB/s]
         Nstack / Nbank / NTSV / R [Kb] / Tech [nm]
conf-D   4 / 8 / 256 / 8 / 32                           178.2
conf-E   4 / 8 / 512 / 8 / 32                           305.3
conf-F   4 / 8 / 512 / 8 / 45                           246.7
conf-G   4 / 8 / 256 / 32 / 32                          229.5

Finding the best system configuration given the problem/platform constraints establishes an optimization problem. Simply chasing the highest performance or the lowest power consumption is not sufficient to get the most efficient system. As shown in Figure 8.8 and Figure 8.9, a careful study of the design space is necessary to understand the tradeoffs. It is difficult to determine the crossover points where one of the dependent parameters becomes more favorable than the others, and there is no structured way of finding the best parameter combinations by hand. This highlights the importance of an automatic design generation and exploration system; it would be extremely difficult to complete such a design exploration manually.

Pareto-optimal Designs. We automatically generate Pareto-optimal DRAM-optimized implementations and evaluate their performance and energy/power efficiency for the main memory as well as for the overall system. Our experiments include various main memory configurations, which are shown in Table 8.7. First, the overall system performance and power efficiency comparison of the naive baseline and the DRAM-optimized implementations for 1D, 2D and 3D-FFTs of various sizes, with regular off-chip DDR3 DRAMs and 3D-stacked DRAMs, is demonstrated in Figure 8.10. The DRAM-optimized implementations are generated by Spiral for the given problem and platform parameters. The same on-chip hardware resources and the same memory configurations are provided both for the baseline and the DRAM-optimized implementations. However, the baseline implementations have naive, unoptimized DRAM access patterns, in contrast to the DRAM-friendly access patterns of the optimized implementations. Further, the DRAM-optimized implementations are "Pareto optimal" in that they are specifically fitted to the target platform for the best performance per power (i.e., GFLOPS/Watt) by the design generator tool.

(a) 2D, 3D and 1D-FFT (Off-chip DRAM, conf-A)

16 640

Naïve (perf.) DRAM-opt (perf.) Naïve (power eff.) DRAM-opt (power eff.) 12 480

8 320

4 160 Performance(GFLOPS)

Power Eff. (GFLOPS/W)PowerEff. 0 0 9 10 11 12 13 14 15 16 7 8 9 10 11 16 18 20 22 24 26 28 30 2D-FFT 3D-FFT 1D-FFT FFT size (log(N) for NxN 2D-FFT, NxNxN 3D-FFT and N-point 1D-FFT)

(b) 2D, 3D and 1D-FFT (3D-stacked DRAM, conf-E)

64 1600

Naïve (perf.) DRAM-opt (perf.) Naïve (power eff.) DRAM-opt (power eff.) 48 1200

32 800

16 400 Performance(GFLOPS) Power Eff. (GFLOPS/W)Eff. Power 0 0 9 10 11 12 13 14 15 16 7 8 9 10 11 16 18 20 22 24 26 28 30 2D-FFT 3D-FFT 1D-FFT FFT size (log(N) for NxN 2D-FFT, NxNxN 3D-FFT and N-point 1D-FFT)

Figure 8.10: Overall system performance and power efficiency comparison between naive and DRAM- optimized implementations for 1D, 2D and 3D FFTs using memory configurations conf-A and conf-D respectively. tions have naive unoptimized DRAM access patterns in contrast to DRAM friendly access patterns of the optimized implementations. Further, DRAM-optimized implementations are “pareto opti- mal" such that they are specifically fitted to the target platform for the best performance per power

(i.e. GFLOPS/Watt) by the design generator tool. Although the best designs in terms of power efficiency vary depending on the problem and platform configurations as shown in Figure 8.10, generated DRAM-optimized designs can achieve up to 6.5x and 6x improvements in overall sys- tem performance and power efficiency respectively over naive baseline implementations. Also, to provide a point of reference, modern GPUs and CPUs can achieve only a few GFLOPS/W for the problem sizes that we are concerned with [19, 40]. For example, cuFFT 5.0 on the recent Nvidia

K20X reaches approximately 2.2 GFLOPS/W where machine peak is 16.8 GFLOPS/W [15]. On the other hand, our DRAM-optimized hardware accelerators achieve up to nearly 50 GFLOPS/W

(see Figure 8.10).

Figure 8.11: DRAM energy and bandwidth utilization for naive (nai) and DRAM-optimized (opt) implementations of selected FFTs (8k x 8k 2D-FFT, 1k x 1k x 1k 3D-FFT, and 64M-point 1D-FFT) and memory configurations. ((a) Off-chip DRAM, conf-A/B/C; (b) 3D-stacked DRAM, conf-D/E/F/G. Energy consumption [J/GB] is broken into refresh, read/write, activate/precharge, and static components; bandwidth utilization is in GB/s.)

Improvements in the main memory itself are at the core of our framework. In Figure 8.11, pairs of bars compare DRAM-optimized (opt) and naive baseline (nai) implementations for various memory configurations (see Table 8.7). The results show that the DRAM-optimized accelerators for 1D, 2D and 3D FFTs achieve up to 7.9x higher bandwidth utilization and 5.5x lower energy consumption in main memory for off-chip DRAM, and up to 40x higher bandwidth utilization and 4.9x lower energy consumption in main memory for 3D-stacked DRAM. Because 3D-stacked DRAMs are more sensitive to strided access patterns, they see a larger improvement from the optimized access patterns. Another interesting point is that the generated 1D, 2D and 3D FFT designs have very similar, efficient tiled access patterns, which allows them to achieve bandwidth and energy efficiency near the theoretical maximum. Consequently, in Figure 8.11 we observe that the optimized implementations of different FFTs with the same memory configuration reach almost the same DRAM bandwidth and energy consumption.

In Figure 8.11, the bars representing DRAM energy consumption are broken into refresh, read/write, activate/precharge and static components. An important observation is that the activate/precharge energy is significantly reduced, almost eliminated, compared to the baseline. Effective usage of the DRAM row buffer in our DRAM-optimized implementations leads to a very low row buffer miss rate, which significantly reduces the total activate/precharge energy consumption. As expected, the read/write energy stays the same since the same amount of data is read and written in both the naive and the optimized implementations. The DRAM-optimized systems have higher bandwidth utilization, which allows them to finish the data transfer quicker; this yields a notable reduction in the total refresh and static energy consumption.

Therefore, in summary, effective usage of the DRAM row buffer not only directly reduces the activate/precharge energy and improves the achieved bandwidth, but also reduces the total static and refresh energy by transferring the same amount of data faster.
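The qualitative argument above can be summarized with a simple back-of-the-envelope energy model, sketched below. All constants (activate/precharge energy, per-bit read/write energy, static plus refresh power, access granularity) are placeholder assumptions for illustration, not calibrated values from this evaluation.

```python
def dram_energy_j(bytes_moved, bandwidth_gbs, row_miss_rate,
                  access_bytes=64,        # assumed access granularity
                  e_act_pre_nj=15.0,      # assumed energy per activate/precharge pair
                  e_rw_pj_per_bit=5.0,    # assumed read/write energy per bit
                  p_static_w=0.5):        # assumed static + refresh power
    """Rough per-transfer energy estimate, split into the Figure 8.11 components."""
    accesses = bytes_moved / access_bytes
    transfer_time_s = bytes_moved / (bandwidth_gbs * 1e9)
    e_act_pre = row_miss_rate * accesses * e_act_pre_nj * 1e-9   # fewer row misses -> less act/pre
    e_rd_wr = bytes_moved * 8 * e_rw_pj_per_bit * 1e-12          # identical for nai and opt
    e_static = p_static_w * transfer_time_s                       # shrinks as bandwidth grows
    return {"act/pre": e_act_pre, "rd/wr": e_rd_wr, "static+refresh": e_static}

# Naive: low bandwidth, mostly row-buffer misses.  Optimized: near-peak bandwidth, few misses.
naive = dram_energy_j(1 << 30, bandwidth_gbs=40, row_miss_rate=0.9)
opt = dram_energy_j(1 << 30, bandwidth_gbs=300, row_miss_rate=0.05)
```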

8.6.3 Explicit Data Reorganization

The discussions and evaluations demonstrate that the best performance and energy efficiency can only be achieved by a co-design framework where memory access, compute dataflow and hardware datapath are co-optimized for a target architecture model. In this co-optimization framework, efficient memory access is an essential part, and the block data layout is the key to unlocking the optimized implementations. So far, we have focused on the improvements that can be achieved with the optimized data layout. In this section, we evaluate the overhead of such a data layout transformation. In particular, we demonstrate efficient data layout transformations performed by HAMLeT integrated in 3D-stacked DRAM.

Figure 8.12 and Figure 8.13 demonstrate the data reorganization overhead for transforming the row-major data layout into tiled and cubic data layouts for 2D and 3D FFTs respectively. Baseline implementations use row-column algorithms where the datapath is directly mapped onto the target

3D-stacked architecture model. Optimized implementations, on the other hand, are selected from pareto-optimal design points which employ a blocked data layout algorithm. The latency and energy consumption of the optimized implementations are broken down to show the overhead of the explicit data reorganization handled by HAMLeT in the logic layer. We observe that, especially for large problems, both the energy and latency overheads of the data reorganization are amortized by the optimized memory accesses.

Figure 8.12: Overall latency and DRAM energy improvement with data reorganization for 2D FFT using HAMLeT in the logic layer. ((a) Latency and (b) DRAM energy versus problem size (128 to 8k), with the optimized (opt) bars broken into fft and reshape components next to the baseline (base).)

Figure 8.13: Overall latency and DRAM energy improvement with data reorganization for 3D FFT using HAMLeT in the logic layer. ((a) Latency and (b) DRAM energy versus problem size (64 to 512), with the same fft/reshape breakdown.)

Chapter 9

Related Work

9.1 Planar Near Data Processing

Integrating processing elements near memory has been studied in the past under various technology contexts, with different proximities between the processing elements and the memory.

PIM and DIVA. Both PIM (processing in memory) [57] and DIVA (data intensive architecture) [48] make the observation that data movement between the processing elements and the memory becomes a critical bottleneck. The PIM architecture integrates a serial bit processor interfaced at the output path of an SRAM block. The processor executes simple ALU and data movement commands in SIMD fashion. DIVA builds upon the PIM idea to create a scalable system that features multiple PIM units. DIVA targets DRAM-based memory; however, its prototype is manufactured using SRAMs. DIVA includes caches, support for virtual address translation, and pipelined control units as well as scalar/vector ALUs. This architecture allows it to target both simple and complex irregular applications, ranging from matrix transpose to database queries.

Active-pages, IRAM and FlexRAM. IRAM [93], FlexRAM [67] and active-pages [90] mainly focus on planar DRAM-based near-data computing. The Berkeley IRAM (intelligent RAM) project seeks a technology in which both the processor and the DRAM can be implemented together. The goal is to tackle the von Neumann bottleneck via a tightly integrated DRAM and processor on the same die. Vector IRAM (V-IRAM) combines DRAM with a vector processor integrated on the same chip, targeting 0.13 µm technology [74]. FlexRAM [67], on the other hand, is mainly motivated by general purpose computing near DRAM. The proposed architecture features processing arrays finely interleaved with DRAM macros, and the overall system features both plain DRAM and NDP DRAM, which are utilized based on the application. Finally, active-pages [90] features reconfigurable logic elements integrated near DRAM. An active page consists of a page of data and a set of associated functions which can operate upon that data.

The main limitations of the planar NDP approaches are manufacturing cost and the resulting performance.

These approaches target implementing the processing elements and the DRAM cells using the same technology. Due to the different process technologies of DRAM and logic, integrating DRAM and logic on the same die requires advanced fabrication technologies. Even then, the performance of logic elements fabricated in a DRAM technology is much more limited compared to conventional ones. On the other hand, DRAM implemented on-chip, such as embedded DRAM (eDRAM) [28], is limited in memory capacity.

9.2 3D-stacking Based NDP

Recent 3D-stacking technology integrates the different process technologies of DRAM and custom logic [16, 65, 94] as separate dies using vertical TSV connections. This enables closely integrated heterogeneous dies, where the logic elements are implemented in a highly efficient conventional transistor process technology and the DRAM dies are implemented in a DRAM process technology. This technology revives the NDP concepts that were proposed decades ago but suffered mainly from manufacturing difficulties, quality, and cost.

3D-stacking technology is an attractive option for a variety of NDP architectures [20, 22, 23, 50, 59, 64, 71, 76, 88, 97, 114, 117, 120]. These architectures can be divided into two main categories: general purpose processing (e.g. [50, 64, 76, 88, 97, 114, 117]) and application-specific approaches (e.g. [20, 22, 23, 59, 120]).

General Purpose. General purpose processing is mainly achieved by programmable CPUs [64, 76, 97, 114], vector processors [88] and GPU units [117]. There are also proposals that target reconfigurable logic integrated on top of DRAM via vertical TSVs [50]. These approaches require a separate layer to implement the area- and power-hungry programmable processors [50, 71, 76, 114]. On the other hand, the area and power budget in the logic layer is extremely limited for implementing general purpose processors [64, 88, 97, 117].

[117] reports that, due to the limited compute capability in 3D-stacked DRAM based NDP, workloads with high compute intensity and low memory intensity can even suffer slowdowns with NDP. Furthermore, [64] demonstrates that in order to utilize the available bandwidth provided by 3D-stacked DRAM, more than 200 PIM processing elements are required, which exceeds the TDP power consumption constraint. To tackle the power consumption and thermal impact constraints, [97] features low-EPI programmable processors architected as a many-core. Further, to stay within the original power envelope, it disables some of the SerDes links, sacrificing off-stack communication bandwidth.

Domain-Specific. Domain-specific hardware is more effective in providing high throughput from the limited area and power budget. However, the scope of domain-specific hardware is limited compared to programmable processors. There are several works integrating specialized hardware with 3D-stacked DRAM [20, 22, 23, 59, 120]. Most of these approaches focus on separate accelerator dies for accelerating data-intensive applications including FFTs, synthetic aperture radar imaging and sparse matrix computations [20, 59, 120]. A separate die for the accelerator relaxes the area consumption constraints and minimizes the possible thermal hotspots. However, integration in the logic layer unlocks access to the peak internal bandwidth [22, 23].

9.3 Software Based Data Layout Transformation

Compiler Based. There exist extensive compiler-based program transformation frameworks aiming to optimize data locality [66, 80, 112, 113]. Frameworks focusing on loop transformations provide several methods to restructure loop nests to better exploit data locality and parallelism [80, 112, 113]. However, purely loop-transformation-based optimizations are limited by data dependencies. [66] extends the loop transformations (permutation, tiling) with data layout optimizations. The proposed algorithm transforms the loop nests and changes the memory layouts of multidimensional arrays in a unified framework, where the data layout transformation does not impose any data dependencies. A limitation of the static compiler-based approach is that it does not capture dynamic runtime information (the ambiguous memory dependence problem).

GPU-centric. Furthermore, there exist GPU-centric data layout transformation frameworks [36, 108]. Memory coalescing, combining multiple memory accesses into a single transaction, is key for efficient processing with GPUs. Dymaxion and DL [36, 108] provide APIs for a data index transformation mechanism that allows users to reorganize the layout of data structures such that they are optimized for localized, contiguous access. In this way they aim to coalesce thread accesses efficiently. Given a PCIe-connected GPU with high-bandwidth graphics memory on the device, these frameworks can overlap the data layout transformation with the slow data transfer over PCIe in a software-pipelined manner.

9.4 Hardware Assisted Data Reorganization

There are various proposals for hardware-assisted data migration for better data placement [47, 100, 105, 107]. The main idea is to migrate or reorganize the data in the memory system to efficiently exploit locality or the tradeoffs between different memory technologies. [47, 100, 105] focus on migrating data in heterogeneous memory systems such as PCM+DRAM or DRAM+3D-DRAM, whereas [107] focuses on reorganization within memory to exploit the row buffer efficiently. These techniques rely on the memory controller for hardware-based memory access monitoring through counters. For example, [107] keeps counters for pages and uses hashing to keep track of several pages; [105] keeps competing counters.

A key problem in hardware-based migration is that it changes the physical location of the data, which invalidates the current virtual-to-physical mapping in the page table or TLB. To handle this problem, these techniques either employ a hardware-based indirection mechanism or update the page table entries via OS calls. Hardware-based indirection is implemented via lookup tables that store the original and the remapped locations of the data. Such an address translation lookup table does not scale to support large-scale and fine-grain data reorganizations. To manage data migration at a finer granularity, [107] proposes micro-pages, i.e. pages smaller than conventional OS pages. Furthermore, [100] keeps the lookup table small and updates the OS page tables periodically, or when the lookup table becomes full. This thesis focuses on permutation-based data movement schemes where the address translation is calculated on the fly. This enables very fine-grain data movements (e.g. at cache line granularity). Furthermore, it also features a lookup table based solution to enable multiple translations for different partitions in the memory. This creates a flexible hardware substrate that can handle fine-grain data reorganizations at low cost.
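As a software-level illustration of on-the-fly, permutation-based remapping, the sketch below applies a bit-field permutation to a physical address; the field widths and the particular field swap are assumptions for the example and do not reflect the actual HAMLeT address format.

```python
# Generic bit-field permutation of an address, computed combinationally.
COL_BITS, ROW_BITS = 7, 14   # assumed address-field widths (cache-line granularity)

def remap(addr):
    """Swap two address bit fields. Because the mapping is a bijection on the
    address space, no translation table is needed: both the reorganization
    engine and later accesses can evaluate it on the fly."""
    col = addr & ((1 << COL_BITS) - 1)
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)
    rest = addr >> (COL_BITS + ROW_BITS)
    return (rest << (COL_BITS + ROW_BITS)) | (col << ROW_BITS) | row
```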

There exist specialized hardware solutions for data movement and reorganization [22, 23, 45, 104, 119]. In [104, 119] the authors focus on bulk copy and movement operations. [119] targets a DMA-like on-chip copy engine to accelerate data movement in server platforms. On the other hand, [104] brings the bulk copy operations closer to the memory by modifying the internal structure of the DRAM. Moreover, [34] provides hardware-based indirection in the memory controller through address remapping. Address remapping without physical relocation can consolidate accesses via indirection, but it does not solve the fundamental data placement problem. FPGA-based data reorganization engines have been studied in previous work as well [45]. Such hardware accelerators still suffer from memory latency, even though data reorganization operations are ideally memory-to-memory such that the roundtrip data movement between the CPU and DRAM can be eliminated. This thesis, along with the accompanying publications [22, 23], proposes a data reorganization accelerator integrated in 3D-stacked DRAM to minimize the roundtrip latency.

9.5 Kronecker Product Formalism for Hardware Design

The tensor (Kronecker) formula language used in this thesis has previously been used in various contexts to describe transform algorithms. These include FFT hardware algorithms [83] and software-based optimized implementations targeting multicores [54], SIMD vector units [79] and distributed memory architectures [37]. This tensor formula language is the basis for the SPL language and the Spiral project [98, 116].

From the hardware design perspective, [81, 82, 83, 99] demonstrate a compilation and optimization framework from a mathematical representation to a hardware datapath. This framework takes high-level hardware descriptions expressed as formulas, automatically generates an algorithm, and then maps the algorithm to a datapath. It uses the tensor-based formula language to capture hardware datapath reuse.

Furthermore, [19, 20, 21, 24] use the formal tensor framework to capture memory accesses and data layout. [21, 24] extend the mathematical framework to efficiently optimize memory access patterns by using block data layouts. [20] uses the formal framework to explore the tradeoffs in the design space of DRAM-optimized hardware implementations.

Specifically, [82, 99] demonstrate hardware implementations for streaming permutations using the tensor-based formal framework. This thesis, however, considers permutations as the basic building blocks for data reorganizations. The work presented in this thesis (as partly shown in [23]) is the first to utilize permutations and the tensor formula language to manipulate data reorganizations for efficient DRAM access and in-memory, hardware-based operation.

Chapter 10

Conclusions and Future Directions

This thesis presents a formal framework that expresses key memory optimizations, such as transforming the data layout, reordering the memory access pattern and changing the address mapping, through permutations. Permutations represented as matrices are systematically restructured to derive various implementation alternatives exploiting parallelism and locality in memory.
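As a small software-level illustration of this view (not the thesis's hardware mechanism), the sketch below expresses a row-major to tiled layout change of an n x n array as a permutation of flat indices; the array and tile sizes are arbitrary example values.

```python
# Row-major -> tiled (block) layout change expressed as a permutation of flat
# indices. Any such layout transformation is a bijection on {0, ..., n*n - 1},
# which is what allows it to be written as a permutation matrix in the framework.

def row_major_to_tiled(idx, n, b):
    """Map a row-major flat index to its position in a b x b tiled layout."""
    r, c = divmod(idx, n)
    tile = (r // b) * (n // b) + (c // b)   # which tile, in tile-row-major order
    offset = (r % b) * b + (c % b)          # position inside the tile
    return tile * b * b + offset

n, b = 8, 4
perm = [row_major_to_tiled(i, n, b) for i in range(n * n)]
assert sorted(perm) == list(range(n * n))   # it is indeed a permutation
```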

Driven by the implications from the formal framework, this thesis presents the HAMLeT architecture for highly concurrent, energy-efficient and low-overhead data reorganization performed completely in memory. Memory access optimizations derived by the formal framework are directly mapped onto the HAMLeT architecture. HAMLeT pursues a near-data processing approach exploiting 3D-stacked DRAM technology. Its parallel streaming architecture extracts high throughput via simple modifications to the logic layer, keeping the DRAM layers unchanged.

Enabled by the efficient dynamic data reorganization capability, this thesis demonstrates solutions for the conflict between memory access patterns and data layouts through several fundamental use cases. Data reorganization in memory provides an efficient substrate to improve memory system performance transparently. Explicit operation, on the other hand, gives the ability to offload and accelerate common data reorganization routines. Explicit data reorganization also exposes the key memory characteristics to the algorithm design space and creates opportunities for algorithm/architecture co-design.

10.1 Future Directions

The ideas presented in this thesis are preliminary and open up many new questions. Further investigation is needed in the following directions:

10.1.1 Data Reorganization for Irregular Applications

We make the key observation that permutations can represent memory access optimizations such as transforming data layouts, changing the address mapping or reordering the access patterns. Although this range of operations includes a variety of useful data reorganizations, as demonstrated in this thesis, more investigation is needed for irregular applications. As demonstrated, dynamic data reorganization can improve the memory access performance of general purpose workloads. However, this improvement is much smaller than for explicitly optimized regular applications such as MKL routines or FFTs. There exist applications that exhibit certain levels of regularity despite having a complex dataflow. For example, stencil computations, convolutions and wavefront-propagation type workloads, which suffer from inefficient memory system performance, are good candidates that are worth further investigation.

10.1.2 General Purpose NDP

In-memory data reorganization is a special case of near-data processing (NDP). This thesis provides hardware/software mechanisms that handle the system integration issues by relying on several properties of data reorganization. Reorganization does not operate on the actual data values; it only relocates the data to improve memory performance when the data is later accessed. Hence it gives flexibility in handling coherence. Furthermore, the address remapping for permutation-based data reorganizations is handled in hardware, which eases virtual memory integration. However, general purpose near-data processing requires tighter integration with the host processor system.

Cache Coherence. First, NDP implementations need to ensure that the data they operate on is the valid, most recent copy. To achieve that, the NDP can be treated as a cache-coherent processor integrated in the system. Hence, its memory accesses are handled by the memory management unit (MMU), which requires directory accesses, page translations, and coherence updates/invalidations. Depending on the type of workload, this can create significant coherence traffic to flush the dirty copies of the data from the on-chip caches to the memory. For a highly irregular dataset, this can involve several associative look-ups to clean up the cache. If the application's entire dataset is relatively small, such that a large portion fits in the caches, NDP can even degrade the overall performance due to coherence traffic.

In our analysis, we provided a software-issued cache flush mechanism as a baseline. This mechanism can be improved such that only the NDP operands that exist in the caches are flushed if dirty and invalidated if valid. For a simple dataset this can reduce the overall coherence traffic. There are other hybrid hardware/software approaches that should be investigated further to enable efficient coherent NDP.

Virtual Memory. Operating on large contiguous memory spaces can be handled relatively easily.

Providing the NDP with a large, contiguously allocated physical memory space, specified by a base address and total allocated size, can be sufficient for regular workloads. The NDP implementation can be treated as a memory-mapped I/O device, and the NDP and the host processor can communicate through a pre-allocated memory region. However, pointer chasing or indirect gather/scatter type applications require virtual-to-physical address translation very frequently. This can be achieved via hardware page walkers and TLBs implemented on the NDP side, which also brings up the issue of page-fault handling at the NDP side. The NDP can also be handled through an IOMMU mechanism. There are certain advantages and disadvantages of these approaches that need to be researched.
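A hypothetical sketch of the memory-mapped, pre-allocated-region style of host/NDP communication described above is shown below; the descriptor layout, field names and opcode values are invented for illustration and are not an interface defined by this thesis.

```python
import struct

# Hypothetical command descriptor the host would write into the shared region:
# opcode (I), source base (Q), destination base (Q), size in bytes (Q), layout id (I).
DESC_FMT = "<IQQQI"   # assumed little-endian packing

def pack_reorg_command(opcode, src_base, dst_base, nbytes, layout_id):
    return struct.pack(DESC_FMT, opcode, src_base, dst_base, nbytes, layout_id)

# Example: request a row-major -> tiled reorganization of a 256 MB region.
cmd = pack_reorg_command(opcode=1, src_base=0x0000_0000,
                         dst_base=0x1000_0000, nbytes=256 << 20, layout_id=3)
# In a real system this descriptor would be written into the pre-allocated
# region (e.g. via an mmap of the reserved range) and a doorbell would be rung.
```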

Figure 10.1: Graph traversal progress in breadth-first search (the traversal frontier separates level n-1 from level n).

10.1.3 Graph Traversal in 3D-stacked DRAM

Graphs are fundamental data representations that have been used extensively in various domains.

Exploring large graphs efficiently is a challenging task due to irregular memory access patterns, limited data reuse and long chains of pointer chasing. Cache-based memory hierarchies and prefetchers in modern processors cannot successfully address these problems. Furthermore, the traversal throughput is mainly limited by the memory bandwidth [73]. As part of this thesis we also present preliminary work on graph query operations via DRAM-optimized hardware accelerators within 3D-stacked DRAM. The abundant internal parallelism and high bandwidth within the 3D-stacked DRAM enable efficient, fast, and parallel traversal. A large number of nodes are traversed using the high internal bandwidth, while only a fraction of the nodes are filtered and transferred over the slow off-chip bus. Further, the roundtrip data movement between the memory and the processor due to the long chains of pointer chasing is minimized via in-memory traversal.
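A level-synchronous BFS with a filtering predicate, sketched below, illustrates this usage model in software: the traversal touches many nodes, but only the (typically small) filtered result set would need to leave the stack. The adjacency-list representation and the predicate are illustrative assumptions.

```python
from collections import deque

def filtered_bfs(adj, root, keep):
    """Level-synchronous BFS over an adjacency-list graph; returns only the
    nodes that satisfy the filter, mirroring a graph-query style reduction."""
    visited = {root}
    frontier = deque([root])
    results = []
    while frontier:
        v = frontier.popleft()
        if keep(v):
            results.append(v)        # only these would cross the off-chip link
        for u in adj[v]:
            if u not in visited:
                visited.add(u)
                frontier.append(u)
    return results

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(filtered_bfs(adj, 0, keep=lambda v: v % 2 == 1))   # -> [1, 3]
```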

Analysis of Graph Query Operations. There is a great variety of graph applications, but graph processing algorithms are mainly divided into two groups [75]: graph computation and graph queries. Graph computation involves vertex-centric computation over the whole graph, typically for several iterations (e.g. belief propagation, matrix factorization, community detection). Graph queries, however, are reduction operations where only a fraction of the vertices and/or edges are filtered (e.g. search, shortest path, connected components). Hence graph queries are a potential candidate for near-memory acceleration.

Figure 10.2: Architectural overview of the traversal engine: vault and link controllers connected through a crossbar switch in the logic layer, and multiple traversal engines (TE) with response/request queues, node/edge decode, next-node address calculation, payload/attribute access, and decision logic under local and global control.

For graph data structures, traversal is the fundamental operation that enables structured exploration and manipulation of the dataset. Graph traversal is the basic building block of various graph algorithms such as connected component analysis, A*, shortest path, centrality calculation, and breadth-first search [18, 95, 106]. Graph traversal efficiency, in terms of traversed edges per second (TEPS), is an important metric that is used to rank supercomputers [4]. Hence, graph traversal can be considered the fundamental primitive for hardware acceleration.

Architectural Design of a Traversal Engine. Graph traversal by nature consists of a series of simple operations: fetch node(s), decode and determine the next node(s), and request the next node(s). As the traversal progresses, several child nodes are discovered (e.g. in BFS), which allows an embarrassingly parallel search (see Figure 10.1). Therefore multiple branches of the graph can be traversed concurrently. The overall architecture includes multiple traversal engines (TE) to exploit this concurrency and increase the traversal throughput (see Figure 10.2).

Each traversal engine (TE) features queues for incoming nodes and outgoing requests (i.e. response and request queues). These queues are orchestrated by a global scheduling unit that issues the operations to the TEs and to the DRAM layers to maximize bandwidth utilization and traversal throughput. Data structure information is offloaded to the TEs such that, when a node is fetched, the TEs determine the payload, attributes, and next-node pointers. Each TE also features decision logic that checks whether a predefined condition is met to stop or continue the traversal. The overall architecture is shown in Figure 10.2.
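A simplified software model of a single TE's queue-driven loop is sketched below, assuming the decode/decision structure described above; the record layout and the helper functions (decode_node, wants_node, stop_predicate) are illustrative assumptions, not the hardware interface.

```python
from collections import deque

def traversal_engine(response_q: deque, request_q: deque, results: list,
                     decode_node, wants_node, stop_predicate):
    """Consume fetched nodes, emit requests for their neighbors,
    and collect only the (small) set of matching nodes."""
    while response_q:
        raw = response_q.popleft()
        node_id, attributes, next_ptrs = decode_node(raw)   # structure-only decode
        if wants_node(node_id, attributes):
            results.append(node_id)          # only filtered nodes leave the stack
        if stop_predicate(node_id, attributes):
            continue                         # prune this branch of the traversal
        for ptr in next_ptrs:
            request_q.append(ptr)            # global scheduler maps these to vaults/banks
```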

Our main goal is to utilize the internal bandwidth and parallelism of the 3D-stacked DRAM. To exploit the inter-vault parallelism of the independent TSV buses, the overall architecture features per-vault scheduling queues. The crossbar switch in the logic layer is used when a request in a queue is scheduled to a different vault. Further, the scheduling unit exploits the intra-vault parallelism of the multiple vertically stacked banks by carefully scheduling the requests generated by multiple TEs to different layers. Since the locality of the requests is limited, various DRAM scheduling policies (e.g. closed-page) can be explored to achieve the best performance and energy efficiency.

Graph Storage Format and Mapping to the Architecture. The storage format and layout of the graph are critical for traversal performance and energy efficiency. A potential storage format, as discussed in [110], is to separate the structural information from the payload data. The structural information, or connectivity information, stores the node IDs, source/sink relations, selected attributes, and pointers to the payload. Often the structural information, several hub nodes, or hot sub-graphs of large graphs can be kept in memory, which enables in-memory graph exploration. Separating the structural information is also useful since graph queries are mostly structural and do not need payload access (e.g. counting the nodes reachable within a certain depth, finding highly connected nodes, or finding the shortest path between two nodes).
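A minimal sketch of this structure/payload split, using a CSR-style connectivity layout, is shown below; the field names are illustrative and not the format used by the prototype.

```python
# Example graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0, 3}, 3 -> {}
row_ptr     = [0, 2, 3, 5, 5]     # per-node offsets into the edge array (structure)
col_idx     = [1, 2, 2, 0, 3]     # destination node IDs (structure)
payload_ptr = [0, 64, 128, 192]   # pointers/offsets into a separate payload store

def neighbors(v):
    """Structural query: walking the edges never touches the payload store."""
    return col_idx[row_ptr[v]:row_ptr[v + 1]]

print(neighbors(2))   # -> [0, 3]
```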

Although graph traversals exhibit limited spatial locality, the temporal locality can be exploited via specialized stacked memory hierarchies. We will explore mapping the frequently used structure information to SRAM, cache, and eDRAM based memory hierarchy options.

Future Work. As a first step, mapping efficient BFS, SSSP and CC algorithms to the developed parallel architecture with the specialized memory hierarchy can be evaluated using real-world graph examples such as social networks, the WWW, and communication networks [10].

High-performance and energy-efficient exploration of large graphs using DRAM-optimized hardware accelerators within 3D-stacked DRAM will potentially create opportunities for in-memory traversal of in-memory graph databases.

Bibliography

[1] AMD HBM. https://www.amd.com/Documents/High-Bandwidth-Memory-HBM.pdf.

[2] CACTI 6.5, HP labs. http://www.hpl.hp.com/research/cacti/.

[3] DesignWare library, Synopsys. http://www.synopsys.com/dw.

[4] The graph500. http://www.graph500.org/.

[5] Intel math kernel library (MKL). http://software.intel.com/en-us/articles/intel-mkl/.

[6] McPAT 1.0, HP labs. http://www.hpl.hp.com/research/mcpat/.

[7] DDR3-1600 DRAM datasheet, MT41J256M4, Micron. http://www.micron.com/parts/dram/ddr3-sdram.

[8] DDR3 SDRAM system-power calculator, Micron. http://www.micron.com/products/support/power-calc.

[9] Performance application programming interface (PAPI). http://icl.cs.utk.edu/papi/.

[10] Stanford network analysis project (SNAP). http://snap.stanford.edu/.

[11] Tezzaron semiconductor, diram4 3d memory. http://www.tezzaron.com/products/diram4-3d-

memory/.

[12] Gromacs. http://www.gromacs.org, 2008.

[13] Itrs interconnect working group, winter update. http://www.itrs.net/, Dec 2012.


[14] Memory scheduling championship (MSC). http://www.cs.utah.edu/ rajeev/jwac12/, 2012.

[15] CUDA toolkit 5.0 performance report, Nvidia. https://developer.nvidia.com/cuda-math-

library, Jan 2013.

[16] High bandwidth memory (HBM) dram. JEDEC, JESD235, 2013.

[17] Intel 64 and ia-32 architectures software developers.

http://www.intel.com/content/dam/www/public/us/en/documents

/manuals/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.pdf, October

2014.

[18] Virat Agarwal, Fabrizio Petrini, Davide Pasetto, and David A. Bader. Scalable graph explo-

ration on multicore processors. In Proceedings of the 2010 ACM/IEEE International Con-

ference for High Performance Computing, Networking, Storage and Analysis, SC ’10, pages

1–11, 2010. ISBN 978-1-4244-7559-9.

[19] Berkin Akin, Peter A. Milder, Franz Franchetti, and James C. Hoe. Memory bandwidth effi-

cient two-dimensional fast fourier transform algorithm and implementation for large problem

sizes. In 2012 IEEE 20th Annual International Symposium on Field-Programmable Custom

Computing Machines, FCCM 2012, 29 April - 1 May 2012, Toronto, Ontario, Canada, pages

188–191, 2012.

[20] Berkin Akin, Franz Franchetti, and James C. Hoe. Understanding the design space of

dram-optimized hardware FFT accelerators. In IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors, ASAP 2014, Zurich, Switzer-

land, June 18-20, 2014, pages 248–255, 2014.

[21] Berkin Akin, Franz Franchetti, and James C. Hoe. FFTS with near-optimal memory access

through block data layouts. In IEEE International Conference on Acoustics, Speech and

Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014, pages 3898–3902, 2014.

[22] Berkin Akin, James C. Hoe, and Franz Franchetti. Hamlet: Hardware accelerated memory

layout transform within 3d-stacked DRAM. In IEEE High Performance Extreme Computing

Conference, HPEC 2014, Waltham, MA, USA, September 9-11, 2014, pages 1–6, 2014.

[23] Berkin Akin, Franz Franchetti, and James C Hoe. Data reorganization in memory using 3d-

stacked dram. In Proceedings of the 42nd Annual International Symposium on Computer

Architecture, pages 131–143. ACM, 2015.

[24] Berkin Akin, Franz Franchetti, and James C. Hoe. Ffts with near-optimal memory access

through block data layouts: Algorithm, architecture and design automation. Journal of Signal

Processing Systems, 2015.

[25] Kursad Albayraktaroglu, Aamer Jaleel, Xue Wu, Manoj Franklin, Bruce Jacob, Chau-Wen

Tseng, and Donald Yeung. Biobench: A benchmark suite of bioinformatics applications.

In Performance Analysis of Systems and Software, 2005. ISPASS 2005. IEEE International

Symposium on, pages 2–9. IEEE, 2005.

[26] JEDEC Solid State Technology Association et al. Ddr4 sdram standard, jesd79-4a, 2013.

[27] Jean-Loup Baer and Tien-Fu Chen. An effective on-chip preloading scheme to reduce data

access penalty. In Supercomputing, 1991. Supercomputing’91. Proceedings of the 1991

ACM/IEEE Conference on, pages 176–186. IEEE, 1991.

[28] J. Barth. to the rescue. 2008.

[29] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D Hill, and Michael M Swift. Ef-

ficient virtual memory for big memory servers. In Proceedings of the 40th Annual Interna-

tional Symposium on Computer Architecture, pages 237–248. ACM, 2013.

[30] G. Baumgartner, A. Auer, D.E. Bernholdt, A. Bibireata, V. Choppella, D. Cociorva, X Gao,

R.J. Harrison, S. Hirata, S. Krishnamoorthy, S. Krishnan, C. Lam, Qingda Lu, M. Nooi-

jen, R.M. Pitzer, J. Ramanujam, P. Sadayappan, and A. Sibiryakov. Synthesis of high-

performance parallel programs for a class of ab initio quantum chemistry models. Proceed-

ings of the IEEE, 93(2):276–292, Feb 2005.

[31] Vaclav E Benes. Mathematical theory of connecting networks and telephone traffic, vol-

ume 17. Academic press New York, 1965.

[32] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The parsec benchmark

suite: Characterization and architectural implications. In Proceedings of the 17th interna-

tional conference on Parallel architectures and compilation techniques, pages 72–81. ACM,

2008.

[33] Aydin Buluç, Jeremy T Fineman, Matteo Frigo, John R Gilbert, and Charles E Leiserson.

Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed

sparse blocks. In Proceedings of the twenty-first annual symposium on Parallelism in algo-

rithms and architectures, pages 233–244. ACM, 2009.

[34] J. Carter, W. Hsieh, L. Stoller, M. Swanson, Lixin Zhang, E. Brunvand, A. Davis, Chen-Chi

Kuo, R. Kuramkote, M. Parker, L. Schaelicke, and T. Tateyama. Impulse: building a smarter

memory controller. In High-Performance Computer Architecture, 1999. Proceedings. Fifth

International Symposium On, pages 70–79, Jan 1999.

[35] Niladrish Chatterjee, Rajeev Balasubramonian, Manjunath Shevgoor, S Pugsley, A Udipi,

Ali Shafiee, Kshitij Sudan, Manu Awasthi, and Zeshan Chishti. Usimm: the utah simulated

memory module. 2012.

[36] Shuai Che, Jeremy W. Sheaffer, and Kevin Skadron. Dymaxion: Optimizing memory access

patterns for heterogeneous systems. In Proc. of Intl. Conf. for High Perf. Comp., Networking,

Storage and Analysis (SC), pages 13:1–13:11, 2011.

[37] Srinivas Chellappa, Franz Franchetti, and Markus Püeschel. Computer generation of fast

fourier transforms for the cell broadband engine. In Proceedings of the 23rd international

conference on Supercomputing, pages 26–35. ACM, 2009.

[38] Ke Chen, Sheng Li, N. Muralimanohar, Jung-Ho Ahn, J.B. Brockman, and N.P. Jouppi.

CACTI-3DD: Architecture-level modeling for 3D die-stacked DRAM main memory. In De-

sign, Automation Test in Europe (DATE), pages 33–38, 2012.

[39] Sangyeun Cho, Chanik Park, Hyunok Oh, Sungchan Kim, Youngmin Yi, and Gregory R

Ganger. Active disk meets flash: A case for intelligent ssds. In Proceedings of the 27th

international ACM conference on International conference on supercomputing, pages 91–

102. ACM, 2013.

[40] Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. Single-chip heterogeneous

computing: Does the future include custom logic, FPGAs, and GPGPUs? In Proc. of the

43th IEEE/ACM Intl. Symp. on Microarchitecture (MICRO), 2010.

[41] James W Cooley and John W Tukey. An algorithm for the machine calculation of complex

Fourier series. Mathematics of computation, 19(90):297–301, 1965.

[42] Raja Das, Mustafa Uysal, Joel Saltz, and Yuan-Shin Hwang. Communication optimizations

for irregular scientific computations on distributed memory architectures. Journal of parallel

and distributed computing, 22(3):462–478, 1994.

[43] Robert H Dennard, VL Rideout, E Bassous, and AR Leblanc. Design of ion-implanted

mosfet’s with very small physical dimensions. Solid-State Circuits, IEEE Journal of, 9(5):

256–268, 1974.

[44] Timothy O Dickson, Yong Liu, Sergey V Rylov, Bing Dang, Cornelia K Tsang, Paul S Andry,

John F Bulzacchelli, Herschel A Ainspan, Xiaoxiong Gu, Lavanya Turlapati, et al. An 8x

10-gb/s source-synchronous i/o system based on high-density silicon carrier interconnects.

Solid-State Circuits, IEEE Journal of, 47(4):884–896, 2012.

[45] Pedro C. Diniz and Joonseok Park. Data reorganization engines for the next generation

of system-on-a-chip fpgas. In Proceedings of the 2002 ACM/SIGDA Tenth International

Symposium on Field-programmable Gate Arrays, FPGA ’02, pages 237–244, 2002. ISBN

1-58113-452-5.

[46] Jaeyoung Do, Yang-Suk Kee, Jignesh M Patel, Chanik Park, Kwanghyun Park, and David J

DeWitt. Query processing on smart ssds: opportunities and challenges. In Proceedings of the

2013 ACM SIGMOD International Conference on Management of Data, pages 1221–1230.

ACM, 2013.

[47] Xiangyu Dong, Yuan Xie, Naveen Muralimanohar, and Norman P Jouppi. Simple but effec-

tive heterogeneous main memory with on-chip memory controller support. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Network-

ing, Storage and Analysis, pages 1–11. IEEE Computer Society, 2010.

[48] Jeff Draper, Jacqueline Chame, Mary Hall, Craig Steele, Tim Barrett, Jeff LaCoss, John

Granacki, Jaewook Shin, Chun Chen, Chang Woo Kang, et al. The architecture of the diva

processing-in-memory chip. In Proceedings of the 16th international conference on Super-

computing, pages 14–25. ACM, 2002.

[49] Ronald G. Dreslinski, David Fick, Bharan Giridhar, Gyouho Kim, Sangwon Seo, Matthew

Fojtik, Sudhir Satpathy, Yoonmyung Lee, Daeyeon Kim, Nurrachman Liu, Michael Wieck-

owski, Gregory Chen, Dennis Sylvester, David Blaauw, and Trevor Mudge. Centip3de: A

many-core prototype exploring 3d integration and near-threshold computing. Commun. ACM,

56(11):97–104, November 2013. ISSN 0001-0782.

[50] A. Farmahini-Farahani, Jung Ho Ahn, K. Morrow, and Nam Sung Kim. Nda: Near-dram

acceleration architecture leveraging commodity dram devices and standard memory mod-

ules. In High Performance Computer Architecture (HPCA), 2015 IEEE 21st International

Symposium on, pages 283–295, Feb 2015.

[51] Faith E Fich, J Ian Munro, and Patricio V Poblete. Permuting in place. SIAM Journal on

Computing, 24(2):266–278, 1995.

[52] Franz Franchetti and Markus Püschel. Encyclopedia of Parallel Computing, chapter Fast

Fourier Transform. Springer, 2011.

[53] Franz Franchetti and Markus Püschel. Encyclopedia of Parallel Computing, chapter Fast

Fourier Transform. Springer, 2011.

[54] Franz Franchetti, Markus Püschel, Yevgen Voronenko, Srinivas Chellappa, and José M. F.

Moura. Discrete Fourier transform on multicore. IEEE Signal Processing Magazine, special

issue on “Signal Processing on Platforms with Multiple Cores”, 26(6):90–102, 2009.

[55] Matteo Frigo and Steven G. Johnson. The design and implementation of FFTW3. Pro- ceedings of the IEEE, Special issue on “Program Generation, Optimization, and Platform

Adaptation”, 93(2):216–231, 2005.

[56] Sameh Galal and Mark Horowitz. Energy-efficient floating-point unit design. Computers,

IEEE Transactions on, 60(7):913–922, 2011.

[57] M. Gokhale, B. Holmes, and K. Iobst. Processing in memory: the terasys massively parallel

pim array. Computer, 28(4):23–31, Apr 1995.

[58] Kazushige Goto and Robert A. van de Geijn. Anatomy of high-performance matrix multipli-

cation. ACM Trans. Math. Softw., 34(3):12:1–12:25, May 2008. ISSN 0098-3500.

[59] Qi Guo, Nikolaos Alachiotis, Berkin Akin, Fazle Sadi, Guanglin Xu, Tze Meng Low, Larry

Pileggi, James C. Hoe, and Franz Franchetti. 3d-stacked memory-side acceleration: Accel-

erator and system design. In In the Workshop on Near-Data Processing (WoNDP) (Held in

conjunction with MICRO-47.), 2014.

[60] Qing Guo, Xiaochen Guo, Ravi Patel, Engin Ipek, and Eby G Friedman. Ac-dimm: associa-

tive computing with stt-mram. In ACM SIGARCH Computer Architecture News, volume 41,

pages 189–200. ACM, 2013.

[61] Fred Gustavson, Lars Karlsson, and Bo Kågström. Parallel and cache-efficient in-place ma-

trix storage format conversion. ACM Transactions on Mathematical Software (TOMS), 38

(3):17, 2012.

[62] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Ben-

jamin C Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. Understanding

sources of inefficiency in general-purpose chips. In ACM SIGARCH Computer Architecture

News, volume 38, pages 37–47. ACM, 2010.

[63] John L Henning. Spec cpu2006 benchmark descriptions. ACM SIGARCH Computer Archi-

tecture News, 34(4):1–17, 2006.

[64] M. Islam, M. Scrback, K.M. Kavi, M. Ignatowski, and N. Jayasena. Improving node-level

map-reduce performance using processing-in-memory technologies. In 7th Workshop on

UnConventional High Performance Computing held in conjunction with the EuroPar 2014,

UCHPC2014, 2014.

[65] J. Jeddeloh and B. Keeth. Hybrid memory cube new dram architecture increases density and

performance. In VLSI Technology (VLSIT), 2012 Symposium on, pages 87–88, June 2012.

[66] M. Kandemir, A. Choudhary, J. Ramanujam, and P. Banerjee. Improving locality using

loop and data transformations in an integrated framework. In Proceedings of the 31st An-

nual ACM/IEEE International Symposium on Microarchitecture, MICRO 31, pages 285–297,

1998.

[67] Yi Kang, Wei Huang, Seung-Moon Yoo, Diana Keen, Zhenzhou Ge, Vinh Lam, Pratap Pat-

tnaik, and Josep Torrellas. Flexram: Toward an advanced intelligent memory system. In

Computer Design (ICCD), 2012 IEEE 30th International Conference on, pages 5–14. IEEE,

2012.

[68] Stephen W. Keckler. Life after dennard and how i learned to love the picojoule. In Keynote

at the 44th Intl. Symposium on Microarchitecture, 2011.

[69] Stephen W. Keckler, William J. Dally, Brucek Khailany, Michael Garland, and David Glasco.

Gpus and the future of parallel computing. IEEE Micro, 31(5):7–17, 2011. ISSN 0272-1732.

[70] G. Kestor, R. Gioiosa, D.J. Kerbyson, and A. Hoisie. Quantifying the energy cost of data

movement in scientific applications. In Workload Characterization (IISWC), 2013 IEEE

International Symposium on, pages 56–65, Sept 2013. doi: 10.1109/IISWC.2013.6704670.

[71] Dae Hyun Kim, K. Athikulwongse, M. Healy, M. Hossain, Moongon Jung, I. Khorosh,

G. Kumar, Young-Joon Lee, D. Lewis, Tzu-Wei Lin, Chang Liu, S. Panth, M. Pathak,

Minzhen Ren, Guanhao Shen, Taigon Song, Dong Hyuk Woo, Xin Zhao, Joungho Kim,

Ho Choi, G. Loh, Hsien-Hsin Lee, and Sung-Kyu Lim. 3d-maps: 3d massively parallel pro-

cessor with stacked memory. In Solid-State Circuits Conference Digest of Technical Papers

(ISSCC), 2012 IEEE International, pages 188–190, Feb 2012.

[72] Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, and Onur Mutlu. A case for ex-

ploiting subarray-level parallelism (salp) in dram. In Computer Architecture (ISCA), 2012

39th Annual International Symposium on, pages 368–379. IEEE, 2012.

[73] Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, and Parthasarathy

Ranganathan. Meet the walkers: Accelerating index traversals for in-memory databases. In

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture,

MICRO-46, pages 468–479, 2013. ISBN 978-1-4503-2638-4.

[74] Christoforos E Kozyrakis, Stylianos Perissakis, David Patterson, Thomas Anderson, Krste

Asanovic, Neal Cardwell, Richard Fromm, Jason Golbus, Benjamin Gribstad, Kimberly Kee-

ton, et al. Scalable processors in the billion-transistor era: Iram. Computer, 30(9):75–78,

1997.

[75] Aapo Kyrola. Large-scale Graph Computation on Just a PC. PhD Dissertation, CS, CMU,

2014.

[76] Gabriel H. Loh. 3d-stacked memory architectures for multi-core processors. In Proc. of the

35th Annual International Symposium on Computer Architecture, (ISCA), pages 453–464,

2008.

[77] Krishna T Malladi, Frank A Nothaft, Karthika Periyathambi, Benjamin C Lee, Christos

Kozyrakis, and Mark Horowitz. Towards energy-proportional datacenter memory with mo-

bile dram. In 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[78] Mozhgan Mansuri, James E Jaussi, Joseph T Kennedy, T Hsueh, Sudip Shekhar, Ganesh

Balamurugan, Frank O’Mahony, Clark Roberts, Randy Mooney, and Bryan Casper. A scal-

able 0.128-to-1tb/s 0.8-to-2.6 pj/b 64-lane parallel i/o in 32nm cmos. In Solid-State Circuits

Conference Digest of Technical Papers (ISSCC), 2013 IEEE International, pages 402–403.

IEEE, 2013.

[79] Daniel McFarlin, Volodymyr Arbatov, Franz Franchetti, and Markus Püschel. Automatic

SIMD vectorization of fast Fourier transforms for the larrabee and avx instruction sets. In

International Conference on Supercomputing (ICS), 2011.

[80] Kathryn S. McKinley, Steve Carr, and Chau-Wen Tseng. Improving data locality with loop

transformations. ACM Trns. Prog. Lang. Syst., 18(4):424–453, 1996.

[81] Peter A. Milder, Franz Franchetti, James C. Hoe, and Markus Püschel. Formal datapath

representation and manipulation for implementing DSP transforms. In Design Automation

Conference (DAC), pages 385–390, 2008.

[82] Peter A. Milder, James C. Hoe, and Markus Püschel. Automatic generation of streaming

datapaths for arbitrary fixed permutations. In Design, Automation and Test in Europe (DATE),

pages 1118–1123, 2009.

[83] Peter A. Milder, Franz Franchetti, James C. Hoe, and Markus Püschel. Computer generation

of hardware for linear digital signal processing transforms. ACM Transactions on Design

Automation of Electronic Systems, 17(2), 2012.

[84] Gordon E Moore. Cramming more components onto integrated circuits. Proceedings of the

IEEE, 86(1):82–85, 1998.

[85] Guy M Morton. A computer oriented geodetic data base and a new technique in file sequenc-

ing. International Business Machines Company, 1966.

[86] O Mutlu and T Moscibroda. Parallelism-aware batch scheduling: Enhancing both perfor-

mance and fairness of shared dram systems. In Computer Architecture, 2008. ISCA’08. 35th

International Symposium on, pages 63–74. IEEE, 2008.

[87] Onur Mutlu and Thomas Moscibroda. Stall-time fair memory access scheduling for chip

multiprocessors. In Proceedings of the 40th Annual IEEE/ACM International Symposium on

Microarchitecture, pages 146–160. IEEE Computer Society, 2007.

[88] R. Nair, S.F. Antao, C. Bertolli, P. Bose, J.R. Brunheroto, T. Chen, C. Cher, C.H.A. Costa,

J. Doi, C. Evangelinos, B.M. Fleischer, T.W. Fox, D.S. Gallo, L. Grinberg, J.A. Gunnels,

A.C. Jacob, P. Jacob, H.M. Jacobson, T. Karkhanis, C. Kim, J.H. Moreno, J.K. O’Brien,

M. Ohmacht, Y. Park, D.A. Prener, B.S. Rosenburg, K.D. Ryu, O. Sallenave, M.J. Serrano,

P.D.M. Siegl, K. Sugavanam, and Z. Sura. Active memory cube: A processing-in-memory

architecture for exascale systems. IBM Journal of Research and Development, 59(2/3):17:1–

17:14, March 2015.

[89] Mike O’Connor. Highlights of the high bandwidth memory (hbm) standard.

[90] Mark Oskin, Frederic T. Chong, and Timothy Sherwood. Active pages: A computation model

for intelligent memory. In ISCA, pages 192–203, 1998.

[91] Dhinakaran Pandiyan and Carole-Jean Wu. Quantifying the energy cost of data movement

for emerging smart phone workloads on mobile platforms. In Workload Characterization

(IISWC), 2014 IEEE International Symposium on, pages 171–180. IEEE, 2014.

[92] N. Park, Bo Hong, and V.K. Prasanna. Tiling, block data layout, and memory hierarchy

performance. IEEE Transactions on Parallel and Distributed Systems, 14(7):640–654, July

2003.

[93] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas,

and K. Yelick. A case for intelligent ram. Micro, IEEE, 17(2):34–44, Mar 1997.

[94] J. T. Pawlowski. Hybrid memory cube (HMC). In Hotchips, 2011.

[95] Roger Pearce, Maya Gokhale, and Nancy M. Amato. Multithreaded asynchronous graph

traversal for in-memory and semi-external memory. In Proceedings of the 2010 ACM/IEEE

International Conference for High Performance Computing, Networking, Storage and Anal-

ysis, SC ’10, pages 1–11, 2010. ISBN 978-1-4244-7559-9.

[96] John W Poulton, William J Dally, Xi Chen, John G Eyles, Thomas H Greer, Stephen G Tell,

John M Wilson, and C Thomas Gray. A 0.54 pj/b 20 gb/s ground-referenced single-ended

short-reach serial link in 28 nm cmos for advanced packaging applications. 2013.

[97] Seth Pugsley, Jeffrey Jestes, Huihui Zhang, Rajeev Balasubramonian, Vijayalakshmi Srini-

vasan, Alper Buyuktosunoglu, Al Davis, and Feifei Li. NDC: Analyzing the impact of 3D-

stacked memory+logic devices on mapreduce workloads. In Proc. of IEEE Intl. Symp. on

Perf. Analysis of Sys. and Soft. (ISPASS), 2014.

[98] Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso,

Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen,

Robert W. Johnson, and Nicholas Rizzolo. SPIRAL: Code generation for DSP transforms.

Proc. of IEEE, special issue on “Program Generation, Optimization, and Adaptation”, 93

(2):232–275, 2005.

[99] Markus Püschel, Peter A. Milder, and James C. Hoe. Permuting streaming data using rams.

J. ACM, 56(2):10:1–10:34, April 2009. ISSN 0004-5411.

[100] Luiz E Ramos, Eugene Gorbatov, and Ricardo Bianchini. Page placement in hybrid memory

systems. In Proceedings of the international conference on Supercomputing, pages 85–95.

ACM, 2011.

[101] S. Rixner, W.J. Dally, U.J. Kapasi, P. Mattson, and J.D. Owens. Memory access scheduling.

In Proc. of the 27th Int. Symposium on Computer Architecture (ISCA), pages 128–138, 2000.

[102] P. Rosenfeld, E. Cooper-Balis, and B. Jacob. Dramsim2: A cycle accurate memory system

simulator. IEEE Comp. Arch. Letters, 10(1):16–19, 2011.

[103] Greg Ruetsch and Paulius Micikevicius. Optimizing matrix transpose in CUDA. Nvidia

CUDA SDK Application Note, 2009.

[104] Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gen-

nady Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and

Todd C. Mowry. Rowclone: Fast and energy-efficient in-dram bulk data copy and initial-

ization. In Proc. of the IEEE/ACM Intl. Symp. on Microarchitecture, MICRO-46, pages

185–197, 2013.

[105] Jaewoong Sim, Alaa R Alameldeen, Zeshan Chishti, Chris Wilkerson, and Hyesoon Kim.

Transparent hardware management of stacked dram as part of memory. In Microarchitecture

(MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pages 13–24. IEEE,

2014.

[106] Steven Skiena. The Algorithm Design Manual (2. ed.). Springer, 2008.

[107] Kshitij Sudan, Niladrish Chatterjee, David Nellans, Manu Awasthi, Rajeev Balasubramonian,

and Al Davis. Micro-pages: Increasing dram efficiency with locality-aware data placement.

In Proc. of Arch. Sup. for Prog. Lang. and OS, ASPLOS XV, pages 219–230, 2010.

[108] I-Jui Sung, G.D. Liu, and W.-M.W. Hwu. Dl: A data layout transformation system for

heterogeneous computing. In Innovative Parallel Computing (InPar), 2012, pages 1–11,

May 2012. doi: 10.1109/InPar.2012.6339606.

[109] Charles Van Loan. Computational frameworks for the fast Fourier transform. SIAM, 1992.

[110] Chad Vicknair, Michael Macias, Zhendong Zhao, Xiaofei Nan, Yixin Chen, and Dawn

Wilkins. A comparison of a graph database and a relational database: A data provenance

perspective. In Proceedings of the 48th Annual Southeast Regional Conference, ACM SE

’10, pages 42:1–42:6, 2010. ISBN 978-1-4503-0064-3.

[111] C. Weis, I. Loi, L. Benini, and N. Wehn. Exploration and optimization of 3-d integrated

dram subsystems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and

Systems, 32(4):597–610, April 2013.

[112] Michael E. Wolf and Monica S. Lam. A data locality optimizing algorithm. In Proceedings

of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implemen-

tation, PLDI ’91, pages 30–44, 1991.

[113] Michael E Wolf and Monica S Lam. A loop transformation theory and an algorithm to

maximize parallelism. Parallel and Distributed Systems, IEEE Transactions on, 2(4):452–

471, 1991.

[114] Dong Hyuk Woo, Nak Hee Seong, Dean L Lewis, and H-HS Lee. An optimized 3d-stacked

memory architecture by exploiting excessive, high-density tsv bandwidth. In High Perfor-

mance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on, pages

1–12. IEEE, 2010.

[115] Wm A Wulf and Sally A McKee. Hitting the memory wall: implications of the obvious.

ACM SIGARCH computer architecture news, 23(1):20–24, 1995.

[116] Jianxin Xiong, Jeremy Johnson, Robert W. Johnson, and David Padua. SPL: A language

and compiler for DSP algorithms. In Programming Languages Design and Implementation

(PLDI), pages 298–308, 2001.

[117] Dongping Zhang, Nuwan Jayasena, Alexander Lyashevsky, Joseph L. Greathouse, Lifan Xu,

and Michael Ignatowski. Top-pim: Throughput-oriented programmable processing in mem-

ory. In Proceedings of the 23rd International Symposium on High-performance Parallel and

Distributed Computing, HPDC ’14, pages 85–98, New York, NY, USA, 2014. ACM. ISBN

978-1-4503-2749-7.

[118] Zhao Zhang, Zhichun Zhu, and Xiaodong Zhang. A permutation-based page interleaving

scheme to reduce row-buffer conflicts and exploit data locality. In In Proceedings of the 33rd

Annual International Symposium on Microarchitecture, pages 32–41. ACM Press, 2000.

[119] Li Zhao, R. Iyer, S. Makineni, Laxmi Bhuyan, and D. Newell. Hardware support for bulk data

movement in server platforms. In Proc. of IEEE Intl. Conf. on Computer Design, (ICCD),

pages 53–60, Oct 2005.

[120] Qiuling Zhu, B. Akin, H.E. Sumbul, F. Sadi, J.C. Hoe, L. Pileggi, and F. Franchetti. A 3d-

stacked logic-in-memory accelerator for application-specific data intensive computing. In 3D

Systems Integration Conference (3DIC), 2013 IEEE International, pages 1–7, Oct 2013.