Compiler Techniques for Scalable Performance of Stream Programs on Multicore Architectures Michael I. Gordon

Compiler Techniques for Scalable Performance of Stream Programs on Multicore Architectures by Michael I. Gordon Bachelor of Science, Computer Science and Electrical Engineering Rutgers University, 2000 Master of Engineering, Computer Science and Electrical Engineering Massachusetts Institute of Technology, 2002 Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY June 2010 c Massachusetts Institute of Technology 2010. All rights reserved. Author................................................................... Department of Electrical Engineering and Computer Science May 21, 2010 Certified by . Saman Amarasinghe Professor Thesis Supervisor Accepted by . Terry P. Orlando Chair, Department Committee on Graduate Students 2 Compiler Techniques for Scalable Performance of Stream Programs on Multicore Architectures by Michael I. Gordon Submitted to the Department of Electrical Engineering and Computer Science on May 21, 2010, in partial fulfillment of the requirements for the degree of Doctor of Philosophy Abstract Given the ubiquity of multicore processors, there is an acute need to enable the development of scalable parallel applications without unduly burdening programmers. Currently, programmers are asked not only to explicitly expose parallelism but also concern themselves with issues of granularity, load-balancing, synchronization, and communication. This thesis demonstrates that when algorithmic parallelism is expressed in the form of a stream program, a compiler can effectively and automatically manage the parallelism. Our compiler assumes responsibility for low-level architec- tural details, transforming implicit algorithmic parallelism into a mapping that achieves scalable parallel performance for a given multicore target. Stream programming is characterized by regular processing of sequences of data, and it is a natural expression of algorithms in the areas of audio, video, digital signal processing, network- ing, and encryption. Streaming computation is represented as a graph of independent computation nodes that communicate explicitly over data channels. Our techniques operate on contiguous regions of the stream graph where the input and output rates of the nodes are statically deter- minable. Within a static region, the compiler first automatically adjusts the granularity and then exploits data, task, and pipeline parallelism in a holistic fashion. We introduce techniques that data-parallelize nodes that operate on overlapping sliding windows of their input, translating se- rializing state into minimal and parametrized inter-core communication. Finally, for nodes that cannot be data-parallelized due to state, we are the first to automatically apply software-pipelining techniques at a coarse granularity to exploit pipeline parallelism between stateful nodes. Our framework is evaluated in the context of the StreamIt programming language. StreamIt is a high-level stream programming language that has been shown to improve programmer pro- ductivity in implementing streaming algorithms. We employ the StreamIt Core benchmark suite of 12 real-world applications to demonstrate the effectiveness of our techniques for varying multicore architectures. For a 16-core distributed memory multicore, we achieve a 14.9x mean speedup. For benchmarks that include sliding-window computation, our sliding-window data-parallelization techniques are required to enable scalable performance for a 16-core SMP multicore (14x mean speedup) and a 64-core distributed shared memory multicore (52x mean speedup). Thesis Supervisor: Saman Amarasinghe Title: Professor 3 4 Acknowledgments I consider myself wonderfully fortunate to have spent the Oughts at the intellectual sanctuary that is MIT. In many ways it was a true sanctuary that offered me the freedom to pursue my passions in multiple fields, in multiple endeavors, and across multiple continents. During the experience, I learned from brilliant individuals, and made lifelong friends. I hope that MIT is a bit better for me having attended, and that I contributed to the advancement of science and technology in some small way. First and foremost, I wish to thank my advisor Saman Amarasinghe. I truly appreciate the freedom that Saman extended to me combined with his constant accessibility. Saman was always available when I needed help. He has an uncanny ability to rapidly understand a problem, put that problem in perspective, and help one to appreciate the important issues to consider when solving it. Saman believed in the importance of my work in development, and had much patience for (and faith in) my PhD work. I also thank my thesis committee members: Martin Rinard and Anant Agarwal. The best part of my MIT experience has been working with talented and passionate individuals. William Thies is my partner on the StreamIt project, and I am indebted to him for his collaboration and friendship. Bill is a wildly imaginative and brilliant individual without a hint of conceit. He possesses seemingly limitless patience and is fiercely honest, passionate, and compassionate. Sorry for gushing, but working with Bill was a great experience, and I miss our shared office. StreamIt is a large systems project, incorporating over 27 people (up to 12 at a given time). I led the parallelization efforts of the project including the work covered in this thesis. William Thies also contributed to the work covered in Chapters 4, 6, and 7 of this dissertation. William led the development of the StreamIt language, incorporating the first notions of structured streams as well as language support for hierarchical data reordering [TKA02, TKG+02, AGK+05]. Many other individuals made this research possible (see [Thi09] for the complete accounting). I thank all the members of the StreamIt group including Rodric Rabbah, Ceryen Tan, Michal Karczmarek, Allyn Dimock, Hank Hoffman, David Maze, Jasper Lin, Andrew Lamb, Sitij Agrawal, and Janis Sermulins, Matthew Drake, Jiawen Chen, David Zhang, Phil Sung, and Qiuyuan Li. I also thank member of the Raw team and the Tilera team: Anant Agarwal, Michael Taylor, Ian Bratt, Dave Wentzlaff, Jonathan Eastep, and Jason Miller. I thank all the members of the Computer Architec- ture Group at MIT who gave me feedback on my research: Ronny Krashinsky, Marek Olszewski, Samuel Larsen, Michael Zhang, Mark Stephenson, Steve Gerding, and Matt Aasted. I am grateful for all of the help of Mary McDavit. She constantly went above and beyond her duties to help me, and I always enjoyed our conversations on wide-ranging topics. I come from a small but close-knit family unit. My mother sacrificed much for me in my early years, and I was very lucky to be surrounded by multiple role models growing up. I am thankful that my family provided me with a loving and supportive environment. Thank you Mom, Dad (Steve), Bubby, Grandmom, Uncle Alan and Uncle Mark. I also thank the most recent addition to the family, Annette. In the last 5 years she has loved and supported me through graduate school and through all of my pursuits, no matter how far they have taken me from her. To the members of my family, I love you. In this final paragraph I thank the friends who added to the richness of life while at MIT. One problem with making friends at MIT is that they are all ambitious individuals who end up in disparate parts of the world. Technology has made it easier to stay it touch, but I will miss my 5 Band of Brothers: Michael Zhang, Sam Larsen, Ronny Krashinsky, Bill Thies, Marek Olszewski, Ian Bratt, Mark Stephenson, and Steve Gerding. I also thank my non-MIT friends for keeping me somewhat “normal”: Josh Kaplan, David Brown, Lisa Giocomo, Helen O’Malley, John Matogo, and Scott Schwartz. 6 Dedicated to the memory of my grandfather Ike (Zayde) Pitchon. A mensch who continues to inspire me... Contents 1 Introduction 17 1.1 Streaming Application Domain ............................ 20 1.2 The StreamIt Project ................................. 20 1.3 Multicore Architectures ................................ 22 1.4 Scheduling of Coarse-Grained Dataflow for Parallel Architectures ......... 24 1.5 Contributions ..................................... 27 1.6 Roadmap and Significant Results ........................... 28 2 Execution Model and the StreamIt Programming Language 31 2.1 Execution Model ................................... 31 2.2 The StreamIt Programming Language ........................ 36 2.3 Parallelism in Stream Graphs ............................. 41 2.4 High-Level StreamIt Graph Transformations: Fusion and Fission .......... 43 2.5 Sequential StreamIt Performance ........................... 44 2.6 Thoughts on Parallelization of StreamIt Programs .................. 45 2.7 Related Work ..................................... 49 2.8 Chapter Summary ................................... 51 3 A Model of Performance 53 3.1 Introduction ...................................... 53 3.2 Preliminaries ..................................... 54 3.3 Filter Fission ..................................... 55 3.4 Scheduling Strategies ................................. 56 3.5 Communication Cost ................................. 58 3.6 Memory Requirement and Cost ............................ 63 3.7 Load Balancing .................................... 64 3.8 Computation Cost ................................... 65 3.9 Discussion ....................................... 66 3.10 Related Work ....................................

Compiler Techniques for Scalable Performance of Stream Programs on Multicore Architectures Michael I. Gordon

Microcode Revision Guidance August 31, 2019 MCU Recommendations

Asrock G41C-VS Motherboard, a Reliable Motherboard Produced Under Asrock’S Consistently Stringent Quality Control

Toshiba Hardware Portfolio – Microsoft Support by Family

Class-Action Lawsuit

INTEL Central Processing Units (CPU) This Page of Product Is Rohs Compliant

Intel® Celeron® Processor 900 Series and Ultra Low Voltage 700 Series

HC19.21.810.45Nm Next Generation Intel® Core™ Microarchitecture

Nt* and Rtl* INT 2Eh CALL Ntdll!Kifastsystemcall

The Intel X86 Microarchitectures Map Version 2.0

Reference Guide for X86-64 Cpus

X86 Intrinsics Cheat Sheet Jan Finis [email protected]

Intel Transistor Innovation (Part 1)