Models for Parallel Computation in Multi-Core, Heterogeneous, and Ultra Wide-Word Architectures

Total Pages: 16

File Type: PDF, Size: 1020 KB

Models for Parallel Computation in Multi-Core, Heterogeneous, and Ultra Wide-Word Architectures

by Alejandro Salinger

A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Computer Science. Waterloo, Ontario, Canada, 2013. © Alejandro Salinger 2013.

I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public.

Abstract

Multi-core processors have become the dominant processor architecture, with 2, 4, and 8 cores on a chip widely available and an increasing number of cores predicted for the future. In addition, the decreasing costs and increasing programmability of Graphics Processing Units (GPUs) have made these an accessible source of parallel processing power in general-purpose computing. Among the many research challenges that this scenario has raised are the fundamental problems related to theoretical modeling of computation in these architectures. In this thesis we study several aspects of computation in modern parallel architectures, from the modeling of computation in multi-cores and heterogeneous platforms, through multi-core cache management strategies, to the proposal of an architecture that exploits bit-parallelism on thousands of bits.

Observing that in practice multi-cores have a small number of cores, we propose a model of low-degree parallelism for these architectures. We argue that assuming a small number of processors (logarithmic in a problem's input size) simplifies the design of parallel algorithms. We show that in this model a large class of divide-and-conquer and dynamic programming algorithms can be parallelized with simple modifications to sequential programs, while achieving optimal parallel speedups. We further explore low-degree parallelism in computation, providing evidence of fundamental differences in practice and theory between systems with a sublinear and a linear number of processors, and suggesting a sharp theoretical gap between the classes of problems that are efficiently parallelizable in each case.

Efficient strategies to manage shared caches play a crucial role in multi-core performance. We propose a model for paging in multi-core shared caches, which extends classical paging to a setting in which several threads share the cache. We show that in this setting traditional cache management policies perform poorly, and that any effective strategy must partition the cache among threads, with a partition that adapts dynamically to the demands of each thread. Inspired by the shared cache setting, we introduce the minimum cache usage problem, an extension of classical sequential paging in which algorithms must account for the amount of cache they use. This cache-aware model seeks algorithms with good performance in terms of both faults and the amount of cache used, and has applications in energy-efficient caching and in shared cache scenarios.

The wide availability of GPUs has added to the parallel power of multi-cores; however, most applications underutilize the available resources. We propose a model for hybrid computation in heterogeneous systems with multi-cores and GPUs, and describe strategies for generic parallelization and efficient scheduling of a large class of divide-and-conquer algorithms.

Lastly, we introduce the Ultra-Wide Word architecture and model, an extension of the word-RAM model that allows for constant-time operations on thousands of bits in parallel. We show that a large class of existing algorithms can be implemented in the Ultra-Wide Word model, achieving speedups comparable to those of multi-threaded computations, while avoiding the more difficult aspects of parallel programming.
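To make the abstract's first claim concrete, here is a minimal C++ sketch, our illustration rather than the thesis's formal model: a divide-and-conquer algorithm (merge sort) parallelized with a simple modification to the sequential program, spawning threads only near the top of the recursion so the processor count stays small, in the spirit of low-degree parallelism.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Sequential merge sort with one change: above a cutoff depth, the left
// half is sorted in a freshly spawned thread while the right half is
// sorted in the current one.
void merge_sort(std::vector<int>& a, std::size_t lo, std::size_t hi, int depth) {
    if (hi - lo < 2) return;
    std::size_t mid = lo + (hi - lo) / 2;
    if (depth > 0) {
        std::thread left(merge_sort, std::ref(a), lo, mid, depth - 1);
        merge_sort(a, mid, hi, depth - 1);
        left.join();
    } else {
        // Below the cutoff this is exactly the unmodified sequential code.
        merge_sort(a, lo, mid, 0);
        merge_sort(a, mid, hi, 0);
    }
    std::inplace_merge(a.begin() + lo, a.begin() + mid, a.begin() + hi);
}

int main() {
    std::vector<int> a = {5, 3, 8, 1, 9, 2, 7, 4};
    merge_sort(a, 0, a.size(), 2);   // at most 2^2 - 1 = 3 extra threads
}
```

With a cutoff depth of O(log log n), the number of threads stays logarithmic in the input size while the top of the recursion tree, where the parallelism pays off, is fully parallel.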
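The wide-word idea can also be previewed on ordinary 64-bit words. The following sketch uses a standard word-RAM (SWAR) trick, a stand-in for the Ultra-Wide Word model's much wider words rather than the thesis's actual construction, to compare eight packed 7-bit keys against a threshold in a constant number of word operations (assumes C++20 for std::popcount).

```cpp
#include <bit>       // std::popcount (C++20)
#include <cstdint>
#include <cstdio>

// Per-lane "x < y" for eight 7-bit values packed one per byte lane.
// Returns a word with 0x80 set in every lane where x's value is smaller.
uint64_t lanes_less_than(uint64_t x, uint64_t y) {
    const uint64_t H = 0x8080808080808080ULL;   // high bit of every lane
    // (x|H) - y never borrows across lanes; the lane's high bit survives
    // exactly when x >= y, so its complement flags x < y.
    return ~((x | H) - y) & H;
}

int main() {
    uint64_t keys = 0x0E0A06020D090501ULL;      // lanes 1,5,9,13,2,6,10,14
    uint64_t thr  = 0x0707070707070707ULL;      // the value 7 in every lane
    uint64_t lt   = lanes_less_than(keys, thr);
    std::printf("%d lanes below 7\n", std::popcount(lt));  // prints 4
}
```

On a word of thousands of bits, the same constant number of operations would compare correspondingly more keys at once, which is the source of the multi-thread-like speedups claimed above.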
Acknowledgements

Many people have contributed, directly or indirectly, to a happy conclusion of this thesis. Lucky me, I had not only one but two advisors to guide me through the tough path of a doctoral program. Most of the work in this thesis was carried out in direct collaboration with Professor Alejandro López-Ortiz. I am immensely grateful to him for guiding me in finding interesting problems, in identifying the right research questions, for providing me with crucial insights as to how to address them, and, not less importantly, for helping me appreciate the strengths of my results. His invaluable advice has helped me make the right decisions both in my career and in other important aspects of my life. I am also deeply grateful to my co-supervisor Professor Ian Munro for providing me with his remarkable insights on research problems as well as his wise career advice. Thank you Alex and Ian for enabling me to work in a friendly atmosphere. It has been a delight to work under your supervision.

I would also like to thank my thesis committee members, Professors Hiren Patel, Prabhakar Ragde, Roberto Solis-Oba, and Bernard Wong, for their comments and suggestions, which have contributed to the improvement of the contents and presentation of this thesis.

The results in this thesis would not have been possible without the important contribution of my co-authors. I thank Reza Dorrigiv, Alejandro López-Ortiz, Arash Farzan, Patrick Nicholson, and Robert Suderman for their ideas and hard work. I am also thankful to Jérémy Barbay, Francisco Claude, Robert Fraser, Shahin Kamali, Daniel Remenik, Gelin Zhou, and surely many others for the countless discussions that in one way or another shaped the research that led to the results in this thesis (and thanks Francisco for printing my thesis!).

I would like to thank Wendy Rush and Helen Jardine for offering their help whenever I needed it. Thank you as well to Professor Daniel Berry for the invitations to celebrate several holidays; it means a lot to those of us who are away from our homes. I am also very grateful to the Natural Sciences and Engineering Research Council of Canada (NSERC) for providing me with funding through my supervisors, and to the David R. Cheriton School of Computer Science for awarding me the Cheriton scholarship.

I have been very fortunate to have met many great friends during my stay in Waterloo. Thanks to everybody with whom I have shared the Algorithms Lab (in particular the long-lived clique BAPF): you made the lab my (n+1)-th home. To everybody who has proudly worn the jersey of Hopeless Experts since I started playing in my first term: it has been my pleasure to share our successes and disappointments with you, and I thank you for letting me play team manager for so many years. To all the friends that have in one way or another made my PhD experience a very enjoyable one (and whose names I omit for fear that I might forget someone), I thank you and hope that our paths will cross again.

I wish to thank my family for their love and support.
To my parents, René and Anamaría, who raised me to be who I am today and have supported me unconditionally in all my enterprises, academic or otherwise. I love you and I thank you for being the best parents anyone can ask for. To the sunshine of my life, my son Nicolás, who has taken some stress out of my studies by making me realize what the really important things in life are. Finally, a big thank you to my lovely wife Natasha. Thank you for painfully reading through my thesis and showing me where I should and where I should not put a comma. Thank you as well for making sure that I could concentrate on my work by taking care of everything else during the busiest times. Thank you for being so understanding and for always being there to listen and provide thoughtful advice. Your encouragement and support are an invaluable contribution to this thesis and to everything else in my life. I love you!

Dedication

To my parents

Table of Contents

List of Tables
List of Figures
List of Algorithms
1 Introduction
  1.1 Summary of Results and Structure of this Thesis
2 Parallel Computation
  2.1 Sequential Models of Computation
    2.1.1 Turing Machine
    2.1.2 Random Access Machine
  2.2 Parallelism in Computation: Flynn's Taxonomy
  2.3 Theoretical Modeling of Parallel Computation
    2.3.1 The Shared-Memory Model and the PRAM
    2.3.2 Performance Measures
    2.3.3 Network Models
    2.3.4 Communication
    2.3.5 Directed Acyclic Graphs
    2.3.6 Boolean Circuits and Parallel Complexity Classes
    2.3.7 Alternating Turing Machines
    2.3.8 Vector Machines
    2.3.9 P-Complete Problems
    2.3.10 Amdahl's Law
  2.4 Parallel Architectures
  2.5 Beyond the PRAM
    2.5.1 Variants of the PRAM Model
    2.5.2 Hierarchical Memory Models
    2.5.3 Bridging Models
  2.6 Aspects of Parallel Programming
    2.6.1 Scheduling Multi-Threaded Programs
  2.7 Basic Parallel Algorithm Design Techniques
    2.7.1 Balanced Trees
    2.7.2 Pointer Jumping
Recommended publications
  • High Performance Computing Through Parallel and Distributed Processing
Yadav S. et al., J. Harmoniz. Res. Eng., 2013, 1(2), 54-64. Journal of Harmonized Research (JOHR), Journal of Harmonized Research in Engineering 1(2), 2013, 54-64. ISSN 2347-7393. Original Research Article.

High Performance Computing through Parallel and Distributed Processing. Shikha Yadav, Preeti Dhanda, Nisha Yadav. Department of Computer Science and Engineering, Dronacharya College of Engineering, Khentawas, Farukhnagar, Gurgaon, India. For correspondence: preeti.dhanda01ATgmail.com. Received on: October 2013.

Abstract: There is a very high need for High Performance Computing (HPC) in many applications, from space science to Artificial Intelligence. HPC can be attained through parallel and distributed computing. In this paper, parallel and distributed algorithms are discussed, based on parallel and distributed processors, to achieve HPC. Programming concepts like threads, fork, and sockets are discussed with some simple examples for HPC.

Keywords: High Performance Computing, Parallel and Distributed Processing, Computer Architecture.

Introduction: Computer architecture and programming play a significant role in High Performance Computing (HPC) for large applications, from space science to Artificial Intelligence. Algorithms are problem-solving procedures that are later transformed into a particular programming language for HPC, so there is a need to study algorithms for High Performance Computing. These algorithms must be designed to compute in reasonable time, to solve large problems like weather forecasting, tsunami, remote sensing, national calamities, defence, mineral and space exploration, finite-element methods, cloud computing, and expert systems. The algorithms are non-recursive algorithms, recursive algorithms, parallel algorithms, and distributed algorithms. The algorithms must be supported by the computer architecture. Computer architecture is characterized by Flynn's classification: SISD, SIMD, MIMD, and MISD. Most computer architectures support SIMD (Single Instruction, Multiple Data streams).
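The thread-based parallelism the paper alludes to can be shown with a minimal C++ sketch (our illustration; we have not reproduced the paper's own examples): two threads sum disjoint halves of an array, the simplest shared-memory decomposition.

```cpp
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<long> data(1'000'000, 1);
    long left = 0, right = 0;
    auto mid = data.begin() + data.size() / 2;
    // Each thread sums its own half; no lock is needed because the two
    // threads write to disjoint variables.
    std::thread t([&] { left = std::accumulate(data.begin(), mid, 0L); });
    right = std::accumulate(mid, data.end(), 0L);
    t.join();
    long total = left + right;   // 1,000,000
    return total == 1'000'000 ? 0 : 1;
}
```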
  • Vertex Coloring
Lecture 1: Vertex Coloring. 1.1 The Problem. Nowadays multi-core computers get more and more processors, and the question is how to handle all this parallelism well. So, here's a basic problem: consider a doubly linked list that is shared by many processors. It supports insertions and deletions, and there are simple operations, like summing up the sizes of the entries, that should be done very fast. We decide to organize the data structure as an array of dynamic length, where each array index may or may not hold an entry. Each entry consists of the array indices of the next entry and the previous entry in the list, some basic information about the entry (e.g. its size), and a pointer to the lion's share of the data, which can be anywhere in memory.

[Figure 1.1: Linked list after initialization. Blue links are forward pointers, red links backward pointers (these are omitted from now on).]

We now can quickly determine the total size by reading the array in one go from memory, which is quite fast. However, how can we do insertions and deletions fast? These are local operations affecting only one list entry and the pointers of its "neighbors," i.e., the previous and next list elements. We want to be able to do many such operations concurrently, by different processors, while maintaining the link structure! Being careless and letting each processor act independently invites disaster; see Figure 1.3.
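A minimal C++ sketch of the array-backed list the lecture describes (field names are ours, not the lecture's):

```cpp
#include <cstddef>
#include <vector>

struct Entry {
    int    next = -1;       // array index of the next entry, -1 = none
    int    prev = -1;       // array index of the previous entry
    size_t size = 0;        // basic information kept inline (e.g. entry size)
    void*  data = nullptr;  // the lion's share of the data, anywhere in memory
    bool   used = false;    // a slot may or may not hold an entry
};

// Total size is a single sequential scan over the contiguous array,
// which is the "one go from memory" operation the lecture mentions.
size_t total_size(const std::vector<Entry>& list) {
    size_t sum = 0;
    for (const Entry& e : list)
        if (e.used) sum += e.size;
    return sum;
}
```

As the lecture warns, the scan is easy, but concurrent insertions and deletions into such a structure require coordination between processors; that is exactly the problem the coloring machinery is introduced to solve.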
  • Deterministic Execution of Multithreaded Applications
DETERMINISTIC EXECUTION OF MULTITHREADED APPLICATIONS FOR RELIABILITY OF MULTICORE SYSTEMS. Dissertation for the purpose of obtaining the degree of doctor at the Technische Universiteit Delft, by the authority of the Rector Magnificus Prof. ir. K. C. A. M. Luyben, chair of the Board for Doctorates, to be defended in public on Friday 19 June 2015 at 10:00 by Hamid MUSHTAQ, Master of Science in Embedded Systems, born in Karachi, Pakistan. This dissertation has been approved by the promotor: Prof. dr. K. L. M. Bertels. Copromotor: Dr. Z. Al-Ars. Composition of the doctoral committee: Rector Magnificus, chairman; Prof. dr. K. L. M. Bertels, Technische Universiteit Delft, promotor; Dr. Z. Al-Ars, Technische Universiteit Delft, copromotor. Independent members: Prof. dr. H. Sips, Technische Universiteit Delft; Prof. dr. N. H. G. Baken, Technische Universiteit Delft; Prof. dr. T. Basten, Technische Universiteit Eindhoven, Netherlands; Prof. dr. L. J. M. Rothkrantz, Netherlands Defence Academy; Prof. dr. D. N. Pnevmatikatos, Technical University of Crete, Greece. Keywords: Multicore, Fault Tolerance, Reliability, Deterministic Execution, WCET. The work in this thesis was supported by Artemis through the SMECY project (grant 100230). Cover image: an artist's impression of a view of Saturn from its moon Titan; the image was taken from http://www.istockphoto.com and used with permission. ISBN 978-94-6186-487-1. Copyright © 2015 by H. Mushtaq. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means without the prior written permission of the copyright owner.
  • Parallelizing the Data Cube
PARALLELIZING THE DATA CUBE. By Todd Eavis. Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at Dalhousie University, Halifax, Nova Scotia, June 27, 2003. © Copyright by Todd Eavis, 2003.

Dalhousie University, Department of Computer Science. The undersigned hereby certify that they have read and recommend to the Faculty of Graduate Studies for acceptance a thesis entitled "Parallelizing the Data Cube" by Todd Eavis in partial fulfillment of the requirements for the degree of Doctor of Philosophy. Dated: June 27, 2003. External Examiner: Virendra Bhavsar. Research Supervisor: Andrew Rau-Chaplin. Examining Committee: Qigang Gao, Evangelos Milios.

Dalhousie University. Date: June 27, 2003. Author: Todd Eavis. Title: Parallelizing the Data Cube. Department: Computer Science. Degree: Ph.D. Convocation: October. Year: 2003. Permission is herewith granted to Dalhousie University to circulate and to have copied for non-commercial purposes, at its discretion, the above title upon the request of individuals or institutions. Signature of Author. The author reserves other publication rights, and neither the thesis nor extensive extracts from it may be printed or otherwise reproduced without the author's written permission. The author attests that permission has been obtained for the use of any copyrighted material appearing in this thesis (other than brief excerpts requiring only proper acknowledgement in scholarly writing) and that all such use is clearly acknowledged.

To the two women in my life: Amber and Bailey.

Table of Contents
  List of Tables
  List of Figures
  Abstract
  Acknowledgements
  1 Introduction
    1.1 Overview of Primary Research
  • Easy PRAM-Based High-Performance Parallel Programming with ICE∗
Easy PRAM-based high-performance parallel programming with ICE∗. Fady Ghanim (1), Uzi Vishkin (1,2), and Rajeev Barua (1). (1) Electrical and Computer Engineering Department; (2) University of Maryland Institute for Advanced Computer Studies; University of Maryland, College Park, MD 20742, USA.

Abstract: Parallel machines have become more widely used. Unfortunately, parallel programming technologies have advanced at a much slower pace, except for regular programs. For irregular programs, this advancement is inhibited by high synchronization costs, non-loop parallelism, non-array data structures, recursively expressed parallelism, and parallelism that is too fine-grained to be exploitable. We present ICE, a new parallel programming language that is easy to program, since: (i) ICE is a synchronous, lock-step language; (ii) for a PRAM algorithm, its ICE program amounts to directly transcribing it; and (iii) the PRAM algorithmic theory offers a unique wealth of parallel algorithms and techniques. We propose ICE to be a part of an ecosystem consisting of the XMT architecture, the PRAM algorithmic model, and ICE itself, that together deliver on the twin goals of easy programming and efficient parallelization of irregular programs. The XMT architecture, developed at UMD, can exploit fine-grained parallelism in irregular programs. We built the ICE compiler, which translates the ICE language into the multithreaded XMTC language; the significance of this is that multi-threading is a feature shared by practically all current scalable parallel programming languages. Our main result is perhaps surprising: the run-time was comparable to XMTC, with a 0.48% average gain for ICE across all benchmarks. Also, as an indication of ease of programming, we observed a reduction in code size in 7 out of 11 benchmarks vs.
  • CS 2210: Theory of Computation
CS 2210: Theory of Computation. Spring, 2019.

Administrative Information: • Background survey • Textbook: E. Rich, Automata, Computability, and Complexity: Theory and Applications, Prentice-Hall, 2008. • Book website: http://www.cs.utexas.edu/~ear/cs341/automatabook/ • My Office Hours: Monday, 6:00-8:00pm, Searles 224; Tuesday, 1:00-2:30pm, Searles 222 • TAs: Anjulee Bhalla (Hours TBA, Searles 224); Ryan St. Pierre (Hours TBA, Searles 224)

What you can expect from the course: • How to do proofs • Models of computation • What's the difference between computability and complexity? • What's the Halting Problem? • What are P and NP? • Why do we care whether P = NP? • What are NP-complete problems? • Where does this make a difference outside of this class? • How to work the answers to these questions into the conversation at a cocktail party…

What I will expect from you: • Problem Sets (25%): problems given on Mondays and Wednesdays, due the next Monday, graded by the following Monday; a learning tool, not a testing tool; collaboration encouraged (more on this in the next slide) • Quizzes (15%) • Exams (2 non-cumulative, 30% each): closed book, closed notes, but you can bring in an 8.5 x 11 page with notes on both sides • Class participation: tiebreaker

Other Important Things: • Go to the TA hours • Study and work on problem sets in groups • Collaboration issues: Level 0 (in-class problems): no restrictions; Level 1 (homework problems): verbal collaboration, but individual write-ups; Level 2 (not used in this course): discussion with TAs only; Level 3 (exams): professor clarifications only

Right now… • What does it mean to study the "theory" of something? • Experience with theory in other disciplines? • Relationship to practice? "In theory, theory and practice are the same.
  • The Problem with Threads
The Problem with Threads. Edward A. Lee. Electrical Engineering and Computer Sciences, University of California at Berkeley. Technical Report No. UCB/EECS-2006-1. http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.html. January 10, 2006.

Copyright © 2006, by the author(s). All rights reserved. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Acknowledgement: This work was supported in part by the Center for Hybrid and Embedded Software Systems (CHESS) at UC Berkeley, which receives support from the National Science Foundation (NSF award No. CCR-0225610), the State of California Micro Program, and the following companies: Agilent, DGIST, General Motors, Hewlett Packard, Infineon, Microsoft, and Toyota.

The Problem with Threads. Edward A. Lee, Professor, Chair of EE, Associate Chair of EECS, EECS Department, University of California at Berkeley, Berkeley, CA 94720, U.S.A. [email protected]. January 10, 2006.

Abstract: Threads are a seemingly straightforward adaptation of the dominant sequential model of computation to concurrent systems. Languages require little or no syntactic changes to support threads, and operating systems and architectures have evolved to efficiently support them. Many technologists are pushing for increased use of multithreading in software in order to take advantage of the predicted increases in parallelism in computer architectures.
  • CUDA Toolkit 4.2 CURAND Guide
CUDA Toolkit 4.2 CURAND Guide. PG-05328-041_v01 | March 2012. Published by NVIDIA Corporation, 2701 San Tomas Expressway, Santa Clara, CA 95050.

Notice: ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS". NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all information previously supplied. NVIDIA Corporation products are not authorized for use as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks: NVIDIA, CUDA, and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright © 2005-2012 by NVIDIA Corporation. All rights reserved. Portions of the MTGP32 (Mersenne Twister for GPU) library routines are subject to the following copyright: Copyright © 2009, 2010 Mutsuo Saito, Makoto Matsumoto and Hiroshima University.
  • Computer Architecture: Dataflow (Part I)
Computer Architecture: Dataflow (Part I). Prof. Onur Mutlu, Carnegie Mellon University.

A Note on This Lecture: these slides are from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 22: Dataflow I. Video: http://www.youtube.com/watch?v=D2uue7izU2c&list=PL5PHm2jkkXmh4cDkC3s1VBB7-njlgiG5d&index=19

Some Required Dataflow Readings. Dataflow at the ISA level: Dennis and Misunas, "A Preliminary Architecture for a Basic Data Flow Processor," ISCA 1974; Arvind and Nikhil, "Executing a Program on the MIT Tagged-Token Dataflow Architecture," IEEE TC 1990. Restricted Dataflow: Patt et al., "HPS, a new microarchitecture: rationale and introduction," MICRO 1985; Patt et al., "Critical issues regarding HPS, a high performance microarchitecture," MICRO 1985.

Other Related Recommended Readings. Dataflow: Gurd et al., "The Manchester prototype dataflow computer," CACM 1985; Lee and Hurson, "Dataflow Architectures and Multithreading," IEEE Computer 1994. Restricted Dataflow: Sankaralingam et al., "Exploiting ILP, TLP and DLP with the Polymorphous TRIPS Architecture," ISCA 2003; Burger et al., "Scaling to the End of Silicon with EDGE Architectures," IEEE Computer 2004.

Today: Start Dataflow.

Data Flow Readings: Data Flow (I): Dennis and Misunas, "A Preliminary Architecture for a Basic Data Flow Processor," ISCA 1974; Treleaven et al., "Data-Driven and Demand-Driven Computer Architecture," ACM Computing Surveys 1982; Veen, "Dataflow Machine Architecture," ACM Computing Surveys 1986; Gurd et al., "The Manchester prototype dataflow computer," CACM 1985; Arvind and Nikhil, "Executing a Program on the MIT Tagged-Token Dataflow Architecture," IEEE TC 1990.
  • Instructor's Manual
Instructor's Manual, Vol. 2: Presentation Material. Behrooz Parhami. [Cover art: cartoon of interconnection topologies (CCC, mesh/torus, butterfly, hypercube, pyramid).]

This instructor's manual is for Introduction to Parallel Processing: Algorithms and Architectures, by Behrooz Parhami, Plenum Series in Computer Science (ISBN 0-306-45970-1, QA76.58.P3798), 1999, Plenum Press, New York (http://www.plenum.com). All rights reserved for the author. No part of this instructor's manual may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording, or otherwise, without written permission. Contact the author at: ECE Dept., Univ. of California, Santa Barbara, CA 93106-9560, USA ([email protected]).

Preface to the Instructor's Manual. This instructor's manual consists of two volumes. Volume 1 presents solutions to selected problems and includes additional problems (many with solutions) that did not make the cut for inclusion in the text Introduction to Parallel Processing: Algorithms and Architectures (Plenum Press, 1999) or that were designed after the book went to print. It also contains corrections and additions to the text, as well as other teaching aids. The spring 2000 edition of Volume 1 consists of the following parts (the next edition is planned for spring 2001): Vol. 1: Problem Solutions. Part I: Selected Solutions and Additional Problems. Part II: Question Bank, Assignments, and Projects. Part III: Additions, Corrections, and Other Updates. Part IV: Sample Course Outline, Calendar, and Forms. Volume 2 contains enlarged versions of the figures and tables in the text, in a format suitable for use as transparency masters.
  • Lawrence Berkeley National Laboratory Recent Work
Lawrence Berkeley National Laboratory Recent Work. Title: Parallel algorithms for finding connected components using linear algebra. Permalink: https://escholarship.org/uc/item/8ms106vm. Authors: Zhang, Y; Azad, A; Buluç, A. Publication Date: 2020-10-01. DOI: 10.1016/j.jpdc.2020.04.009. Peer reviewed.

Parallel Algorithms for Finding Connected Components using Linear Algebra. Yongzhe Zhang (a), Ariful Azad (b), Aydın Buluç (c). (a) Department of Informatics, The Graduate University for Advanced Studies, SOKENDAI, Japan; (b) Department of Intelligent Systems Engineering, Indiana University, Bloomington, IN, USA; (c) Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA.

Abstract: Finding connected components is one of the most widely used operations on a graph. Optimal serial algorithms for the problem have been known for half a century, and many competing parallel algorithms have been proposed over the last several decades under various different models of parallel computation. This paper presents a class of parallel connected-component algorithms designed using linear-algebraic primitives. These algorithms are based on a PRAM algorithm by Shiloach and Vishkin and can be designed using standard GraphBLAS operations. We demonstrate two algorithms of this class, one named LACC for Linear Algebraic Connected Components, and the other named FastSV, which can be regarded as LACC's simplification. With the support of the highly-scalable Combinatorial BLAS library, LACC and FastSV outperform the previous state-of-the-art algorithm by a factor of up to 12x for small to medium scale graphs. For large graphs with more than 50B edges, LACC and FastSV scale to 4K nodes (262K cores) of a Cray XC40 supercomputer and outperform previous algorithms by a significant margin.
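The hook-and-shortcut iteration that LACC and FastSV build on can be sketched sequentially in a few lines (our simplification of the Shiloach-Vishkin scheme; the paper's algorithms are parallel and expressed in GraphBLAS primitives):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Returns a label per vertex; two vertices get the same label
// if and only if they lie in the same connected component.
std::vector<int> connected_components(
        int n, const std::vector<std::pair<int, int>>& edges) {
    std::vector<int> f(n);
    for (int i = 0; i < n; ++i) f[i] = i;       // every vertex is its own label
    bool changed = true;
    while (changed) {
        changed = false;
        // Hooking: both endpoints of an edge adopt the smaller label.
        for (auto [u, v] : edges) {
            int m = std::min(f[u], f[v]);
            if (f[u] != m) { f[u] = m; changed = true; }
            if (f[v] != m) { f[v] = m; changed = true; }
        }
        // Shortcutting (pointer jumping): f[i] <- f[f[i]].
        for (int i = 0; i < n; ++i)
            if (f[f[i]] < f[i]) { f[i] = f[f[i]]; changed = true; }
    }
    return f;
}
```

In the linear-algebraic formulation, hooking becomes an elementwise min over the edge set (a sparse matrix-vector operation) and shortcutting becomes pointer jumping on the label vector, which is why both steps map naturally onto GraphBLAS primitives.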
  • Ece585 Lec2.Pdf
ECE 485/585 Microprocessor System Design. Lecture 2: Memory Addressing, 8086 Basics and Bus Timing, Asynchronous I/O Signaling. Zeshan Chishti, Electrical and Computer Engineering Dept., Maseeh College of Engineering and Computer Science. Source: lecture based on materials provided by Mark F.

Basic I/O, Part I. Outline for next few lectures: simple model of computation; memory addressing (alignment, byte order); 8088/8086 bus; asynchronous I/O signaling. Review of basic I/O: how is I/O performed (dedicated/isolated/direct I/O ports, memory-mapped I/O); how do we tell when an I/O device is ready or a command is complete (polling, interrupts); how do we transfer data (programmed I/O, DMA)?

Simplified model of a computer: a microprocessor (fetch, decode, execute) connected by control, data, and address paths to memory and to I/O devices (keyboard, mouse, video display, printer, hard disk drive, audio card, Ethernet, WiFi, CD R/W, DVD).

Memory addressing. Size of operands: bytes, words, long/double words, quadwords; 16-bit half word (Intel: word); 32-bit word (Intel: doubleword, dword); 64-bit double word (Intel: quadword, qword). Note: names are non-standard; a SUN SPARC word is 32 bits and a double is 64 bits. Alignment: can multi-byte operands begin at any byte address? Yes: non-aligned. No: aligned, in which case the low-order address bit(s) will be zero. [Figure: byte addresses 0x100-0x107 illustrating aligned vs. unaligned word, doubleword, and quadword operands; in Intel IA speak (i.e., word = 16 bits = 2 bytes), aligned word addresses end in 0, doubleword addresses in 00, and quadword addresses in 000.]

Memory operand alignment: why do we care? Unaligned memory references can cause multiple memory bus cycles for a single operand, and may also span cache lines, requiring multiple evictions and multiple cache line fills; this complicates memory system and cache controller design. Some architectures restrict addresses to be aligned. Even in architectures without alignment restrictions (e.g.
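The slide's rule that aligned addresses have zero low-order bits is a one-liner to check in code; a minimal sketch (ours, not from the lecture):

```cpp
#include <cstdint>
#include <cstdio>

// An address is aligned for a power-of-two operand size exactly when its
// low-order log2(size) bits are zero.
bool is_aligned(uint64_t addr, uint64_t size) {   // size must be a power of two
    return (addr & (size - 1)) == 0;
}

int main() {
    std::printf("%d\n", is_aligned(0x100, 2));  // 1: word-aligned (ends in 0)
    std::printf("%d\n", is_aligned(0x102, 4));  // 0: dword needs ...00
    std::printf("%d\n", is_aligned(0x100, 8));  // 1: qword needs ...000
}
```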