SPAA'21 Panel Paper: Architecture-Friendly Algorithms versus Algorithm-Friendly Architectures

Guy Blelloch
Computer Science, Carnegie Mellon University
Pittsburgh, PA
[email protected]

William Dally
NVIDIA
Santa Clara, CA
[email protected]

Margaret Martonosi
Computer Science, Princeton University
Princeton, NJ
[email protected]

Uzi Vishkin
University of Maryland Institute for Advanced Computer Studies (UMIACS)
College Park, MD
Contact author: [email protected]

Katherine Yelick
Electrical Engineering and Computer Sciences, University of California
Berkeley, CA
[email protected]

ABSTRACT
The current paper provides preliminary statements of the panelists ahead of a panel discussion at the ACM SPAA 2021 conference on the topic: algorithm-friendly architectures versus architecture-friendly algorithms.

CCS CONCEPTS
Architectures; Design and analysis of algorithms; Models of computation

KEYWORDS
Parallel algorithms; parallel architectures

ACM Reference format:
Guy Blelloch, William Dally, Margaret Martonosi, Uzi Vishkin and Katherine Yelick. 2021. SPAA'21 Panel Paper: Architecture-friendly algorithms versus algorithm-friendly architectures. In Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '21), July 6-8, 2021, Virtual Event, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3409964.3461780

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SPAA '21, July 6-8, 2021, Virtual Event, USA.
© 2021 Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8070-6/21/07...$15.00.
https://doi.org/10.1145/3409964.3461780

1 Introduction, by Uzi Vishkin

The panelists are:
- Guy Blelloch, Carnegie Mellon University
- Bill Dally, NVIDIA
- Margaret Martonosi, Princeton University
- Uzi Vishkin, University of Maryland
- Kathy Yelick, University of California, Berkeley

As Chair of the ACM SPAA Steering Committee and organizer of this panel discussion, I would like to welcome our diverse group of distinguished panelists. We are fortunate to have you. We were looking for a panel discussion of a topic central to SPAA that would benefit from the diversity of our distinguished panelists. We felt that the authentic diversity that each of the panelists brings is best reflected in preliminary statements, expressed before we start sharing our thoughts, and that it therefore merits the separate archival space that follows.

2 Preliminary statement, by Guy Blelloch

The Random Access Machine (RAM) model has served the computing community amazingly well as a bridge from algorithms, through programming languages, to machines. This can in large part be attributed to the superhuman achievements of computer architects in maintaining the RAM abstraction as technology has advanced. These efforts include a variety of approaches for maintaining the appearance of random access memory, including sophisticated caching and paging mechanisms, as well as many developments to extract parallelism from sequences of RAM instructions, including pipelining, register renaming, branch prediction, prefetching, out-of-order execution, etc.

Based on this effort, the RAM instructions used in abstract algorithms closely match actual machine instructions and language structures. It is easy to understand, for example, how the algorithmic concept of summing the elements of a sequence can be converted to a for loop at the language level, and to a sequence of RAM instructions roughly consisting of a load, an add to a register, an increment of a register, a compare, and a conditional jump. The RAM abstraction has greatly simplified developing efficient algorithms and code for sequential machines but, more importantly, has allowed us to educate innumerable students in the art of algorithm design, programming language design, programming, and application development. These students are now responsible for where we have reached in the field.

In all honesty, analyzing algorithms in the abstract RAM does not tell us much about actual runtime on a particular machine. But that is not its goal; instead, the goal of such a model is to lead algorithm and application developers in the right direction with a concrete and easily understandable model that greatly narrows the search space of reasonable algorithms. The RAM by itself, for example, does not capture the locality that is needed to make effective use of caches. In many cases this locality is a second-order effect---for example, when comparing an exponential- versus a linear-time algorithm. In some cases it is a first-order effect, but it is easy to add a one-level cache to the RAM model, and hundreds of algorithms have been developed in such a model. When algorithms developed in this model satisfy the property of being cache-oblivious, they will also work effectively on a multilevel cache.

The RAM, however, is inherently sequential, so unfortunately it fails when it comes to parallelism. We need a replacement (or replacements) that supports parallelism. As with the RAM, it is extremely important that this replacement be simple, so that it can be used to educate every computer scientist, as well as many others, in how to develop effective (parallel) algorithms and applications. As in the case of the RAM, supporting such simple models at the algorithmic and programming level will require superhuman achievements on the part of computer architects, some of which have already been achieved. To be clear, this does not mean that the instructions on the machine need to match, or even have an obvious mapping from, the model. There is plenty of room for languages to implement abstractions---for example, a scheduler that maps abstract tasks to actual processors. But it does mean there must be some clear translation of costs from the model to the machine, even if not a perfect predictor of runtime. The goal, again, is to guide the algorithm designer and programmer in the right direction. As with cost extensions of the RAM, it is fine if the models have refinements that account for locality, for example, but these should be simple extensions.

At least for multicore machines, there are parallel models that are simple, use simple constructs in programming languages, and support cost mappings down to the machine level that reasonably capture real performance. This includes the fork-join work-depth (or work-span) model. There are even reasonably simple extensions that support accounting for locality, as well as asymmetry in read-write costs. It seems the emphasis should be on supporting such simple models by adapting the hardware when possible, rather than trying to push onto the common user complicated and changing models that capture the technology du jour. Without such simple models we will not be able to educate the vast majority of the next generation of students in how to effectively design algorithms and applications for future machines.

2.1 Bio

Guy Blelloch is a Professor of Computer Science at Carnegie Mellon University. He received a BA in Physics and a BS in Engineering from Swarthmore College in 1983, and a PhD in Computer Science from MIT in 1988. He has been on the faculty at Carnegie Mellon since 1988, and served as Associate Dean of Undergraduate Studies from 2016 to 2020.

His research contributions have been at the intersection of practical and theoretical considerations in parallel algorithms and programming languages. His early work on implementations and algorithmic applications of the scan (prefix sums) operation has become influential in the design of parallel algorithms for a variety of platforms. His work on the work-span (or work-depth) view for analyzing parallel algorithms has helped develop algorithms that are both theoretically and practically efficient. His work on the Nesl programming language developed the ideas of program-based cost models and nested-parallel programming. His work on parallel garbage collection was the first to show bounds on both time and space. His work on graph-processing frameworks, such as Ligra, GraphChi, and Aspen, has set a foundation for large-scale parallel graph processing. His recent work on analyzing the parallelism in incremental/iterative algorithms has opened a new view of parallel algorithms---i.e., taking sequential algorithms and understanding that they are actually parallel when applied to inputs in a random order. Blelloch is an ACM Fellow.

3 Preliminary statement, by Bill Dally

Function and space-time mapping (F&M) for harmonious algorithms and architectures

It is easy to make architectures and algorithms mutually friendly once we replace two obsolete concepts, centralized serial program execution and the RAM or PRAM model of execution cost, with new models that are aligned with modern technology. With the appropriate models we can design both programmable and domain-specific architectures that are extremely efficient and easily understood targets for programmers. At the same time, we can design algorithms with predictable and high performance on these architectures.

The laws of physics dictate that modern computing engines (programmable and special-purpose) are largely communication-limited. Their performance and energy consumption are almost entirely determined by the movement of data. This includes memory access: reading or writing a bit-cell is extremely fast and efficient, and all the cost in accessing memory is data movement. Arithmetic and logical operations are much less expensive. In 5nm technology, an add costs about 0.5fJ/bit, and a 32-bit add takes about 200ps.

[...] such as map, reduce, gather, scatter, and shuffle can be used by many programs to realize common communication patterns. Programmers who don't want to bother with mapping can use a default mapper -- with results no worse than with today's
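Blelloch's summing example can be made concrete. Below is a minimal sketch (in Python rather than machine code; the function name `seq_sum` is mine, not from the paper) with each step of the loop annotated with the RAM instruction it roughly corresponds to:

```python
def seq_sum(a):
    """Sum a sequence with an explicit loop, mirroring the RAM-level
    instruction sequence: load, add to a register, increment a
    register, compare, and conditional jump."""
    s = 0                 # register holding the running sum
    i = 0                 # register holding the index
    while i < len(a):     # compare; conditional jump
        s += a[i]         # load a[i]; add to the sum register
        i += 1            # increment the index register
    return s

print(seq_sum([3, 1, 4, 1, 5]))  # prints 14
```

A compiler turns the for/while loop into essentially this instruction sequence, which is exactly the close match between abstract RAM steps and actual machine instructions that the statement describes.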
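The fork-join work-depth (work-span) model that Blelloch cites can be illustrated with a divide-and-conquer sum: the two recursive calls are independent, so a fork-join runtime may run them in parallel, with a scheduler mapping the abstract tasks onto processors, which is the kind of language-level abstraction his statement mentions. A hedged sketch (the name `fj_sum` and the use of `ThreadPoolExecutor` as a stand-in for a real work-stealing scheduler are my own):

```python
from concurrent.futures import ThreadPoolExecutor

def fj_sum(a, lo, hi):
    """Divide-and-conquer sum in the fork-join style.
    Work: W(n) = 2 W(n/2) + O(1) = Theta(n).
    Span (depth): S(n) = S(n/2) + O(1) = Theta(log n)."""
    if hi - lo == 1:          # base case: a single element
        return a[lo]
    mid = (lo + hi) // 2
    # "Fork" the two independent halves, then "join" before adding.
    # A real fork-join runtime would use one work-stealing scheduler
    # rather than spawning a fresh pool at every recursion level.
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(fj_sum, a, lo, mid)
        right = pool.submit(fj_sum, a, mid, hi)
        return left.result() + right.result()

print(fj_sum(list(range(10)), 0, 10))  # prints 45
```

The work/span figures in the docstring are the cost mapping the model provides: with P processors a greedy scheduler runs this in roughly W/P + S time, without the programmer ever naming a processor.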
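Dally's numbers can be turned into a back-of-the-envelope comparison. The 0.5 fJ/bit add cost is from the text; the off-chip movement cost used below is NOT from the paper, only a commonly cited ballpark (on the order of a picojoule per bit or more), so treat the resulting ratio as illustrative of the "communication-limited" claim, not as a measurement:

```python
# Stated in the text: an add costs ~0.5 fJ per bit in 5nm technology.
ADD_FJ_PER_BIT = 0.5
add32_fj = 32 * ADD_FJ_PER_BIT        # ~16 fJ for one 32-bit add
print(add32_fj)                        # prints 16.0

# Assumed, not from the paper: moving one bit off chip costs on the
# order of 1 pJ = 1000 fJ (a conservative, commonly cited ballpark).
OFFCHIP_FJ_PER_BIT = 1000.0
move32_fj = 32 * OFFCHIP_FJ_PER_BIT   # ~32,000 fJ to fetch one operand
print(move32_fj / add32_fj)            # prints 2000.0: movement dominates
```

Even with this conservative assumption, fetching an operand from off-chip memory costs roughly three orders of magnitude more energy than adding it, which is the sense in which performance and energy are "almost entirely determined by the movement of data."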
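The communication primitives listed above (map, reduce, gather, scatter, shuffle) each name a data-movement pattern. Two of them, gather and scatter, can be sketched in plain Python (function names mine; the point of a real implementation is to map these index patterns efficiently onto the machine's interconnect, not the per-element logic shown here):

```python
def gather(src, idx):
    """Gather: out[i] = src[idx[i]].
    Each output position pulls a value from an arbitrary source index."""
    return [src[j] for j in idx]

def scatter(vals, idx, n):
    """Scatter: out[idx[i]] = vals[i].
    Each input value is pushed to an arbitrary destination index
    (unwritten positions stay zero; duplicate indices are unresolved
    races in a parallel setting)."""
    out = [0] * n
    for v, j in zip(vals, idx):
        out[j] = v
    return out

print(gather([10, 20, 30, 40], [3, 0, 0]))  # prints [40, 10, 10]
print(scatter([7, 8], [2, 0], 4))           # prints [8, 0, 7, 0]
```

Because the index vector fully describes where every value travels, a runtime (or a default mapper) can plan the communication for the whole operation at once instead of issuing one random access at a time.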