Introduction to Parallel Processing: Algorithms and Architectures

PLENUM SERIES IN COMPUTER SCIENCE
Series Editor: Rami G. Melhem, University of Pittsburgh, Pittsburgh, Pennsylvania

FUNDAMENTALS OF X PROGRAMMING
Graphical User Interfaces and Beyond
Theo Pavlidis

INTRODUCTION TO PARALLEL PROCESSING
Algorithms and Architectures
Behrooz Parhami

Behrooz Parhami
University of California at Santa Barbara, Santa Barbara, California

KLUWER ACADEMIC PUBLISHERS
NEW YORK, BOSTON, DORDRECHT, LONDON, MOSCOW

eBook ISBN 0-306-46964-2
Print ISBN 0-306-45970-1

©2002 Kluwer Academic Publishers, New York, Boston, Dordrecht, London, Moscow

All rights reserved. No part of this eBook may be reproduced or transmitted in any form or by any means, electronic, mechanical, recording, or otherwise, without written consent from the Publisher.

Created in the United States of America

Visit Kluwer Online at: http://www.kluweronline.com
and Kluwer's eBookstore at: http://www.ebooks.kluweronline.com

To the four parallel joys in my life, for their love and support.

Preface

THE CONTEXT OF PARALLEL PROCESSING

The field of digital computer architecture has grown explosively in the past two decades. Through a steady stream of experimental research, tool-building efforts, and theoretical studies, the design of an instruction-set architecture, once considered an art, has been transformed into one of the most quantitative branches of computer technology. At the same time, better understanding of various forms of concurrency, from standard pipelining to massive parallelism, and invention of architectural structures to support a reasonably efficient and user-friendly programming model for such systems, has allowed hardware performance to continue its exponential growth. This trend is expected to continue in the near future.

This explosive growth, linked with the expectation that performance will continue its exponential rise with each new generation of hardware and that (in stark contrast to software) computer hardware will function correctly as soon as it comes off the assembly line, has its down side. It has led to unprecedented hardware complexity and almost intolerable development costs. The challenge facing current and future computer designers is to institute simplicity where we now have complexity; to use fundamental theories being developed in this area to gain performance and ease-of-use benefits from simpler circuits; to understand the interplay between technological capabilities and limitations, on the one hand, and design decisions based on user and application requirements on the other.

In computer designers’ quest for user-friendliness, compactness, simplicity, high performance, low cost, and low power, parallel processing plays a key role. High-performance uniprocessors are becoming increasingly complex, expensive, and power-hungry. A basic trade-off thus exists between the use of one or a small number of such complex processors, at one extreme, and a moderate to very large number of simpler processors, at the other. When combined with a high-bandwidth, but logically simple, interprocessor communication facility, the latter approach leads to significant simplification of the design process. However, two major roadblocks have thus far prevented the widespread adoption of such moderately to massively parallel architectures: the interprocessor communication bottleneck and the difficulty, and thus high cost, of algorithm/software development.

The above context is changing because of several factors. First, at very high clock rates, the link between the processor and memory becomes very critical. CPUs can no longer be designed and verified in isolation.
Rather, an integrated processor/memory design optimization is required, which makes the development even more complex and costly. VLSI technology now allows us to put more transistors on a chip than required by even the most advanced superscalar processor. The bulk of these transistors are now being used to provide additional on-chip memory. However, they can just as easily be used to build multiple processors on a single chip. Emergence of multiple-processor microchips, along with currently available methods for glueless combination of several chips into a larger system and maturing standards for parallel machine models, holds the promise for making parallel processing more practical.

This is the reason parallel processing occupies such a prominent place in computer architecture education and research. New parallel architectures appear with amazing regularity in technical publications, while older architectures are studied and analyzed in novel and insightful ways. The wealth of published theoretical and practical results on parallel architectures and algorithms is truly awe-inspiring. The emergence of standard programming and communication models has removed some of the concerns with compatibility and software design issues in parallel processing, thus resulting in new designs and products with mass-market appeal. Given the computation-intensive nature of many application areas (such as encryption, physical modeling, and multimedia), parallel processing will continue to thrive for years to come.

Perhaps, as parallel processing matures further, it will start to become invisible. Packing many processors in a computer might constitute as much a part of a future computer architect’s toolbox as pipelining, cache memories, and multiple instruction issue do today.
In this scenario, even though the multiplicity of processors will not affect the end user or even the professional programmer (other than of course boosting the system performance), the number might be mentioned in sales literature to lure customers in the same way that clock frequency and cache size are now used. The challenge will then shift from making parallel processing work to incorporating a larger number of processors, more economically and in a truly seamless fashion.

THE GOALS AND STRUCTURE OF THIS BOOK

The field of parallel processing has matured to the point that scores of texts and reference books have been published. Some of these books that cover parallel processing in general (as opposed to some special aspects of the field or advanced/unconventional parallel systems) are listed at the end of this preface. Each of these books has its unique strengths and has contributed to the formation and fruition of the field.

The current text, Introduction to Parallel Processing: Algorithms and Architectures, is an outgrowth of lecture notes that the author has developed and refined over many years, beginning in the mid-1980s. Here are the most important features of this text in comparison to the listed books:

1. Division of material into lecture-size chapters. In my approach to teaching, a lecture is a more or less self-contained module with links to past lectures and pointers to what will transpire in the future. Each lecture must have a theme or title and must proceed from motivation, to details, to conclusion. There must be smooth transitions between lectures and a clear enunciation of how each lecture fits into the overall plan. In designing the text, I have strived to divide the material into chapters, each of which is suitable for one lecture (1–2 hours). A short lecture can cover the first few subsections, while a longer lecture might deal with more advanced material near the end. To make the structure hierarchical, as opposed to flat or linear, chapters have been grouped into six parts, each composed of four closely related chapters (see diagram on page xi).

2. A large number of meaningful problems. At least 13 problems have been provided at the end of each of the 24 chapters. These are well-thought-out problems, many of them class-tested, that complement the material in the chapter, introduce new viewing angles, and link the chapter material to topics in other chapters.

3. Emphasis on both the underlying theory and practical designs. The ability to cope with complexity requires both a deep knowledge of the theoretical underpinnings of parallel processing and examples of designs that help us understand the theory. Such designs also provide hints/ideas for synthesis as well as reference points for cost–performance comparisons. This viewpoint is reflected, e.g., in the coverage of problem-driven parallel machine designs (Chapter 8) that point to the origins of the butterfly and binary-tree architectures. Other examples are found in Chapter 16 where a variety of composite and hierarchical architectures are discussed and some fundamental cost–performance trade-offs in network design are exposed. Fifteen carefully chosen case studies in Chapters 21–23 provide additional insight and motivation for the theories discussed.

4. Linking parallel computing to other subfields of computer design. Parallel computing is nourished by, and in turn feeds, other subfields of computer architecture and technology. Examples of such links abound. In computer arithmetic, the design of high-speed adders and multipliers contributes to, and borrows many methods from, parallel processing. Some of the earliest parallel systems were designed by researchers in the field of fault-tolerant computing in order to allow independent multichannel computations and/or dynamic replacement of failed subsystems. These links are pointed out throughout the book.

5. Wide coverage of important topics. The current text covers
