Massively Parallel Computers: Why Not Parallel Computers for the Masses?

Gordon Bell

Abstract

During the 1980s the computer engineering and science research community generally ignored parallel processing. With a focus on high performance computing embodied in the massive 1990s High Performance Computing and Communications (HPCC) program, which has the short-term goal of teraflop peak performance using a network of thousands of computers, everyone with even a passing interest in parallelism is involved with the massive parallelism "gold rush". Funding-wise the situation is bright; applications-wise massive parallelism is microscopic. While there are several programming models, the mainline is data parallel Fortran. However, new algorithms are required, negating the decades of progress in algorithms. Thus, utility will no doubt be the Achilles heel of massive parallelism.

The Teraflop: 1992

The quest for the Teraflops Supercomputer, to operate at a peak speed of 10^12 floating-point operations per second, is almost a decade old, and only one, three-year computer generation from being fulfilled. To accelerate its development would require an ultracomputer. First generation ultracomputers are networked computers using switches that interconnect 1000s of computers to form a multicomputer, and cost $50 to $300 million in 1992. These scalable computers are also classified as massively parallel, since they can be configured to have more than 1000 processing elements in 1992. Unfortunately, such computers are specialized, since only highly parallel, coarse grain applications, requiring algorithm and program development, can exploit them. Government purchase of such computers would be foolish, since waiting three years will allow computers with a peak speed of a teraflop to be purchased at supercomputer prices ($30 million), caused by advancements in semiconductors and the intense competition resulting in "commodity supercomputing". More importantly, substantially better computers will be available in 1995 in the supercomputer price range if the funding that would be wasted in buying such computers is spent on training and software to exploit their power.

In 1989 I described the situation in high performance computers, including several parallel architectures that could deliver teraflop power by 1995, but with no price constraint. I felt SIMDs and multicomputers could achieve this goal. A shared memory multiprocessor looked infeasible then. Traditional, multiple vector processor supercomputers such as Crays would simply not evolve to a teraflop until 2000. Here's what happened.

1. During the first half of 1992, NEC's four processor SX3 is the fastest computer, delivering 90% of its peak of 22 Gflops for the Linpeak benchmark, and Cray's 16 processor YMP C90 has the greatest throughput.

2. The SIMD hardware approach of Thinking Machines was abandoned because it was only suitable for a few, very large scale problems, barely multiprogrammed, and uneconomical for workloads. It's unclear whether large SIMDs are "generation" scalable, and they are clearly not "size" scalable. The CM2 achieved 10 Gigaflops of performance for some large problems.

3. Ultracomputer sized, scalable multicomputers (smC) were introduced by Intel and Thinking Machines, using "killer" CMOS, 32-bit microprocessors that offer over 100 Gflops in the supercomputer price range. These product introductions join multicomputers from Alliant (now defunct), AT&T, IBM, Intel, Meiko, Mercury, NCUBE, Parsytec, Transtech, etc. At least Convex, Cray, Fujitsu, IBM, and NEC are working on new generation smCs that use 64-bit processors. The Intel Delta multicomputer with 500 computers achieved a peak speed of over 10 Gflops and routinely achieves several Gflops on real applications using O(100) computers.

By 1995, this score of efforts, together with the evolution of fast, LAN-connected workstations, will create "commodity supercomputing". Workstation clusters formed by interconnecting high speed workstations via new high speed, low overhead switches, in lieu of special purpose multicomputers, are advocated.

4. Kendall Square Research introduced their KSR 1, a scalable, shared memory multiprocessor (smP) with 1088 64-bit microprocessors. It provides a sequentially consistent memory and programming model, proving that smPs are feasible. A multiprocessor provides the greatest and most flexible ability for workload, since any processor can be deployed on either scalar or parallel (e.g. vector) applications, and is general purpose, being equally useful for scientific and commercial processing, including transaction processing, databases, real time, and command and control. The KSR machine is most likely the blueprint for future scalable, massively parallel computers.

University research is directed primarily at smPs and at software to enable smCs to operate as smPs.

The net result of the quest for parallelism, as chronicled by the Gordon Bell Prize, is that applications evolved from 1 (1987) to 10 (1990) to my estimate of 100 Gflops (1993) in roughly 3 year, or 1 generation, increments, or 115% per year, and will most likely achieve 1 teraflop in 1995, for a factor of 1000 increase. Moore's law applied to speed accounts for a 60% annual increase in performance, a factor of 4 every 3 years; the remaining factor of 2.5, or 36% per year, is due to parallelism. This is in line with a prognostication I made in 1985 that getting a factor of 100 speedup from parallelism in a decade would be a major accomplishment.
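The growth-rate decomposition above can be checked with a few lines of arithmetic; the sketch below is only illustrative and uses the rounded annual rates quoted in the text:

    # Illustrative check of the growth-rate decomposition (rounded rates from the text).
    moore = 1.60       # ~60% per year attributed to semiconductor (Moore's law) speed gains
    parallel = 1.36    # ~36% per year attributed to increased parallelism
    combined = moore * parallel
    print(f"combined annual growth: {combined:.2f}x per year (~115%)")
    print(f"per 3-year generation:  {combined ** 3:.1f}x (one 10x step)")
    print(f"Moore's-law share per generation: {moore ** 3:.1f}x (the 'factor of 4 every 3 years')")
    print(f"parallelism share per generation: {parallel ** 3:.1f}x (the remaining 'factor of 2.5')")
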
Supers evolve with a 4-5 year gestation period and micro-based scalable computers have a 3 year gestation. In order to reach a teraflop in 1993, it is necessary to spend $250 million for an ultracomputer. Meiko has announced a multicomputer capable of delivering 1 Teraflop (32 bits) for $50 million in 1993. In 1992, petaflop (10^15 flops) ultracomputers, costing a half billion dollars, do not look feasible by 2001.

The irony of the teraflop quest is that programming may not change very much, even though virtually all programs must be rewritten to exploit the high degree of parallelism that is required to achieve peak speed and for efficient operation of the coarse grain, scalable computers. Scientists and engineers will use just another dialect of Fortran, i.e. High Performance Fortran (HPF), that supports data parallel programming. Even though it is only a dialect, algorithms and programs must be reconsidered to deal with the idiosyncrasies inherent in running on large, scalable multicomputers. The flaw in the 1992-1995 generation of multicomputers is that large codes are likely to run very poorly and will have to be completely rewritten even if the main loops are parallelized. Montry conjectures that the parallelism in a program varies with the size of the problem and inversely with the program's size. For example, if with modest effort 93% of a large program, representing 5-10% of the program size, can be parallelized to run infinitely fast, it is likely that the remaining 90% of the code will slow down by a factor of 10 due to the computer's idiosyncrasies, resulting in only a factor of 10 overall.

The 1992-1995 Generation of Computers

By mid-1992 a completely new generation of computers has been introduced. Understanding a new generation and redesigning it to be less flawed takes at least three years. Understanding this generation should make it possible to build the next generation supercomputer class machine that would reach a teraflop of peak power for a few, large scale applications by the end of 1995. Unlike previous computer generations, which have been able to draw on experience from other computer classes and user experience (the market) to create their next generation, virtually no experience exists for the design of next generation massively parallel computers. This experience can only come with use. So far the lessons have only been what most of us have believed: a shared memory is needed, and multiprocessors provide significant advantages and are thus the main line of computer evolution (Bell, 1991).

Table 1 shows six alternatives for high performance computing, ranging across two traditional supers, one smP, and three "commodity supers" or smCs, including 1000 workstations. Three metrics characterize a computer's performance and workload abilities. Linpeak is the operation rate for solving a system of linear equations and is the best case for a highly parallel application. Large, well programmed applications typically run at 1/4-1/2 this rate. Linpack 1K x 1K is typical of problems solved on supercomputers in 1992. The Livermore Fortran Kernels (LFK) harmonic mean for 24 loops and 3 sizes is used to characterize a numerical computer's ability, and is the worst-case rating for a computer as it represents an untuned workload.

New generation, traditional or "true" multiple vector processor supercomputers provide 1/4 to 1/8th the peak power of the smCs to be delivered in 1992. "True" supercomputers use the Cray design formula: ECL circuits and dense packaging technology to reduce size and allow the fastest clock; one or more pipelined vector units with each processor to provide peak processing for a Fortran program; and multiple vector processors communicating via a switch to a common, shared memory to handle large workloads and parallel processing. Because of the dense physical packaging of high power chips and the relatively low density of the 100,000 gate ECL chips, the inherent cost per operation for a supercomputer is roughly 500-1000 peak flops/$, or 4-10 times greater than that of simply packaged, 2 million transistor "killer" CMOS microprocessors that go into leading edge workstations (5000 peak flops/$). True supercomputers are not in the teraflops race, even though they are certain to provide most of the supercomputing capacity until 1995.

Whether traditional supercomputers or massively parallel computers provide more computing, measured in flops/month, by 1995 is the object of a bet between the author and Danny Hillis of Thinking Machines (Hennessy and Patterson, 1990).

Table 1a. Physical Characteristics of 1992 Supercomputing Alternatives

                         # procs       Mhz   Peak Gflops   Price $M   Memory Gbytes   Gbytes/s
Cray C90                 16            250   16            30         2 (16)**        13.6
NEC SX3 (R series)       4             400   25.6          25??       8 (64)          5.4
KSR 1                    1088          20    43.5          30         34.8            15.3
Scalable multicomputers:
Intel Paragon§           4096          50    300           55         128             --
TMC CM5†                 Cc+Cio+1024   33    128           30         32+             --
Workstation cluster:
DEC Alpha                1024          150   150           20         32              100

*Author's estimate.  **Fast access blocked, secondary memory.
§Available Q1 1993. Four processors form a multiprocessor node; a fifth processor handles communication.
†Cc := control computers; Cio := I/O computers.
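The peak flops-per-dollar comparison quoted earlier (roughly 500-1000 for "true" supers versus about 5000 for killer-CMOS machines) can be recomputed from the peak and price columns of Table 1a; the sketch below is only illustrative and uses the table values as transcribed:

    # Peak cost-effectiveness implied by Table 1a: (peak Gflops, price $M) per system.
    systems = {
        "Cray C90":          (16.0,  30.0),
        "NEC SX3":           (25.6,  25.0),
        "KSR 1":             (43.5,  30.0),
        "Intel Paragon":     (300.0, 55.0),
        "TMC CM5":           (128.0, 30.0),
        "DEC Alpha cluster": (150.0, 20.0),
    }
    for name, (peak_gflops, price_millions) in systems.items():
        peak_flops_per_dollar = peak_gflops * 1e9 / (price_millions * 1e6)
        print(f"{name:18s} {peak_flops_per_dollar:7,.0f} peak flops/$")
    # The "true" supers land near the 500-1000 peak flops/$ cited in the text;
    # the CMOS-micro-based machines land near (or above) the ~5000 flops/$ figure.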

Table 1b. Workload Characteristics of 1992 Supercomputing Alternatives

                          Streams   Linpeak           Linpack 1Kx1K   LFK workload
                          (#jobs)   Gflops (size)     Gflops          #jobs x Mflops
Cray C90                  16        13.7 (4K)         9.7             16 x 44
NEC SX3 (345 Mhz clock)   4         20 (6K)           13.4            4 x 39
KSR 1                     1088      .513 (32 proc.)   --              1088 x 6.6
Scalable multicomputers:
Intel Paragon             1024      13.9 (25K)/.5K    ??              1K x 6*
TMC CM5                   32 Cc's   70*               32 x ??         132 x 6*
Workstation cluster:
DEC Alpha                 1024      64                --              1K x 15

*Author's estimate.  ??Unavailable.
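The LFK workload figure in Table 1b is a harmonic mean over kernel rates. The sketch below uses hypothetical per-kernel Mflops values, not actual Livermore Fortran Kernel results, to show why a harmonic mean behaves as the worst-case, untuned-workload rating described earlier:

    # Harmonic mean of per-kernel rates, the aggregation behind the LFK workload rating.
    # The rates below are hypothetical placeholders, not actual LFK measurements.
    rates_mflops = [44.0, 12.0, 95.0, 7.5, 30.0]
    harmonic = len(rates_mflops) / sum(1.0 / r for r in rates_mflops)
    arithmetic = sum(rates_mflops) / len(rates_mflops)
    print(f"harmonic mean:   {harmonic:5.1f} Mflops (dominated by the slowest kernels)")
    print(f"arithmetic mean: {arithmetic:5.1f} Mflops (flatters the fastest kernels)")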

Table 2. Contemporary Microprocessor Performance

Micro           Year   Clock (Mhz)   Specmarks (int, FP) and Mflops (Linpack, peak)
DEC VAX 780     78.2   5*            1   1   1   0.15   0.16   1
DEC Alpha       92.2   200           150T   85T   20T   200T
Fujitsu VP      92.4   50            >50   95   12.5   108
HP PA           92.2   100           138   56   67   -   200
HP PA           91.1   66            52   78   102   24   13.3   66
IBM RS6000      91.2   42            33   120   72   27   70   14.5   83
Intel 486 PC    91.3   50            28   15   19   2.0   1.8
MIPS R3000      88.3   25            18   4.2   7   3.6   8
MIPS R4000      92.2   100**         60   77   70   17.5   36   11.5   50
Sun Sparc 2     91.3   40            22   27   25   4.1
1995 Micro      95     200-400       300   400-800

*CISC architecture; a comparable RISC architecture would operate at approx. 2 Mhz.
**External clock rate is 50 Mhz.
T Estimate.

Scalable multicomputers (smCs) are applicable to coarse grain, highly parallel codes, and someone must invent new algorithms and write new programs. Universities are rewarded with grants, papers, and the production of knowledge. Hence, they are a key to utilizing coarse grain, parallel computers. With pressure to aid industry, the Department of Energy laboratories can embrace massive parallelism in order to maintain budgets and staffs. On the other hand, any organization concerned with cost-effectiveness simply can't afford the rewrite effort for one-of-a-kind computers unless it obtains uniquely competitive capabilities and is prepared to support unique software on transient computers.

"Killer" CMOS Micros for building scalable computers

Progress toward the affordable teraflop using "killer" CMOS micros is determined by advances in microprocessor speeds. The projection (Bell, 1989) that microprocessors would improve in speed at a 60% per year rate following Moore's Law appears to be possible for the next few years (Table 2). Moore's Law states that semiconductor density quadruples every 3 years, which explains memory chip size evolution. Memory size can grow proportionally with processor performance, even though memory bandwidth is not keeping up. Since clock speed only improves at 25% per year (a doubling in 3 years), the additional speed must come from architectural features (e.g. superscalar or wider words, larger cache memories, and vector processing).

The leading edge microprocessors described at the 1992 International Solid State Circuits Conference included a microprocessor based on Digital's Alpha architecture with a 150 or 200 Mhz clock rate, and the Fujitsu 108 (64-bit) / 216 (32-bit) Mflop chip. The Fujitsu chip would provide the best performance for traditional supercomputer oriented problems. Perhaps the most important improvement to enhance massive parallelism is the 64-bit address, in order that a computer can have a large global address space. With 64-bit addresses and substantially faster networks, some of the limitations of message-passing multicomputers can be overcome. Dally's J-machine at MIT provides critical primitives in the processor architecture that deal with communication among computers, such that operating system and library software for multicomputers can be written to enable them to look like the multiprocessors they are evolving to simulate. These primitives should provide for the elimination of the message passing primitives that found their way into the programming paradigm, and a return to a Fortran dialect.

In 1995, $20,000 distributed computing nodes using microprocessors with peak speeds of 400-800 Mflops can provide 20,000-40,000 flops/$. For example, such chips are a factor of 12-25 times faster than the vector processor chips used in the CM5 and would be 4.5-9 times more cost-effective.

Scalability

The perception that a computer can grow forever has always been a design goal, e.g. the IBM System/360 (c. 1964) provided a 100:1 range, and the VAX existed at a range of 1000:1 over its lifetime. Size scalability simply means that a very large computer such as the ultracomputer can be built. Typical definitions fail to recognize cost, efficiency, and whether such a large scale computer is practical (affordable) in a reasonable time scale. Ideally, one would start with a single computer and buy more components as needed. System size scalability has been the main outcome of the search for the teraflop.

Similarly, when new processor technology increases performance, one would add new generation computers in a generations scalable fashion. All characteristics of a computer must scale proportionally: processing speed, memory speed and sizes, interconnect bandwidth and latency, I/O, and software overhead, in order to be useful for a given application. Ordinary workstations provide some size and generation scalability, but are LAN-limited. By providing suitable high speed switching, workstation clusters can supply power and are an alternative to scalable multicomputers.

Problem scalability is the ability of a problem, algorithm, or program to exist at a range of sizes so that it can be used efficiently and correctly on a given, scalable computer. In practical terms, problem scalability means that a program can be made large enough to operate efficiently on a computer with a given granularity.

Worlton (1992b) points out the need for a very large fraction, F, of a given program to be parallel when using a large number of processors, N, to obtain high efficiency, E(F,N):

    E(F,N) = 1 / (F + N x (1 - F))

Thus, scaling up slow processors is a losing proposition for a given fraction of parallelism. For 1000 processors, F must be 0.999 for 50% efficiency. The critical question about size scalability is whether sufficient applications or problem scalability exists to merit procurement of a large scale system.

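A quick numeric check of the efficiency expression above; the F = 0.999 case is the one cited in the text, and the other values are included only for contrast:

    # E(F, N) = 1 / (F + N*(1 - F)): efficiency with parallel fraction F on N processors,
    # as in the expression from Worlton (1992b) quoted above.
    def efficiency(parallel_fraction, processors):
        return 1.0 / (parallel_fraction + processors * (1.0 - parallel_fraction))

    for f in (0.90, 0.99, 0.999):
        print(f"N=1000, F={f}: efficiency = {efficiency(f, 1000):.3f}")
    # F = 0.999 yields ~0.500, the 50% efficiency figure cited for 1000 processors.
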
Conjectures about the Future

In 1987, as Assistant Director of NSF's CISE, I argued that reasonable goals would be to achieve speedups of factors of 10 and 100 due to parallelism, excluding vectorization, by '90 and '95, using conventional mPs and scalable computers. I also argued that the limits were primarily training, because the microprocessor would provide the low cost, high performance components, including the option of idle, high speed workstations. Thus, let me prognosticate:

1. The mainline general purpose computers will continue to be multiprocessors in three forms: supercomputers, mainframes, and scalable mPs. The current scalable multicomputers will all evolve and become multiprocessors, but with limited coherent memories in their next generation.

2. Both mainframes and supers will have attached scalable multicomputer clusters, including workstation clusters with O(1000) computers, to achieve the 100 Gflop level and to reduce the cost of bulk or massively parallel computation.

3. Workstations will (should) supply a substantial amount of the parallel power, including those attached to supers and mainframes for massive parallelism. LLNL made the observation that it spends about three times as much on workstations, which are only 15% utilized, as it does on supercomputers. By 1995, workstations could reach a peak of 500 Mflops, providing 25,000 flops per dollar, or 10 times the projected cost-effectiveness of a super. This would mean that, inherent in its spending, LLNL would have about 25 times more unused peak power in its workstations than it has in its central supercomputer or any massively parallel computers it might have (the arithmetic is sketched after these conjectures).

4. The amount of parallelism will be 30-100, achieving peak speeds of 10 Gflops for the limited applications that run on the scalable multiprocessors. Thus, progress in massive parallelism[1] will be very slow.

5. Programming will be done in HPF Fortran. Programming environments such as Linda, PVM, and Parasoft's Express will be routinely used for large applications.

6. Training and applications will still be the dominant limiters to massive parallelism.

7. The amount of parallelism for scientific applications that can run on massively parallel computers is comparatively small, as shown in the following table giving my estimates of application parallelism. Note that fast workstations and workstation clusters can handle a large fraction of applications.

   Computer            Degree of parallelism   % of programs
   Scalar              no parallelism          60
   Vector & smPs       fine grain              15
   Vector/MP & smPs    medium grain             5
   Scalable mCs        coarse grain             5
   W/S clusters        very coarse grain       15

8. Algorithms have improved faster than clock speed over the last 15 years. Coarse grain computers are unlikely to be able to take advantage of these advances because they require new programs and new algorithms.

9. The cost and time to rewrite major applications for one-of-a-kind machines is sufficiently large to make them uneconomical. Each massively parallel computer exists as a hierarchy of bottlenecks, with no two machines having close enough characteristics to enable a program to be run on different computers without significant tuning and rewriting.

Thus, $1 invested in rewriting and maintaining software for a given machine buys exactly $1 of software. In comparison, $1 invested in workstation software buys $100-$10,000 of software, and up to $1 million for a PC. For example, software selling for $20K may cost $2M-$200M to write. Despite the existence of massively parallel computers, no major CFD, finite element, or computational chemistry software packages run as production codes.

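The LLNL estimate in conjecture 3 composes three rough factors. The back-of-the-envelope sketch below uses the round numbers from the text and is not based on LLNL data:

    # Back-of-the-envelope check of the LLNL estimate in conjecture 3.
    workstation_to_super_spending = 3.0   # LLNL spends ~3x as much on workstations as on supers
    cost_effectiveness_advantage = 10.0   # projected 1995 workstation advantage in flops per dollar
    unused_fraction = 0.85                # workstations are only ~15% utilized
    unused_peak_ratio = workstation_to_super_spending * cost_effectiveness_advantage * unused_fraction
    print(f"unused workstation peak / central super peak ~ {unused_peak_ratio:.1f}x")  # ~25x, as in the text
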
Concerns About Real Progress

As an author of the network report and the HPCC, the teraflop search is of concern to me: a focus on unbalanced and "pap"[2] teraflops, using multicomputers, with minimal focus on applications, training, and programmability; lack of need or demand to drive development; wasting resources to accelerate and select computer structures instead of letting natural evolution of the species occur; destruction of the high performance (i.e. supercomputing) industry by government sponsored multicomputer design, followed by mandated purchases; and finally, negligible progress on a high performance network and its applications - the initial reason for the HPCC.

Worlton (1992a) describes the potential risk of massive parallelism in terms of the "bandwagon effect", where a community makes its biggest mistakes. He defines "bandwagon" as "a propaganda device by which the purported acceptance of an idea, product or the like by a large number of people is claimed in order to win further public acceptance." He describes a massively parallel bandwagon drawn by vendors, researchers, and bureaucrats who gain power by increased funding; innovators and early adopters are the rider-drivers. He also believes the bandwagon's four flat tires are caused by the lack of: systems software, skilled programmers, guideposts (heuristics about design and use), and parallelizable applications.

[1] Massive is defined in 1992 as either between 100 and 1000, or more than 1000, processing elements, processors, or computers, depending on whether the computer is a SIMD, smP, or smC. Alternatively, massive can be defined as either at or ahead of the state-of-the-art, or as a computer at the limit of the size that is considered to be economically viable.
[2] Defined by Worlton as peak announced performance.

We've seen tremendous improvement in the ability to exploit a massive number of computers working together on a problem, beginning with Sandia's applications on a 1K node NCUBE computer (Dongarra et al., 1991) that obtained a factor of 600 speedup, and is estimated to have a speedup of 1000 provided the nodes had more memory. The Sandia results showed that problems were scalable. During 1992, a multicomputer such as the CM5 or a collection of traditional supercomputers should be able to operate at a rate of over 100 Gflops for a problem.

These two laws of massive parallelism govern progress:

Some problem can be scaled to a sufficient size such that an arbitrary network of computers can run at their collected peak speed, given enough programming time and effort; but this problem may be unrelated to any other problem or workload.

The first law of massive parallelism is the foundation for massive marketing that supports massive budgets that support the search for massive parallelism.

References

Bell, G., The Future of High Performance Computers in Science and Engineering, Communications of the ACM, Vol. 32, No. 9, September 1989, 1091-1101.

Bell, G., Three Decades of Multiprocessors, in R. Rashid (editor), CMU Computer Science: 25th Anniversary Commemorative, ACM Press, Addison-Wesley Publishing, Reading, Mass., 1991, 3-27.

Dongarra, J. J., Karp, A., Miura, K., and Simon, H. D., Gordon Bell Prize Lectures, in Proceedings Supercomputing 91 (1991), 328-337.

Hennessy, J. L., and Patterson, D. A., Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Mateo, CA, 1990.

Worlton, J., The MPP Bandwagon, Supercomputing Review, to be published. (1992a)

Worlton, J., A Critique of "Massively" Parallel Computing, Worlton & Associates Technical Report No. 41, Salt Lake City, UT, May 1992. (1992b)
