
We can trace the evolution from Crays, to clusters, to supercomputing centers. But where does it go from here?

What's Next in High-Performance Computing?

GORDON BELL AND JIM GRAY

Although there is vibrant computer architecture activity on microprocessors and on high-end cellular architectures, we appear to be entering an era of supercomputing monoculture. Investing in next-generation software and hardware supercomputer architecture is essential to improve the efficiency and efficacy of systems.

After 50 years of building high-performance scientific computers, two major architectures exist: clusters of Cray-style vector supercomputers; and clusters of scalar uni- and multiprocessors. Clusters are in transition from massively parallel computers and clusters running proprietary software, to proprietary clusters running standard software, and to do-it-yourself Beowulf clusters built from commodity hardware and software. In 2001, only five years after its introduction, Beowulf mobilized a community around a standard architecture and tools. Beowulf's economics and sociology are poised to kill off the other architectural lines—and will likely affect traditional supercomputer centers as well.

Peer-to-peer and Grid communities are beginning to provide significant advantages for addressing parallel problems and sharing vast numbers of files. The Computational Grid can federate systems into supercomputers far beyond the power of any current computing center. The centers will become super-data and super-application centers. While these trends make high-performance computing much less expensive and much more accessible, there is a dark side. Clusters perform poorly on applications that require large shared memory.

High performance comes from parallelism, fast, dense circuitry, and packaging technology. In the 1960s, Seymour Cray introduced parallel instruction execution using parallel and pipelined function units (CDC 6600, 7600), and by 1975 a vector register processor architecture (Cray 1). These were the first production supercomputers. By 1982, Cray Research had synthesized the multiprocessor (XMP) structure and vector processor to establish the modern supercomputer architecture. That architecture worked extremely well with Fortran because the innermost loops could be carried out by a few pipelined vector instructions, and multiple processors could execute the outermost loops in parallel. Several manufacturers adopted this architecture for large machines (for example, Fujitsu, Hitachi, IBM, and NEC), while others built and delivered mini-supercomputers, aka "Crayettes" (Alliant, Ardent, and Convex), in the early 1980s. In 2001, Cray-style supercomputers remain a significant part (10%) of the market and are vital for applications with fine-grain parallelism on a shared memory (for example, legacy climate modeling and crash codes). Single-node vector supers have a maximum performance. To go beyond that limit, they must be clustered.
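That Fortran mapping, where the innermost loops become a few pipelined vector instructions and the processors split the outermost loop, is easiest to picture as a loop nest. The sketch below is a minimal illustration in C: the array-update kernel, the sizes, and the OpenMP pragma are stand-ins chosen here for clarity, not code or directives taken from the machines discussed.

    /* Minimal sketch of the loop shape Cray-style machines exploited.
     * Inner loop (j): a simple vector update that a vectorizing compiler
     *                 streams through pipelined vector load/multiply/add/store.
     * Outer loop (i): independent iterations that a multiprocessor splits
     *                 among its processors (the pragma is only an annotation).
     */
    #include <stdio.h>

    #define ROWS 1024
    #define COLS 1024

    static double y[ROWS][COLS], x[ROWS][COLS];

    int main(void)
    {
        const double alpha = 3.0;

        #pragma omp parallel for            /* outermost loop: one slice per processor */
        for (int i = 0; i < ROWS; i++) {
            for (int j = 0; j < COLS; j++)  /* innermost loop: vectorized/pipelined */
                y[i][j] += alpha * x[i][j];
        }

        printf("y[0][0] = %g\n", y[0][0]);  /* keep the loops from being optimized away */
        return 0;
    }

A vectorizing compiler keeps the function units busy on the inner loop while each processor takes a share of the outer iterations; loop-dominated codes with this shape are exactly the fine-grain, shared-memory workloads on which vector supers still excel.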
It has been clear since the early 1980s that clusters of CMOS-based killer micros would eventually challenge the performance of the vector supers with much better price/performance and an ability to scale to thousands of processors and memory banks. By 1985, companies such as Encore and Sequent began building shared-memory multiple-microprocessors with a single shared bus that allowed any processor to access all connected memories. Combining a cache with the microprocessor reduced memory traffic by confining memory accesses locally and by providing a mechanism to observe all memory transactions. By snooping the bus transactions, a single coherent memory image could be preserved. Bell predicted that all future computers or computer nodes would be multis [2]. A flurry of new multi designs emerged to challenge custom bipolar and ECL minicomputers and mainframes.

A cluster is a single system comprised of interconnected computers that communicate with one another either via message passing or by direct, internode memory access using a single address space. In a cluster, internode communication is 10-1,000 times slower than intranode memory access. Clusters with over 1,000 processors were called massively parallel processors, or MPPs. A constellation connotes clusters made up of nodes that are multis of more than 16 processors. However, parallel software rarely exploits the shared-memory aspect of nodes, especially if it is to be portable across clusters.
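A minimal sketch makes the message-passing half of that definition concrete. The ring-passing program below assumes only a standard MPI implementation reachable through mpicc and mpirun; the token value and the ring pattern are invented for illustration.

    /* Minimal message-passing sketch: pass a token once around a ring of nodes.
     *   mpicc ring.c -o ring
     *   mpirun -np 4 ./ring
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, token;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this node's identity */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of nodes      */

        if (size < 2) {                         /* a ring needs two or more nodes */
            MPI_Finalize();
            return 0;
        }

        if (rank == 0) {
            token = 42;                         /* originate the token */
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("token made %d internode hops\n", size);
        } else {
            /* every hop is an explicit internode message, 10-1,000 times
               slower than a memory reference within the node */
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

On a shared-memory node the same exchange would be an ordinary load and store; code written in this portable style deliberately ignores that advantage, which is the trade-off noted above.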
Tandem introduced its 16-node, uniprocessor cluster architecture in 1975, followed in 1983 by Digital VAXClusters and Teradata's 1,024-node database machine. This was followed by the IBM Sysplex and SP2 in the early 1990s. By the late 1990s most manufacturers had evolved their micro-based products to be clusters or multicomputers [3]—the only known way to build an arbitrarily large, scalable computer system. In the late 1990s, SGI pioneered large, non-uniform memory access (NUMA) shared-memory clusters.

In 1983 ARPA embarked on the Strategic Computing Initiative (SCI) to research, design, build, and buy exotic new, scalable computer architectures. About 20 research efforts and 40 companies were funded by ARPA to research and build scalable computers to exploit the new technologies. By the mid-1990s, nearly all of these efforts had failed. The main benefit was increased effort in scalability and parallelism that helped shift the market to the coarse-grain parallelism required by a cluster.

Several other forces aided the transition to the cluster architecture. Clusters were helped by exorbitant tariffs and by policies that prevented U.S. government agencies from purchasing Japanese supercomputers. Low-cost clusters empowered users to find an alternative to hard-to-use, proprietary, and expensive architectures.

The shift from vectors to micro-based clusters can be quantified by comparing the Top500 machines in 1993 with 2001.¹ Clusters and constellations from Compaq, Cray, HP, IBM, SGI, and Sun comprise 90% of the Top500. IBM supplied 42% of the 500, including the fastest (12.3 Tflops peak, with 8,192 processors) and the slowest (96 Gflops peak, with 64 processors). Vector supercomputers, including clustered supers from Fujitsu, Hitachi, and NEC, comprise only 10%. NEC's 128-processor clustered vector supercomputer operates at a peak of 1.28 Tflops. Based on the ratio of their peak speeds (10 Gflops per NEC vector processor versus roughly 1.5 Gflops per microprocessor in the fastest IBM system), one vector processor is equal to 6-8 microprocessors. Although supers' peak advertised performance (PAP) is very expensive, their real applications performance (RAP) can be competitive with, or better than, clusters on some applications. Shared-memory computers deliver RAP of 30-50% of the PAP; clusters typically deliver 5-15% [1].

¹The Top500 is a worldwide roster of the most powerful computers as measured by Linpack; see www.Top500.org.

High-performance computing has evolved into a small, stable, high-priced market for vector supers and constellations. This allows suppliers to lock customers into a unique hardware-software environment, for example, PowerPC/Linux or SPARC/Solaris. Proprietary environments allow vendors to price systems at up to $30K per microprocessor versus $3K per slice for commodity microprocessors, and to maintain the margins needed to fund high-end diseconomies of scale.

Enter Beowulf
The 1993 Beowulf Project goal was to satisfy NASA's requirement for a 1 Gflops workstation costing less than $50,000. The idea was to use commercial off-the-shelf (COTS) hardware and software configured as a cluster of machines. In 1994, a 16-node, $40,000 cluster built from Intel 486 computers achieved that goal. In 1997, a Beowulf cluster won the Gordon Bell performance/price prize. By 2000, several thousand-node Beowulf computers were operating. In June 2001, 28 Beowulfs were in the Top500, and the Beowulf population is estimated to be several thousand. High schools can now buy and assemble a Beowulf using the recipe "How to Build a Beowulf" [6].

Beowulf is mostly about software. The success of Beowulf clusters stems from the unification of public-domain parallel tools and applications for the scientific software community. It builds on decades of parallel-processing research and on many attempts to apply loosely coupled computers to a variety of applications. Some of the components include:

• Message passing interface (MPI) programming model;
• Parallel virtual machine (PVM) programming, execution, and debugging model;
• Parallel file system;
• Tools to configure, schedule,

The decision of where and how to compute is a combination of cost, performance, availability (for example, resource allocation, application program, ease of access, and service), the applications focus and dataset support, and the need or desire for individual control.

Economics is a key Beowulf advantage. The hardware and software are much less expensive. Centers add a cost factor of 2 to 5. Indeed, a center's costs are explicit: space (air conditioning, power, and raised floors for wiring and chilled-air ducts), networking, and personnel for administration, system maintenance, consulting, and so on. A center's explicit costs are implicit when users build and operate their own centers, because homegrown centers ride free on their organizational overhead, which includes space, networks, and especially personnel.

Sociology is an equally important Beowulf advantage. Its standards-setting and community nature,