viewpoint
Gordon Bell

1995 Observations on Supercomputing Alternatives: Did the MPP Bandwagon Lead to a Cul-de-Sac?

Photo illustration by Robert Vizzini
COMMUNICATIONS OF THE ACM March 1996/Vol. 39, No. 3

For over a decade, government and the technical computing community have focused on achieving a teraflop speed supercomputer. In 1989, I predicted this goal would be reached in mid-1995 for a $30 million computer by using interconnected, “killer” complementary metal oxide semiconductor (CMOS) microprocessors [3–5]. The goal is likely to be reached in 1996 in a much more dramatic fashion than predicted because it is likely to be based on PC technology. Furthermore, by clustering PCs using System Area Nets (SANs), scalable computing can be widely available at low cost.

During 1995, Cray Research, Fujitsu, IBM, Intel, NEC, and Silicon Graphics introduced new technical computers. Intel announced the P6, a PC-compatible chip with a peak advertised performance (PAP) of 133Mflops, to be raised to 266Mflops. In September, Sandia ordered a $45.6 million, 9,072-processor, 1.81Tflops computer using the chip. Scheduled to be installed in November 1996, it will provide 39Kflops/dollar, or 1.2Tflops at the $30 million supercomputer price of 1989. Adjusting for inflation allows the 1996 supercomputer price to rise to $40 million and yields 1.6Tflops. Compaq Computer and Tandem Computers announced scalable computer clusters based on the P6 for the commercial market. Dongarra’s Survey of Technical Computing Sites shows that the world’s top 10 have an installed peak capacity of about 850Gflops, all of which contain hundreds of computers.

Tera Computer, an ARPA-funded state computer company, went public with an initial public offering to raise money to complete its computer. In the same period, Thinking Machines, a state computer company, and Kendall Square Research, which offered massive parallelism with over 1,000 processors, filed for Chapter 11 but reemerged to offer software and systems based on interconnected workstations. In March, Cray Computer filed for Chapter 11, following the demise of ACRI, aka Stern Computer Company of Lyon, France. Convex, which uses Hewlett-Packard’s PA-RISC chips, was bought by Hewlett-Packard. Other small companies making parallel computers are certain to fail, while other companies, such as Digital Equipment, are still entering the market.

These events call for a look at how technical computing is now likely to evolve.

Five distinct computer structures are now vying for survival:

• Cray vector-style architecture supercomputers consisting of multiple vector processors that access a common memory and are built from the fastest ECL (emitter coupled logic) and GaAs (gallium arsenide) circuit technology (Cray Research and NEC); Fujitsu and Hitachi have switched to CMOS but remain on this path.
• A computer cluster or multicomputer formed from single, fast vector processor computers connected via a fast, high-capacity switch (Fujitsu). The vector processor is implemented in CMOS technology. NEC has also announced a CMOS vector processor operating at 2Gflops per node that can scale to 512 processors.
• Headless workstation clusters, or multicomputers, formed from workstation “killer” CMOS microprocessor computers connected via SANs that are proprietary, high-bandwidth, low-latency switches; the IBM SP2 uses stacks of workstations. UC/Berkeley is building clusters using off-the-shelf Sun Microsystems workstations interconnected via Myrinet’s high-bandwidth switch. Intel’s Paragon is formed from specially packaged CMOS microprocessor computers connected via its high-bandwidth, low-latency switch. Tandem and Compaq have introduced clusters for the commercial market using Tandem’s ServerNet to interconnect Compaq 4-processor computers.
• “Multis,” or multiple CMOS microprocessors connected to large caches that access a common memory via a common bus (Cray Superserver using Sun SPARC micros, Silicon Graphics Power Challenge using MIPS micros), which I predicted to be computing’s “mainline” structure [2] and which have limited scalability of about 10, although Cray’s Superserver uses 64 SPARC processors.
• Distributed shared-memory multiprocessors formed from workstation CMOS microprocessor or multimicroprocessor computers that communicate with one another via a proprietary, high-bandwidth, low-latency switch. Processors can access both local and remote memories as a multiprocessor (Convex, Cray). Silicon Graphics is following this path for scalability. Other companies are using the IEEE Scalable Coherent Interface (SCI) to build scalable multis with a single memory to simplify the operating system and apps porting.

Figure 1 shows performance measured in PAP for a $30 million expenditure, or roughly the cost of a supercomputer in the mid-1990s. Technical computing has evolved. Since 1990, ARPA’s High Performance Computing and Communication Initiative (HPCCI) has stimulated the market by developing, purchasing, and using highly parallel computers for scientific and technical computing. It is especially interesting to observe the effects of this effort as the teraflop quest continues.

Figure 1. PAP* Gflops(t) for supers and MPPs for $30M (unless noted). Peak and # Proc. (in parentheses). *Peak Advertised Performance. [Plot not reproduced: PAP in Gflops, 1 to 1000 on a logarithmic axis, against year, 1990 to 2000. Labeled points include Fujitsu (35Gf/16), IBM SP2 (17Gf/64), and SGI (4.8Gf/16).]

From the details of the announcements and figure, I draw 13 major conclusions:

1. There is more diversity in computing alternatives than I predicted. While competition makes for lower hardware cost, it inhibits the attraction of apps software by independent software vendors. Cray (T90), Fujitsu, and NEC are continuing to evolve the supercomputer, utilizing existing apps. Fujitsu’s multicomputer is a cost-effective hybrid of the traditional super that enables existing apps to run effectively and be evolved. Silicon Graphics is evolving the workstation and compatible multi with a wide range of apps. Convex, Cray, IBM, Intel, and nCUBE are all trying to establish massively parallel processing (MPP) as a viable computer structure. IBM is likely to be successful based on its ability to fund commercial apps. Intel’s P6 microprocessor makes the PC the most likely candidate for the most cost-effective nodes in both the commercial and technical markets.

2. [...] developing parallel computers. More impressive is the fact that technical users have made progress in realizing the PAP for various apps, as shown by the Bell Prize. The growth in apps performance by this measure has roughly doubled yearly, with the 1995 winner operating at 0.5Tflops using a specialized computer. The winning MPP operated at 179Gflops.

3. Price differences among the alternatives are often explained by differences in memory size and bandwidth. With computers, you get what you pay for. This rarely shows up in PAP, but appears downstream in RAP (real application performance) and occasionally on benchmarks. However, in 1995, most computers operated well on the Linpack benchmark, provided there was sufficient memory to scale the problem size and cover communication overhead.

4. CMOS has effectively replaced ECL and GaAs as the technology for building the highest-performance computers. Fujitsu’s CMOS vector processor has a higher PAP than Cray Research’s computers.

5. The Cray vector-style architecture is not dead, to be replaced by multiple, slow CMOS workstation-style processors. The common wisdom within the U.S. academic community, which is the dominant receptor of research funding and sets the research and funding agenda, appears to have been wrong. The MPP bandwagon ran over vectors, replacing them with many interconnected “killer” micros used for workstations. These workstation micros are low cost and may be tuned for the benchmark du jour to provide high hype. MPP machines often perform poorly for problems where high bandwidth between [...] fastest CMOS micros to equal a supercomputer vector processor in peak power. When used in parallel, power can be significantly reduced, depending on the computer (its memory and interconnectability) and problem granularity.

Most vector apps are unlikely to run on multicomputers for a long time. Silicon Graphics’ multi is more likely to provide parallelism for fine granularity even though its scalability and memory bandwidth are limited. Silicon Graphics has the largest market share for technical computing, even though it is not the fastest. Convex, Cray, Fujitsu, and NEC are supporting traditional supers and MPPs. Since it is unlikely that MPPs based on CMOS micros can take over supercomputer workloads, the transition, if it happens at all, is certain to be costly. It is more likely CMOS micros will approach the speed of supers because supers trade off vector speed for scalar speed.

6. The prediction by NEC and me [4, 5] that a 1Tflop, classical multiprocessor supercomputer would not be available until 2000 still seems possible, even though the T90 supercomputer isn’t quite on this trajectory. The difficulty is building a high-bandwidth, low-latency switch to connect processors and memories, since latency increases with bandwidth. A 1Tflop multiprocessor would require a switch of at least 16Tbytes per second to feed the vector units using the Cray formula.

7. No teraflop before its time. I predicted that a $30 million, 1Tflop computer would be available in 1995 [3–5], or by mid-1996 at the latest. The price of computation, using Thinking Machines’ CM5 PAP as a reference, is only increased by 50% with Cray’s T3D MPP.