Viewpoint

Gordon Bell Photo illustration by Robert Vizzini

1995 Observations on Supercomputing Alternatives: Did the MPP Bandwagon Lead to a Cul-de-Sac?

For over a decade, government and the technical computing community have focused on achieving a teraflop speed supercomputer. In 1989, I predicted this goal would be reached in mid-1995 for a $30 million computer by using interconnected, “killer” complementary metal oxide semiconductor (CMOS) microprocessors [3–5]. The goal is likely to be reached in 1996 in a much more dramatic fashion than predicted because it is likely to be based on PC technology. Furthermore, by clustering PCs using System Area Nets (SANs), scalable computing can be widely available at low cost.

During 1995, Cray Research, Fujitsu, IBM, […], NEC, and Silicon Graphics introduced new technical computers. Intel announced the P6, a PC-compatible chip with a peak advertised performance (PAP) of 133Mflops to be raised to 266Mflops. In September, Sandia ordered a $45.6 million, 9,072 processor, 1.81Tflops computer using the chip, scheduled to be installed in November 1996, that will provide 39Kflops/dollar or 1.2Tflops at the $30 million supercomputer price in 1989. Adjusting for inflation allows the 1996 supercomputer price to rise to $40 million and gets 1.6Tflops. Compaq Computer and Tandem Computers announced scalable computer clusters based on P6 for the commercial market. Dongarra’s Survey of Technical Computing Sites shows that the world’s top 10 have installed peak capacity of about 850Gflops, all of which contain hundreds of computers.

Tera Computer, an ARPA-funded state computer company, went public with an initial public offering to raise money to complete its computer. In the same period, Thinking Machines, a state computer company, and Kendall Square Research, which offered massive parallelism with over 1,000 processors, filed for Chapter 11 but reemerged to offer software and systems based on interconnected workstations. In March, Cray Computer filed for Chapter 11, following the demise of ACRI, aka Stern Computer Company of Lyon, France. Convex, which uses Hewlett-Packard’s PA-RISC chips, was bought by Hewlett-Packard. Other small companies making parallel computers are certain to fail, while other companies, such as Digital Equipment, are still entering the market.

These events call for a look at how technical computing is now likely to evolve.

Five distinct computer structures are now vying for survival:

• Cray vector-style architecture consisting of multiple, vector processors that access a common memory and built from the fastest ECL (emitter coupled logic) and GaAs (gallium arsenide) circuit technology (Cray Research and NEC); Fujitsu and Hitachi have switched to CMOS but remain on this path.
• A […] or multicomputer formed from single, fast vector processor computers connected via a fast, high-capacity switch (Fujitsu). The vector processor is implemented in CMOS technology. NEC has also announced a CMOS vector processor operating at 2Gflops per node that can scale to 512 processors.
• Headless workstation clusters, or multicomputers, formed from workstation “killer” CMOS microprocessor computers connected via SANs that are proprietary, high-bandwidth, low-latency switches; the IBM SP2 uses stacks of workstations. UC/Berkeley is building clusters using off-the-shelf Sun Microsystems workstations interconnected via Myrinet’s high-bandwidth switch. Intel’s Paragon is formed from specially packaged, CMOS microprocessor computers connected via its high-bandwidth, low-latency switch. Tandem and Compaq have introduced clusters for the commercial market using Tandem’s ServerNet to interconnect four-processor Compaq computers.
• “Multis,” or multiple, CMOS microprocessors connected to large caches that access a common memory via a common bus (Cray Superserver using Sun SPARC micros, Power Challenge using MIPS micros) that I predicted to be computing’s “mainline” structure [2] and have limited scalability of about 10, although Cray’s Superserver uses 64 SPARC processors.
• Distributed shared-memory multiprocessors formed from workstation CMOS microprocessor or multimicroprocessor computers that communicate with one another via a proprietary, high-bandwidth, low-latency switch. Processors can access both local and remote memories as a multiprocessor (Convex, Cray). Silicon Graphics is following this path for scalability. Other companies are using the IEEE Scalable Coherent Interface (SCI) to build scalable multis with a single memory to simplify the […] and apps porting.

Figure 1. PAP* Gflops(t) for supers and MPPs for $30M (unless noted). Peak and # Proc. (in parentheses). *Peak Advertised Performance. [Plot not reproduced: PAP in Gflops on a logarithmic axis from 1 to 1,000 against year, 1990 to 2000. Labeled points include CM5, Cray T3D (1K), Cray T90, Fujitsu (35Gf/16), IBM SP2 (17Gf/64), SGI (4.8Gf/16), NEC, Intel (Sandia), Fujitsu at $100M (512), and the Bell Prize trend.]

Figure 1 shows performance measured in PAP for a $30 million expenditure, or roughly the cost of a supercomputer in the mid-1990s. Technical computing has evolved. Since 1990, ARPA’s High Performance Computing and Communication Initiative (HPCCI) has stimulated the market by developing, purchasing, and using highly parallel computers for scientific and technical computing. It is especially interesting to observe the effects of this effort as the teraflop quest continues.

From the details of the announcements and figure, I draw 13 major conclusions:

1. There is more diversity in computing alternatives than I predicted. While competition makes for lower hardware cost, it inhibits the attraction of apps software by independent software vendors. Cray (T90), Fujitsu, and NEC are continuing to evolve the supercomputer, utilizing existing apps. Fujitsu’s multicomputer is a cost-effective hybrid of the traditional super that enables existing apps to run effectively and be evolved. Silicon Graphics is evolving the workstation and compatible multi with a wide range of apps. Convex, Cray, IBM, Intel, and nCUBE are all trying to establish massively parallel processing (MPP) as a viable computer structure. IBM is likely to be successful based on its ability to fund commercial apps. Intel’s P6 microprocessor makes the PC the most likely candidate for the most cost-effective nodes in both the commercial and technical markets.

2. The computing industry has made impressive progress in developing parallel computers. More impressive is the fact that technical users have made progress in realizing the PAP for various apps as shown by the Bell Prize. The growth in apps performance by this measure has roughly doubled yearly, with the 1995 winner operating at 0.5Tflops using a specialized computer. The winning MPP operated at 179Gflops.

3. Price differences among the alternatives are often explained by differences in memory size and bandwidth. With computers, you get what you pay for. This rarely shows up in PAP, but appears downstream in RAP (real application performance) and occasionally on benchmarks. However, in 1995, most computers operated well on the Linpack benchmark, provided there was sufficient memory to scale the problem size and cover communication overhead.

4. CMOS has effectively replaced ECL and GaAs as the technology for building the highest-performance computers. Fujitsu’s CMOS vector processor has a higher PAP than Cray Research’s computers.

5. The Cray vector-style architecture is not dead and set to be replaced by multiple, slow CMOS workstation-style processors. The common wisdom within the U.S. academic community, which is the dominant receptor of research funding and sets the research and funding agenda, appears to have been wrong. The MPP bandwagon ran over vectors, replacing them with many interconnected “killer” micros used for workstations. These workstation micros are low cost and may be tuned for the benchmark du jour to provide high hype. MPP machines often perform poorly for problems where high bandwidth between processor and memory is required. It takes 8 to 10 of the fastest CMOS micros to equal a supercomputer vector processor in peak power. When used in parallel, power can be significantly reduced, depending on the computer (its memory and interconnectability) and problem granularity.

Most vector apps are unlikely to run on multicomputers for a long time. Silicon Graphics’ multi is more likely to provide parallelism for fine granularity even though its scalability and memory bandwidth are limited. Silicon Graphics has the largest market share for technical computing, even though it is not the fastest. Convex, Cray, Fujitsu, and NEC are supporting traditional supers and MPPs. Since it is unlikely that MPPs based on CMOS micros can take over supercomputer workloads, the transition, if it happens at all, is certain to be costly. It is more likely CMOS micros will approach the speed of supers because supers trade off vector speed for scalar speed.

6. The prediction by NEC and me [4, 5] that a 1Tflop, classical multiprocessor supercomputer would not be available until 2000 still seems possible, even though the T90 supercomputer isn’t quite on this trajectory. The difficulty is building a high-bandwidth, low-latency switch to connect processors and memories, since latency increases with bandwidth. A 1Tflop multiprocessor would require a switch of at least 16Tbytes per second to feed the vector units using the Cray formula.

7. No teraflop before its time. I predicted that a $30 million, 1Tflop computer would be available in 1995 [3–5], or by mid-1996 at the latest. The price of computation, using Thinking Machines’ CM5 PAP as a reference, is only increased by 50% with Cray’s T3D MPP. In 1992, I suggested waiting to purchase a $200 million 1Tflop ultracomputer from Thinking Machines [4, 5]. Based on its characteristics and the inevitable progression of technology, I argued that we should wait until the system could be available at a price of only $30 million.

Intel’s 1.8Tflops computer more than satisfies the wait. Intel provides a new high-water mark in performance and performance/price. P6 offers the power of the fastest workstation micros at a “commodity” PC price level of less than $10,000. In this fashion, future MPPs are likely to be more heavily based on the X86 architecture.

8. Thinking Machines and other competitors vanished. Government subsidies affected the ability to function in a competitive, public marketplace. Larger companies have since entered the market, and only recently have significant apps appeared. Government should stop the direct subsidy of computer design and associated targeted purchases. The best and perhaps only way I know of to develop an industry is through university research prototypes that go to start-ups or existing companies, and by the competitive purchase of new systems by leading-edge, government-funded users.

9. The price of supercomputers and MPPs has converged more than predicted. In 1992 the two differed by a factor of 10, and in 1996 the prediction is just three. More precisely, low-priced MPPs haven’t materialized since Thinking Machines left the market. Better supercomputers may be due to competition and to better fabrication techniques.

10. Cray Research has placed three bets, including its mainline vector multiprocessor (T90), SPARC-based multi (Cray Superserver), and Digital Equipment’s Alpha-based multicomputer (T3D, T3E). Cray has stated that it needs to converge its approach to parallelism and a common architecture. Convex is in a similar dilemma.

11. My prediction [4, 5] that MPPs will be built using a shared-memory multiprocessor architecture was optimistic. The multicomputer with multiple, independent computers interconnecting via a switch is the hardware structure for the foreseeable future to obtain the maximum peak power because it uses unmodified workstations. Software often manages and presents the structure as a single memory. Kendall Square Research provided the first scalable multiprocessors. Researchers are focused on the multiprocessor and have made progress. Other efforts are aimed at using SCI for building distributed shared-memory computers. The Convex and Cray (T3D, T3E) provide a shared memory but utilize it as a multicomputer. Silicon Graphics has a physically central memory for its multiprocessor.

12. In 1995, the world’s fastest installation was a multicomputer. The Japanese threat continues to materialize with Fujitsu’s VPP 300, which is significant for a number of reasons:

• It is an engineering compromise between a classical Cray multiple, vector processor supercomputer and an MPP;
• It is cost-effective measured by peak performance and several real apps;
• As the fastest vector processor, it is likely to outperform other supercomputers for single processor tasks;
• As a multicomputer, it can function as n independent computers to compete with supercomputers for workload;
• Because of a high-bandwidth switch and fastest nodes, it can outperform any of the MPPs, is more cost-effective than any of them, and has inherently lower overhead because fewer are needed;
• Having the vector architecture allows it to capitalize on the plethora of supercomputer apps developed over the last 20 years;
• It is aggressively priced (it is CMOS, and uses synchronous DRAMs) and scales from a cost of less than $500,000 to a projected cost of $100 million for teraflops by 2000;
• The low entry cost and scalability increase its market size so that it will compete across the technical marketplace from workstations to servers to mini-supercomputers and traditional supercomputers and the range of MPPs.

13. Berkeley’s NOW (Network of Workstations) [1] project connects workstations through either an ATM or Myrinet switch [7]. PCs and workstations with 1 to 4 processors, no overhead backplane but limited PAP, are the most cost-effective to manufacture. IBM’s SP2, based on uniprocessor workstations and its proprietary switch, belies this fact because its price of almost $100,000 per workstation is well above workstation-level prices. In contrast, Intel’s system is only $10,000 per dual-processor node. Significant opportunities exist based on the PC.

NOW is important for many reasons, including having independent manufacturers for the network (switches) and platforms that permit multivendor environments. Over time, we expect low-cost SANs to emerge. Myrinet’s switch and Tandem’s ServerNet [8] are candidates for standard switches. Jim Gray and I are predicating the future of computing based on a small number of standards for the SNAP (Scalable Network and Platforms) architecture [6].

I believe funding university purchases of NOW environments that either live in a single room or are distributed with users will prove to be a wise investment. The NOW structure will provide computing power and encourage the adoption of this paradigm. It would be desirable to have more standardized switches that are computer vendor independent and host multiple vendors. With a plethora of NOW environments, standards can form that will attract apps.

Conclusions

It is hard to be completely optimistic about U.S. supercomputing. It appears to be a small, vanishing market niched away by all kinds of computers. I see several options in addition to maintaining a “buy U.S.” policy. Cray, IBM, Intel, and Silicon Graphics have large, loyal customer bases, many apps, and inertia. Intel offers the real bright spot by providing a powerful PC that can challenge any workstation. Suppliers have time to validate or rethink future product strategies. MPP apps are still difficult and will only get easier with fewer platform environments. Government funders should ponder their role and question whether they helped or possibly misled companies, such as Cray Research, through funding and other pressures. The government’s role going forward is still crucial. The myriad of options should continue to keep the technical market vibrant (shaken up and alive) for a long time. The PC is likely to be the greatest change agent. It’s a great time to be a user.

References

1. Anderson, T.A., Culler, D.E., and Patterson, D.A. A case for NOW (network of workstations). IEEE Micro 15, 1 (Feb. 1995), 54–64.
2. Bell, C.G. Multis: A new class of multiprocessor computers. Science 228 (Apr. 26, 1985), 462–467.
3. Bell, G. The future of high performance computers in science and engineering. Commun. ACM 32, 9 (Sept. 1989), 1091–1101.
4. Bell, G. Ultracomputers: A teraflop before its time. Science 256 (Apr. 3, 1992), 64.
5. Bell, G. Ultracomputers: A teraflop before its time. Commun. ACM 35, 8 (Aug. 1992), 27–45.
6. Bell, G. and Gray, J. The SNAP (scalable network and platforms) architecture. Report, Mar. 1995.
7. Boden, N.J., et al. Myrinet: A gigabit-per-second local area network. IEEE Micro 15, 1 (Feb. 1995), 29–36.
8. Horst, R.W. TNet: A reliable system area network. IEEE Micro 15, 1 (Feb. 1995), 37–45.

Gordon Bell is a senior researcher at Microsoft Corporation and a computer industry consultant-at-large.

COMMUNICATIONS OF THE ACM March 1996/Vol. 39, No. 3
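The price/performance figures quoted for the Sandia order (a $45.6 million, 9,072 processor machine with a peak of about 1.81Tflops) can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function and variable names are hypothetical, and it assumes that the quoted peak is the total PAP of the machine and that each node holds two processors, as the "dual-processor node" remark in conclusion 13 suggests.

```python
# Figures quoted in the article for the Sandia order.
PRICE_USD = 45.6e6      # order price in dollars
PROCESSORS = 9_072      # processor count
PEAK_FLOPS = 1.81e12    # peak advertised performance (PAP), assumed to be the machine total


def flops_per_dollar(peak_flops: float, price_usd: float) -> float:
    """Peak flops bought per dollar of purchase price."""
    return peak_flops / price_usd


def peak_at_budget(budget_usd: float, rate: float) -> float:
    """Peak flops obtainable at a given budget, assuming linear scaling of price."""
    return budget_usd * rate


rate = flops_per_dollar(PEAK_FLOPS, PRICE_USD)
per_processor = PEAK_FLOPS / PROCESSORS
# Assumption: two processors per node, so half as many nodes as processors.
node_price = PRICE_USD / (PROCESSORS / 2)

print(f"{rate / 1e3:.1f} Kflops per dollar")
print(f"{peak_at_budget(30e6, rate) / 1e12:.2f} Tflops at $30 million")
print(f"{peak_at_budget(40e6, rate) / 1e12:.2f} Tflops at $40 million")
print(f"{per_processor / 1e6:.0f} Mflops peak per processor")
print(f"${node_price:,.0f} per dual-processor node")
```

Under these assumptions the results agree with the rounded values in the text: roughly 39Kflops per dollar, about 1.2Tflops at $30 million, about 1.6Tflops at $40 million, and close to $10,000 per dual-processor node. The implied per-processor peak falls between the 133Mflops and 266Mflops quoted for the P6.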

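Conclusion 6 states that a 1Tflop multiprocessor would need a switch of at least 16Tbytes per second "using the Cray formula". The article does not spell that formula out, so the sketch below only back-computes the ratio implied by the two quoted numbers. The 8-byte word size and the helper names are assumptions made for illustration.

```python
WORD_BYTES = 8  # assumption: 64-bit operands


def bytes_per_flop(bandwidth_bytes_per_s: float, flops: float) -> float:
    """Memory traffic the switch must carry for each floating-point operation."""
    return bandwidth_bytes_per_s / flops


def required_bandwidth(flops: float, ratio: float) -> float:
    """Switch bandwidth needed to sustain a given peak at a fixed bytes-per-flop ratio."""
    return flops * ratio


# 16 Tbytes/s for a 1 Tflop machine, as quoted in conclusion 6.
ratio = bytes_per_flop(16e12, 1e12)
words = ratio / WORD_BYTES

print(f"{ratio:.0f} bytes per flop, i.e. {words:.0f} words per flop")
print(f"{required_bandwidth(0.5e12, ratio) / 1e12:.0f} Tbytes/s for a 0.5 Tflop machine")
```

Read this way, the quoted requirement amounts to 16 bytes, or two 64-bit words, of memory traffic per peak flop, which is the kind of processor-to-memory bandwidth that conclusion 5 says MPPs built from workstation micros often lack.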