VIRTUAL ROUNDTABLE

The 30th Anniversary of the Supercomputing Conference: Bringing the Future Closer—Supercomputing History and the Immortality of Now

Jack Dongarra, University of Tennessee, Oak Ridge National Laboratory, and University of Manchester
Vladimir Getov, University of Westminster
Kevin Walsh, University of California, San Diego

A panel of experts discusses historical reflections on the past 30 years of the Supercomputing (SC) conference, its leading role for the professional community, and some exciting future challenges.

Supercomputing's nascent era was borne of the late 1940s and 1950s Cold War and increasing tensions between the East and the West; the first installations—which demanded extensive resources and manpower beyond what private corporations could provide—were housed in university and government labs in the United States, United Kingdom, and the Soviet Union. Following the Institute for Advanced Study (IAS) stored program architecture, these so-called von Neumann machines were implemented as the MANIAC at Los Alamos Scientific Laboratory, the Atlas at the University of Manchester, the ILLIAC at the University of Illinois, the BESM machines at the Soviet Academy of Sciences, the Johnniac at The Rand Corporation, and the SILLIAC in Australia.

By 1955, private industry joined in to support these initiatives: the IBM User Group, SHARE, was formed; the Digital Equipment Corporation was founded in 1957, while IBM built an early wide area computer network, SAGE (Semi-Automatic Ground Environment); and the RCA 501, using all-transistor logic, was launched in 1958.

MACHINES OF THE FUTURE

"The advanced arithmetical machines of the future will be electrical in nature, and they will perform at 100 times present speeds, or more. Moreover, they will be far more versatile than present commercial machines, so that they may readily be adapted for a wide variety of operations. They will be controlled by a control card or film, they will select their own data and manipulate it in accordance with the instructions thus inserted, they will perform complex arithmetical computations at exceedingly high speeds, and they will record results in such form as to be readily available for distribution or for later further manipulation."
—"As We May Think," by Vannevar Bush (The Atlantic, July 1945)

In 1988, the first IEEE/ACM SC conference was held in Kissimmee, Florida. At that time, custom-built vector mainframes were the norm; the Cray Y-MP was a leading machine of the day, with a peak performance of 333 Mflops per processor, and could be equipped with up to eight processors. Users typically accessed the machine over a dumb terminal at 9600 baud; there was no visualization; a single programmer would code and develop everything; there were few tools or software libraries; and we relied on remote batch job submission.

Today, as we approach SC's 30th anniversary, large commodity massively parallel platforms are the norm. The HPC community has developed parallel debuggers and rich tool sets for code share and reuse. Remote access to several systems at once is made possible by scientific gateways, accessed over 10 and 100 Gbps networks. High performance desktops with scientific visualization capabilities are the chief methods we use to cognitively grasp the quantity of data produced by supercomputers. The Cold War arms race has been eclipsed by an HPC race, and we are living through a radical refactoring of the time it takes to create new knowledge—and, concurrently, the time it takes to learn how much we don't know. The IEEE/ACM SC conference animates the community and allows us to see what knowledge changes, and what knowledge stays stable over time.

To review and summarize the key developments and achievements of HPC over the past 30 years, we have invited six well-known experts—Gordon Bell, Jack Dongarra, Bill Johnston, Horst Simon, Erich Strohmaier, and Mateo Valero—all of whom offer complementary perspectives on the past three to four decades of supercomputing. Their histories connect us in community.

COMPUTER: Looking back at the early years of electronic digital computers, what do you see as the turning points and defining eras of supercomputing that help chronicle the growth of the HPC community and the SC conference at its 30th anniversary?

GORDON BELL: In 1961, after I visited Lawrence Livermore National Laboratory with the Univac LARC, and Manchester University with the Atlas prototype, I began to see what supercomputing was about from a design and user perspective—namely, it was designing at the edge of the feasibility envelope using every known technique, doing everything known to be feasible for performance (for example, parallel units, lookahead, speculative execution). In pre-SC88, there were trials and failures, such as violations of Amdahl's Law1 by Single Instruction Multiple Data (SIMD) and other architectures, or pushing technology that failed to reach critical production (such as GaAs). At the beginning of the first generation of commercial computing, Seymour Cray joined the Control Data Corporation (CDC) as a founder and quickly demonstrated a proclivity for building the highest performance computer of the day; in essence, he defined and established the tri-decade of supercomputing and the market. The IBM Stretch was one of these three 1960s computers aimed to achieve over an order of magnitude performance increase over the largest commercial state-of-the-art machines.

After the initial CDC 1604 (1960) introduction, Seymour proceeded to build computers without peer, including the CDC 6600 in 1965 and CDC 7600 in 1969. He then formed Cray Research to introduce the vector processor Cray 1 (1976), which was followed by the multiprocessor Cray SMP (1985), Y-MP (1988), and last, the C90 (1991). From 1965 through 1991, the Cray architectures defined and dominated computer design and the market, which included CDC, Cray Research, Fujitsu, Hitachi, IBM, and NEC.


MATEO VALERO: I see the rise, fall, and resurgence of vector processors as the turning points of supercomputing. Vector processors execute instructions whose operands are complete vectors. This simple idea was so revolutionary, and the implementations were so efficient, that the resulting vector supercomputers reigned supreme as the fastest computers in the world from 1975 to 1995. Vector processors exploit data-level parallelism elegantly, they could hide memory latency very well, and they are energy efficient since they did not need to fetch and decode as many instructions. After some early prototypes, the vector supercomputing era started with Seymour Cray and his Cray 1.2 The company continued with the Cray 2, Cray X-MP, and Cray T90, and finalized with the Cray X1 and X1E. Cray Research was building vector processors for 30 years. The implementation was very efficient partially due to the radical technologies that were used at the time, such as transistor-based memory instead of magnetic core and extra-fast Emitter-Coupled Logic (ECL) instead of CMOS, which enabled a very high clock rate. This landscape was soon made more heterogeneous with the vector implementations from Japan by Hitachi, NEC, and Fujitsu. The vector supercomputers introduced many innovations, from using massively multi-banked high-bandwidth memory systems, to multiprocessors with fast processor synchronization through registers, to accessing memory by using scatter/gather instructions. For example, for the first TOP500 list in 1993, 310 of the 500 machines listed were vector processors. But by 2007, only 4 vector processors remained in the TOP500 list.

COMPUTER: What were the roots and reasons for starting the TOP500 project?

JACK DONGARRA: The TOP500 project (www.top500.org) has been tracking information about installations of supercomputers since 1993. A list of the 500 largest installations and some of their main system characteristics is published twice a year. Its simplicity has invited many critics but has also allowed it to remain useful during the advent and reigns of giga-, tera-, and petascale computing. Systems are ranked by their performance on the Linpack benchmark,3 which solves a dense system of linear equations. Over time, the data collected allowed early identification and quantification of many trends related to computer architectures used in HPC.4,5

COMPUTER: How was the Linpack benchmark selected as a measure for the TOP500 ranking of supercomputers?

ERICH STROHMAIER: In the mid-1980s, Hans W. Meuer started a small and focused annual conference series about supercomputing, which soon evolved to become the International Supercomputing Conference (www.isc-hpc.com). During the opening sessions of these conferences, he used to present statistics collected from vendors and colleagues about the numbers, locations, and manufacturers of supercomputers worldwide. Initially, it was relatively obvious which systems should be considered as supercomputers. This label was reserved for vector processing systems from companies such as Cray, CDC, Fujitsu, NEC, and Hitachi, which competed in the market, and each claimed theirs had the fastest system for scientific computation by some selective measure. However, at the end of that decade the situation became increasingly more complicated as smaller vector systems became available from some of these vendors and from new competitors (Convex, IBM), and as massively parallel systems (MPPs) with SIMD architectures (Thinking Machines, MasPar) or MIMD systems based on scalar processors (Intel, nCUBE, and others) entered the market. Simply counting the installation base for these systems of vastly different scales did not produce any meaningful data about the market. A new criterion for determining which systems could be counted as supercomputers was needed. After two years of experimentation with various metrics and approaches, Hans W. Meuer and I convinced ourselves that the best long-term solution was to maintain a list of the systems in question, ranking them based on the actual performance the system had achieved when running the Linpack benchmark. Based on our previous market studies, we were confident that we could assemble a list of at least 500 systems that we had previously considered supercomputers. This determined our cutoff.

COMPUTER: There were other drivers beyond Cold War competition by the late 1970s and early 1980s, including the revolution in personal computing. What events spurred the funding of supercomputing in particular?

BELL: In 1982, Japan's Ministry of Trade and Industry established the Fifth Generation Computer Systems research program to create an AI computer. This stimulated DARPA's decade-long Strategic Computing Initiative (SCI) in 1983 to advance computer hardware and artificial intelligence. SCI funded a number of designs, including Thinking Machines, which was a key demonstration for displacing transactional memory supercomputers. Also, in 1982, an NSF/Intel-funded Caltech hypercube-connected multicomputer with 64 Intel microprocessor computers, created by Charles Seitz and Geoffrey C. Fox, was first operated to demonstrate efficacy and efficiency and to stimulate further development, including, in 1985, commercial products from Intel and nCUBE.

In 1987, using a 1024-node nCUBE, Robert E. Benner, John L. Gustafson, and Gary R. Montry at Sandia National Labs won the first Gordon Bell Prize,6 which was established to recognize progress in parallelism, by showing that with sufficiently large problems the serial overhead time could be proportionally reduced to allow almost perfect speedups.
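The scaled-speedup reasoning behind that result can be made concrete with a short sketch. The values below are illustrative only (the serial fraction and processor count are hypothetical, not the Sandia team's measurements); the sketch contrasts Amdahl's fixed-size speedup with the scaled speedup later formalized as Gustafson's law, in which the serial share of the work shrinks as the problem grows with the machine.

# Illustrative comparison of fixed-size (Amdahl) and scaled (Gustafson) speedup.
# The serial fraction s and processor count p are made-up example values.

def amdahl_speedup(s, p):
    """Fixed problem size: speedup = 1 / (s + (1 - s) / p)."""
    return 1.0 / (s + (1.0 - s) / p)

def gustafson_speedup(s, p):
    """Problem scaled with p: speedup = p - s * (p - 1)."""
    return p - s * (p - 1)

p = 1024   # processors (e.g., a 1024-node nCUBE-class machine)
s = 0.01   # serial fraction of the work on one processor

print(amdahl_speedup(s, p))     # ~91: the serial part dominates a fixed-size run
print(gustafson_speedup(s, p))  # ~1014: near-perfect speedup when the problem grows

With a fixed problem, a 1 percent serial share caps the speedup near 91; letting the problem grow with the machine recovers nearly all of the 1,024-fold parallelism, which is the effect the Sandia runs demonstrated.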

30 YEARS OF SC—ROUNDTABLE PANELISTS

Gordon Bell is a Researcher Emeritus at Microsoft, and he was the vice president of R&D at Digital Equipment Corporation (DEC), where he led the development of the first mini- and time-sharing computers. As the first NSF director for computing (CISE), he led the NREN (Internet) creation. Bell has worked on and written articles and books about computer architecture, high-tech startup companies, and lifelogging. He is a member of ACM, the American Academy of Arts and Sciences, IEEE, the National Academy of Engineering, the National Academy of Sciences, and the Australian Academy of Technological Sciences and Engineering. In 1991, Bell received the US National Medal of Technology. He is a founding trustee of the Computer History Museum in Mountain View. Contact him at [email protected].

Jack Dongarra participated as both a panelist and coauthor for this Virtual Roundtable. Please see the "About the Authors" section for his biographical information.

William E. (Bill) Johnston, now retired, was formerly a Senior Scientist and advisor to ESnet—a national network serving the National Research Laboratories and science programs of the US Department of Energy (DOE), Office of Science. Johnston led ESnet from 2003 to 2008, during which time a complete reanalysis of the requirements of the DOE science programs that ESnet supports was completed. Johnston has worked in the field of computing for more than 50 years, and he taught computer science at San Francisco State University at both undergraduate and graduate levels. He has a Master's in mathematics and physics from San Francisco State University. Contact him at [email protected].

Horst Simon is deputy laboratory director for research and chief research officer (CRO) of Lawrence Berkeley National Laboratory (LBNL). His research interests include the development of sparse matrix algorithms, algorithms for large-scale eigenvalue problems, and domain decomposition algorithms. Simon's recursive spectral bisection algorithm is a breakthrough in parallel algorithms. He has been twice honored with the prestigious Gordon Bell Prize, most recently in 2009 for the development of innovative techniques that produce new levels of performance on a real application (in collaboration with IBM researchers), and in 1988 in recognition of superior effort in parallel processing research (with others from Cray and Boeing). Simon has attended every SC conference and contributed to many papers, panels, and tutorials. He is also one of the TOP500 authors. Contact him at [email protected].

Erich Strohmaier cofounded the TOP500 project with Prof. Dr. Hans W. Meuer in 1993 and has served as coeditor since. He is a Senior Scientist and leads the Performance and Algorithms Research Group at Lawrence Berkeley National Laboratory (LBNL). His research focuses on performance characterization, evaluation, modeling, and prediction for high-performance computing (HPC) systems and on the analysis and optimization of data-intensive large-scale scientific workflows. Strohmaier received a PhD in theoretical physics from the University of Heidelberg. He was awarded the 2008 ACM Gordon Bell Prize for parallel processing research in algorithmic innovation and was named a Fellow of the ISC conference in 2017. He is a member of ACM, IEEE, and the American Physical Society (APS). Contact him at [email protected].

Mateo Valero is a professor at the Technical University of Catalonia (UPC) and the director of the Barcelona Supercomputing Center. His research focuses on high performance architectures. He has published over 700 papers, served in the organization of more than 300 international conferences, and given more than 500 invited talks. Valero has been honored with the 2007 IEEE/ACM Eckert-Mauchly Award, the 2015 IEEE Seymour Cray Award, the 2017 IEEE Charles Babbage Award, the 2009 IEEE Harry Goode Award, and the 2012 ACM Distinguished Service Award. He is an IEEE and ACM Fellow; he holds Doctor Honoris Causa degrees from 9 universities; and he is a member of 8 academies. In 2018, Valero was honored with the "Condecoración de la Orden Mexicana del Águila Azteca," the highest recognition granted by the Mexican Government. Contact him at [email protected].


Figure 1. (a) SC91 network topology between Pittsburgh Supercomputer Center (PSC) and Albuquerque, New Mexico, for the remote visualization demonstration; (b) remote (show floor) user interface for a real-time visualization of the human brain using distributed supercomputing resources across the NSFnet between the PSC and Albuquerque at SC91.

In retrospect, that first SC conference in 1988 started at exactly the right technological time to stimulate, share, and chronicle the development of the "post-Cray" era of computing. Even though the term "supercomputer" appeared in print in the early '70s and by 1980 was understood to be the largest computer of the day, the 1988 conference served to establish the industry as more than a "niche," but, more importantly, it communicated the advances of three decades.

COMPUTER: How have networking and supercomputing evolved over the years?

BILL JOHNSTON: Supercomputing and high-speed networking have evolved sometimes independently—though they inform each other—and sometimes in concert. In the early days (1980s), network access to supercomputers was limited to remote job entry (a remote card reader) and basic job control at a few hundred bps. This was followed by implementation of the File Transfer Protocol (FTP) on supercomputers in the early 1990s as a means of getting remotely located data to supercomputer centers.

COMPUTER: Is there an event or development in supercomputer networking that stands out as a pivotal achievement?

JOHNSTON: A demonstration at SC91 was arguably the first use of wide area networks to support a high-speed, TCP/IP-based distributed supercomputer application. The overall topology of this network is shown in Figure 1a. The challenge was real-time remote visualization of a large, complex scientific dataset that was a high-resolution MRI scan of a human brain (see Figure 1b). The approach was to use a Thinking Machines CM-2 and a Cray Y-MP at the NSF's Pittsburgh Supercomputer Center (PSC) to compute the visualization of the dataset based on input from a workstation at SC91 (in Albuquerque). These parameters were sent to PSC, where the CM-2 and the Cray produced a visualization. This was then sent through a TCP circuit from the Cray into the just-built NSFNet 45 Mbps Internet backbone. NSFNet had for the first time been extended to the SC show floor, and SCinet was first set up to manage the conference networking. The SCinet LAN connected a Sun workstation, where the images were displayed. The 15 (or so) Mbps that was achieved between the Cray and the Sun was sufficient to display about 10-12 frames/sec on the Sun. Typical of distributed applications, many components had to interoperate to produce a functioning system, an especially difficult task in a wide-area network.
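A rough bandwidth-delay calculation shows why ordinary TCP of that era struggled on such a path. The sketch below assumes a 60 ms cross-country round-trip time, an illustrative figure rather than a measured one; the 45 Mbps rate is the NSFNet backbone speed mentioned above.

# Back-of-the-envelope bandwidth-delay product for the SC91-era NSFNet path.
# 45 Mbps is the backbone rate from the text; the 60 ms round-trip time is an
# assumed illustrative value, not a measured one.

link_rate_bps = 45e6    # NSFNet T3 backbone, bits per second
rtt_s = 0.060           # assumed cross-country round-trip time, seconds

window_bytes = link_rate_bps * rtt_s / 8
print(f"window needed to fill the pipe: {window_bytes / 1024:.0f} KB")    # ~330 KB

classic_tcp_max = 65535  # 16-bit TCP window field, bytes
print(f"classic TCP window limit:       {classic_tcp_max / 1024:.0f} KB") # ~64 KB

Filling a 45 Mbps path at that latency needs a receive window of roughly 330 KB, several times the 64 KB ceiling of the original 16-bit TCP window field, which is presumably why a newly defined TCP option (large-window extensions of the kind later standardized in RFC 1323) had to be debugged before the demonstration worked.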

Computer scientists at Lawrence Berkeley Laboratory, PSC, and Cray Research addressed the problems of coupling the various processes running on three different computers, and especially of debugging a newly defined TCP option that made high-speed TCP possible in the wide area.7

COMPUTER: What were some of the technological developments that created new solutions and new problems to solve?

BELL: In 1988, while the efficacy of large-scale parallelism was demonstrated, the problem remained of converting programs that ran on a mono-memory, multiprocessor supercomputer to a system running across 1,000 interconnected slower computers. It took a few more years before the circuit speed of CMOS crossed over the speed of ECL. In June 1993, a 1,024-computer Thinking Machines CM5, operating at a peak of 131 Gflops, executed Linpack at 60 Gflops to be the first-place winner of the first TOP500 supercomputer list. In the same year, Cray Research abandoned plans to deliver their evolutionary 32-processor computer that operated at a peak of 64 Gflops. Ironically, months later in November 1993, the second TOP500 first-place winner was the Fujitsu Numerical Wind Tunnel (NWT) computer, with 140 vector processor computers operating at 120 Gflops. The NWT computer, basically a cluster, held the position through 1996 with 170 processors. The first Intel Sandia cluster, with 3680 computers, was at the top of the June 1994 list. Thus, 1993 can be marked as the beginning of scalable, clustered computing! In the same year, the first draft of the Message Passing Interface (MPI) standard was introduced. A year later, Donald Becker and Thomas Sterling distributed the Beowulf source code that controlled the interconnection and operation of a network of UNIX computers. Thus, at the end of the first five years, all the components were established. Figure 2 shows the situation in terms of parallelism and performance beginning in 1987, with the Bell Prize winners kicking off the transition.

Figure 2. Three decades of performance and parallelism growth. While the Bell Prize winners demonstrated high degrees of parallelism, 1993 was the year a 1024-computer Thinking Machines CM5 dominated performance.

The 1988 through 2018 period can be trivialized by just noting that the number of parallel processing elements or cores went from 1,024 nodes in 1988 to 40,960 nodes with 10,649,600 cores. The power required went from a few kW to 15,370 kW. I have argued with members of the community that the names, including single processor, constellation, MPP, and clusters, were essentially the same—multicomputer clusters. Constellation implied multiprocessor nodes; MPP implied a particular vendor network. An early SIMD was tried, made the list, and was abandoned. The SMP category was ambiguous since it included supercomputers with vector processors and multiple microprocessors that I defined as "multis."

Thus, in 2018 every computer is a multicomputer of some kind, and the performance gains come from evolving the computer nodes with some form of accelerators, beginning with an attached floating-point unit. In 2012 a graphics processing unit (GPU) was added to the Cray Titan to establish it as the architecture du jour. Sunway has evolved the powerful node architecture by building nodes with 260 processing elements (cores), managed by a four-way multiprocessor.

COMPUTER: Taking the past as a reference, how do you see the current and future position of vector processors in the HPC space?

VALERO: The so-called "killer micros"8 wiped out the vector processors from the TOP500 list. This shift to commodity superscalar processors was driven by economics. In particular, the Accelerated Strategic Computing Initiative (ASCI) resulted in supercomputers that were early representatives of this shift: the ASCI Red from Intel, first in the TOP500 from June 1997 to November 1999, and the ASCI White from IBM, number one from June 2000 to November 2001. In any case, although there were few vector processors in the TOP500 list in this era, it was very ably represented by the Earth Simulator vector supercomputer from Japan, which dominated the top spot in 2002 and 2003 after the ASCI supercomputers. It should be added that in this supposedly stagnant era, some select companies such as NEC have continued to design pure vector processors "a la Cray" all the way from 1983 to now.

Although few classical vector processors could remain in the TOP500, their design philosophy continued to influence "killer micro" design and associated accelerators.9


For example, the inclusion of SIMD execution units in microprocessors could be considered as a pseudo-vector unit. The earliest SIMD units operated on short vectors of integer data. However, the SIMD units of today are starting to resemble traditional vector processors with their ever-increasing operand size (Intel AVX-512 operates with 512 bits) as well as their added vector-like functionality, such as support for scatter instructions in the Intel architecture. In parallel with SIMD evolution, the architecture of 3D graphics accelerators in the '90s started evolving from narrow API-driven ASIC accelerators into a more generalized form of compute called SIMT (Single Instruction, Multiple Thread), basically a marriage between massive simultaneous multithreading and SIMD execution. The requirement to execute a SIMT instruction across multiple threads in lockstep in the GPU back-ends made this execution model quite similar to that employed by vector processors. Compared to NVIDIA GPUs, AMD's GPUs resemble vector processors even more with their internal SIMD-vector units. These vector-processor-inspired GPUs laid the groundwork for an important market spanning 3D graphics, HPC, and, more recently, deep learning.

Finally, some select machines with a unique architecture—such as the Roadrunner—borrowed from vector processors too. The Roadrunner was the first petaflops machine, number 1 in the TOP500 list from June 2008 to June 2009, and it featured the Cell microprocessor from IBM.

In the meantime, the classical vector processors staged a comeback by borrowing from the "killer micro" ideas such as out-of-order processing. Led by pioneering academic designs at UC Berkeley10 and UPC Barcelona,11 the idea of designing a "commodity" vector microprocessor became feasible. This then led to multiple tentative proposals by the industry, such as the Tarantula microprocessor from Compaq in 2000, and, finally, to the current "renaissance" of vector microprocessors. Contemporary examples of vector microprocessors include the Intel Knights family of processors, the NEC SX-Aurora,12 and Fujitsu's Post-K supercomputer design.13

COMPUTER: How would you characterize the changes in HPC, especially since the rapid proliferation of microcomputers?

BELL: "Scalability" characterizes this past tri-decade. Clock speed only increased by a factor of 10, and gains were achieved by spending more and by scaling—that is, replicating, adapting, and interconnecting thousands of smaller, powerful computers derived from the off-the-shelf personal computing industry. The net result has been a gain of almost one thousand per decade measured by Linpack—going from 2 Gflops (10^9) in 1988 to a likely 0.12 exaflops (10^18) in 2018, or a factor of 60 million, with thousands of interconnected computers. The past 30 years is in contrast to the first tri-decade plus (1958–1993) that allowed Cray to focus on building the largest commercially feasible single, shared memory, multiple vector processor computer for executing FORTRAN. In 1958, the IBM 709 vacuum tube computer operated at roughly 10 Kflops (10^3), for a tri-decade gain of 2 Gflops/10 Kflops, or a factor of 200,000, with the benefit of a thousand-fold clock increase.

By 1960, all computers were transistorized, enabling higher density and faster clocks. Finally, hardware engineering versus the software and programming challenge delineates the two tri-decades of high performance computing. A summary of the events marking progress over the last 60 years is shown in Figure 3.

COMPUTER: Back in the 1990s, access to expensive supercomputers was a principal driver of the development of TCP/IP and high-speed interconnects. What sponsored network projects come to mind as being noteworthy during that time?

JOHNSTON: In the Corporation for National Research Initiatives (CNRI) Gigabit Testbeds (~1990–1994) projects supported by the NSF and DARPA, the Casa testbed's goal was direct, long distance, high-speed communication between supercomputers. The Los Alamos National Laboratory (LANL) built an HIPPI (supercomputer local network) to SONET (wide-area optical network) gateway to interconnect supercomputers at LANL and SDSC. The 800 Mb/s HIPPI was striped across multiple 155 Mbps SONET channels over a network path that was about 2,000 km long.14

The focus of the project was to interconnect supercomputers into a "metacomputer" built from heterogeneous architecture systems. One goal was to couple an atmospheric circulation model running at one site with an ocean circulation model running at the other site.15

COMPUTER: How did the suitability of benchmarks for supercomputing evolve over time?

HORST SIMON: The simplest and most universal ranking metric for scientific computing is floating-point operations per second (flops). This benchmark would not be chosen to represent the performance of an actual scientific computing application, but it should very coarsely embody the main architectural requirements of scientific computing. We strongly felt that scientific HPC was largely driven by integrated large-scale calculations and therefore decided to avoid any overly simplistic benchmarks, such as embarrassingly parallel codes, which could have ranked systems very high even if they were otherwise unsuited for scientific computing.

First tri-decade of mono-memory computing evolution to supercomputers

1957: FORTRAN (first high-level programming language) introduced for scientific and technical computing
1961: Univac LARC, IBM Stretch, and Manchester Atlas finish the race to build the largest "conceivable" computers
1964: CDC 6600—world's fastest supercomputer until 1969
1967: Amdahl's law is presented and further discussed at the Spring Joint Computer Conference in Atlantic City
1969: CDC 7600 replaces CDC 6600 as world's number one
1976: Cray 1 installed at Los Alamos National Laboratory
1982: Cray X-MP—shared memory vector multiprocessor
1988: Cray 8-processor Y-MP announced, operating at a peak of 4 Gflops

Multicomputer machines become useful and cost-effective

1982/83: The distributed memory Caltech Cosmic Cube becomes operational with 8/64 nodes
1987: nCUBE (1024 nodes) delivers 400–600x speedup on specific applications, and the team at Sandia National Labs wins the first Gordon Bell Prize
1988: First Supercomputing conference
1993: TOP500 established using the Linpack benchmark; the CM5 is the first winner
1994: The Beowulf cluster kit recipe for low-cost multicomputers and the MPI-1 standard are published
1995: ASCI launched, later the Advanced Simulation and Computing (ASC) Program
1997: The ASCI Red (1 Tflops) becomes operational at Sandia National Labs, with 9152 nodes
2002: The Japanese Earth Simulator stays for 3 years as the fastest supercomputer at 35 Tflops
2008: The IBM Roadrunner at Los Alamos National Laboratory breaks the Pflops barrier
2012: Cray Titan (17.6 Pflops) demonstrates the use of GPU and CUDA
2016: The Chinese Sunway TaihuLight supercomputer achieves 93 Pflops with 40,960 nodes composed of 10M cores
2018: At 122 Pflops, Summit is less than an order of magnitude away from the Exaflops barrier

Figure 3. Supercomputer evolution events.

To encourage participation, we wanted a well-performing code that would showcase the capability of systems while not being overly harsh or restrictive. Obviously, no single benchmark can ever hope to represent or approximate performance for the majority of scientific computing applications, as the space of algorithms and implementations is too vast to allow this. The purpose of using a single benchmark in the TOP500 was never to claim such representativeness, but to collect reproducible and comparable performance numbers.

Linpack is nowadays sometimes criticized as an overly simplistic problem. The HPL (High Performance Linpack) code comes with a self-adjustable problem size, which allowed it to be used seamlessly on systems of vastly different sizes. As opposed to many other benchmarks with variable problem sizes, HPL achieves its best performance for large problems, which use all the available memory, and not for small problems, which fit into the cache. This greatly reduces the need for elaborate run-rules and procedures to enforce the full usage of computer systems, which is similar to what many applications do. These features made Linpack the obvious choice for our ranking. Having selected a single benchmark for comparability implies several other limitations. In Linpack, the number of operations is not measured but calculated with a simple formula based on the problem size and the computational complexity of the original algorithm. Therefore, the TOP500 cannot provide any basis for research into algorithmic improvements over time. Linpack and HPL could certainly be used for such comparisons of algorithmic improvements, but not in the context of the TOP500 ranking.

COMPUTER: Have the TOP500 data ever shown a change in the performance growth rate of installed systems?

STROHMAIER: While we started the TOP500 to provide statistics about the HPC market at specific dates, it became immediately clear that the inherent ability to track the evolution of supercomputer systems over time in a systematic way was even more valuable. Any edition of the TOP500 includes a mix of new and older installations, systems, and technologies. Figure 4 shows the changes in performance growth since the introduction of the TOP500 list in 1993.
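The "simple formula" mentioned above is worth spelling out: HPL credits a run of problem size n with the nominal operation count of dense LU factorization and divides it by the wall-clock time, so the reported number never depends on counting executed operations. A minimal sketch, with made-up values for n and the run time:

# How an HPL (High Performance Linpack) result turns into a TOP500 number:
# the flop count is computed from the problem size n, not measured.

def hpl_flops(n):
    """Nominal operation count for solving a dense n x n system via LU."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2

n = 1_000_000   # hypothetical problem size chosen to fill memory
t = 3600.0      # hypothetical wall-clock time in seconds

rmax = hpl_flops(n) / t
print(f"Rmax = {rmax / 1e12:.0f} Tflop/s")   # ~185 Tflop/s for these made-up inputs

Because the numerator is fixed by n, a cleverer algorithm would not raise a machine's ranking, which is exactly the limitation on tracking algorithmic progress described above.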


Figure 4. Performance development of supercomputers as tracked by the TOP500. The green line shows the performance for the highest-performing system on the list, the light blue line for the lowest system (No. 500), and the dark blue line shows the sum of the performance of all systems on the TOP500.

COMPUTER: From a networking perspective, what are some of the challenges that the community has encountered? What models and architectural approaches have been developed within the HPC community to mitigate these issues for the scientific user?

JOHNSTON: There were several reasons ten years ago why remote user data transfer rates to supercomputers had not significantly increased. LAN network devices are frequently poorly configured for, or even incapable of, receiving high-speed data streams from WAN devices. Storage systems have the ability to move data at high speed in the LAN but are rarely configured to move data at high speed in the WAN environment. Site security at universities and laboratories was typically handled by a (relatively low performance) firewall through which all traffic had to pass to get to computing systems on campus. This is not a problem for thousands of simultaneous small data streams (e.g., web traffic), but it is a severe impediment for high-speed, long-duration streams for data-intensive science.

To achieve end-to-end high-speed data throughput for large-volume science data, all of these issues had to be addressed. Discussions between ESnet (DOE's Office of Science's WAN network) and the NERSC supercomputer center in the early part of 2000 established some basic principles from which ESnet developed a network architecture called the "ScienceDMZ" that addressed the issues. The ScienceDMZ is a special campus network domain that is built outside the site perimeter but directly adjacent to the site LAN so that it can share LAN connections with the site. It consists of a WAN-capable network device and a small number of high-performance data transfer systems ("data transfer nodes" [DTNs]). The DTNs typically also have a connection to the campus LAN that does not go through the site firewall, but data transfers in either direction have to be initiated from within the site. The control channels for these transfers go through the site firewall. Cybersecurity within the ScienceDMZ is accomplished by well-understood server configurations on the DTNs that only run software needed to do data transfer. Access control is managed with an access control list (ACL) on the ScienceDMZ WAN network device. These ACLs restrict access to external sites that are identified as collaborators with a valid reason to exchange data with a scientist on campus. This concept has been very successful and is now deployed at more than a hundred laboratories, research universities, and supercomputer centers.16

The national research and education networks (NRENs) of the Americas, Europe, and Southeast Asia have extended their multi-hundred Gbps backbones across the Atlantic and Pacific oceans, providing high-speed data access internationally. (Transatlantic R&E bandwidth is now at a record-breaking 740 Gbps.) Such high-speed networks are essential for getting very large amounts of data from instruments to supercomputers.17

ESnet has deployed 400 Gbps link technology, providing NERSC computers with remote access to cache disks and mass storage systems. A similar approach is used at CERN, where the local disk cache is divided across the CERN Geneva site and the Wigner Data Center in Budapest. This technology, probably at the Tbps level, will also likely be used to connect the next generation Linac Coherent Light Source at the SLAC laboratory to NERSC.

Since the 1980s, network access to supercomputers, and the corresponding ability to move vast amounts of data to and from supercomputers, has increased by more than nine orders of magnitude.
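The "nine orders of magnitude" figure follows directly from the endpoints Johnston gives: remote job entry at a few hundred bits per second in the early 1980s versus the 400 Gbps link technology of today. A quick check, treating "a few hundred bps" as roughly 300 bps (an assumed value):

# Quick check of the "nine orders of magnitude" claim: from remote-job-entry
# links of a few hundred bps to 400 Gbps ESnet links.
import math

early_bps = 300    # "a few hundred bps" in the early 1980s (assumed value)
today_bps = 400e9  # 400 Gbps link technology mentioned above

print(f"growth factor: {today_bps / early_bps:.2e}")                     # ~1.33e+09
print(f"orders of magnitude: {math.log10(today_bps / early_bps):.1f}")   # ~9.1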

82 COMPUTER WWW.COMPUTER.ORG/COMPUTER on improvements in architectures, software, operating systems, and net- work technology, much of which was SUPERCOMPUTING HISTORY enabled by research and development funding from science-oriented gov- PRESENTATIONS ernment agencies. »» Gordon Bell, “View of History of Supercomputers,” presentation, Law- COMPUTER: What are you looking rence Livermore National Lab, 24 April 2013; https://www.youtube for primarily in an additional comple- .com/watch?v=e5UbGgRGGOk. mentary benchmark for the TOP500? »» Gordon Bell, “Three Decades of the Gordon Bell Prize,” presentation, Frontiers of Computing, March 2017; https://www.youtube.com DONGARRA: Most requests for new /watch?v=NZIG0o3_3No. benchmarks usually center on the »» Gordon Bell, “Marking 30 Years’ History of the Gordon Bell Prize,” presenta- argument that Linpack is—at least tion, SC Conference, Nov. 2017; https://youtu.be/4LCXbpssV1w. at present—a poor proxy for applica- »» Jack Dongarra, Erich Strohmaier, and Horst Simon, “Top500: Past, Present, tion performance and that a “better” and Future,” presentation, SC Conference, Nov. 2017; https://youtu.be benchmark is needed. When HPL /eIZZbfrW87M. gained prominence as a performance metric in the early 1990s, there was a strong correlation between its predic- tions of system rankings and the rank- ing realized by full-scale applications. ranking would need to be developed, databases, personalized medicine or In these early years, computer system which is very like the situation for data- deep neural networks? vendors pursued designs that would intensive computing. increase HPL performance, thus im- The TOP500 collection has enjoyed VALERO: Modern applications such as proving overall application function. incredible success as a metric for the deep neural networks (DNNs), database However, many aspects of the phys- HPC community. The trends it ex- management systems (DBMSs) for big ical world are modeled with PDEs, poses, the focused optimization efforts data, and personalized medicine (PM) which help predictive capability, aid- it inspires and the publicity it brings are much more amenable for efficient ing scientific discovery and engineer- to our community are very import- execution on vector processors. Note ing optimization. The High-Perfor- ant. As we are entering a market with that current DNN applications typi- mance Conjugate Gradients (HPCG) growing diversity and differentiation cally feature multiply add operations benchmark18 is a complement to the of architectures, a careful selection of on huge vectors of data and can ben- HPL benchmark and now part of the appropriate metrics and benchmarks efit from vector architectures as well TOP500 effort. It is designed to exer- matching the needs of our applica- as DBMSs and PM applications such as cise computational and data access tions is more necessary than ever. gene sequencing that operate on very patterns that more closely match a HPL encapsulates some aspects of long vectors with integer operations. different yet broad set of important ap- real applications such as strong de- plications, and to encourage computer mands for reliability and stability of BELL: The massive computer cluster system designers to invest in capabili- the system, for floating point perfor- with highly nodes ties that will impact the collective per- mance, and to some extent network describes today’s architecture path. formance of these applications. 
performance, but no longer tests mem- Will this general structure be ade- ory performance adequately. Alterna- quate to get to exaflops and beyond, COMPUTER: How do you see the fu- tive benchmarks as a complement to with a clock speed stalled at a few ture developments of supercomputer HPL could provide corrections to indi- GHz? So far two paths have emerged performance rankings? vidual rankings and improve our un- based on advances in AI including the derstanding of systems but are much construction of large neural nets for SIMON: Clearly, the current approach less likely to change the magnitude of recognition: specialized chips, such as for compiling the TOP500 cannot ad- observed technological trends. Google’s TPU and FPGAs programmed dress truly novel architectures such for the application. as neuromorphic systems or quantum COMPUTER: Do you see any emerging computers. Should a market for such applications outside of the classical COMPUTER: Networking is just one el- systems develop, very domain-spe- HPC domain? How about the appli- ement of an HPC infrastructure as re- cific approaches to benchmarking and cability of supercomputing ideas to flected in the SC conference topic areas
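Valero's point about DNN kernels can be seen in miniature: a fully connected layer is just multiply-add arithmetic over long operand vectors, the access pattern that vector units (and the SIMT back-ends of GPUs) stream most efficiently. The NumPy sketch below uses arbitrary, made-up layer sizes:

# A dense (fully connected) layer reduced to its arithmetic: long-vector
# multiply-adds. Sizes are arbitrary illustrative values.
import numpy as np

rng = np.random.default_rng(0)
batch, n_in, n_out = 64, 4096, 1024

x = rng.standard_normal((batch, n_in))   # activations
w = rng.standard_normal((n_in, n_out))   # weights
b = rng.standard_normal(n_out)           # biases

y = x @ w + b    # one matrix multiply-add: ~2*batch*n_in*n_out flops,
                 # streamed through vector/SIMT units on HPC hardware
print(y.shape)   # (64, 1024)

The same shape of work, one operation's semantics spread across thousands of contiguous operands, is what the database scans and gene-sequencing kernels he mentions have in common with deep learning.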


BELL: The massive computer cluster with highly parallel nodes describes today's architecture path. Will this general structure be adequate to get to exaflops and beyond, with a clock speed stalled at a few GHz? So far, two paths have emerged based on advances in AI, including the construction of large neural nets for recognition: specialized chips, such as Google's TPU, and FPGAs programmed for the application.

COMPUTER: Networking is just one element of an HPC infrastructure, as reflected in the SC conference topic areas that have come to include not just performance and networking, but also storage, data analytics, and visualization. What project do you consider an exemplar of a current state-of-the-art infrastructure?

JOHNSTON: By far the largest scientific experiment today is the Large Hadron Collider at CERN. Data from the several detectors/experiments on the LHC are distributed to several thousands of scientists at some 200 institutions in more than 40 countries for analysis. This results in petabytes/day of data movement. Some of this involves the use of supercomputers, but even more so, the technology and skills needed to accomplish this sort of data management are moving into the supercomputing environment as supercomputers are increasingly used to manage and analyze the vast amounts of data from modern scientific instruments. These instruments are almost always remote from supercomputers and involve collaborations that are widely distributed. Moving petabytes of data into and out of supercomputer centers from remote experiments required new technologies.

ABOUT THE AUTHORS

JACK DONGARRA holds an appointment at the University of Tennessee, Oak Ridge National Laboratory, and the University of Manchester. He specializes in numerical algorithms in linear algebra, parallel computing, use of advanced computer architectures, programming methodology, and tools for parallel computers. He was awarded the IEEE Sidney Fernbach Award in 2004; in 2008 he was the recipient of the first IEEE Medal of Excellence in Scalable Computing; in 2010 he was the first recipient of the SIAM Special Interest Group on Supercomputing's award for Career Achievement; in 2011 he was the recipient of the IEEE Charles Babbage Award; and in 2013 he received the ACM/IEEE Ken Kennedy Award. He is a Fellow of the AAAS, ACM, IEEE, and SIAM; he is a foreign member of the Russian Academy of Science and a member of the US National Academy of Engineering. Contact him at [email protected].

VLADIMIR GETOV is a professor of distributed and high-performance computing (HPC) and leader of the Distributed and Intelligent Systems research group at the University of Westminster. His research interests include parallel architectures and performance, energy-efficient computing, autonomous distributed systems, and HPC programming environments. Getov received a PhD and DSc in computer science from the Bulgarian Academy of Sciences. In 2016 he was the recipient of the IEEE Computer Society Golden Core Award. Getov is a Senior Member of IEEE, a member of ACM, a Fellow of the British Computer Society, and Computer's area editor for HPC. Contact him at [email protected].

KEVIN WALSH is a student of the history of HPC. He is the supercomputing history project lead for the 30th anniversary of the IEEE/ACM SC conference in 2018. Previously a systems engineer at the San Diego Supercomputer Center, he is currently at the Institute of Geophysics and Planetary Physics at the Scripps Institution of Oceanography. Walsh received a BA in history of science and an MAS in computer science and engineering at the University of California, San Diego. He is a member of ACM, the IEEE Computer Society, and the Society of History of Technology. Contact him at [email protected].

COMPUTER: How do you see the holistic approach between applications, programming models, runtime systems, and architecture in future supercomputers?

VALERO: Looking forward, we see three developments that might facilitate the resurgence of vector processors: technology evolution, the emergence of modern applications, and runtime-aware architecture. Let us consider each in turn. In a back-to-the-future sense, technological advances, similar to those of the early vector supercomputer period, drive the new vector renaissance. For example, memory stacking technology such as HBM, which delivers high-bandwidth DRAM systems, is hugely advantageous for vector designs since it provides a good technology solution to the issue of the high bandwidth requirements of vector processors. We envisage that instruction set architectures (ISAs) supporting operations on long vectors or matrix structures will play an important role in the future. The high semantic level of such operations and their tight coupling to modern runtimes will allow programmers to convey semantic information they already have (on locality, dependences) to the architecture, reducing the need to rediscover it as has been done in current scalar ISAs. This will allow decoupling the frontend and backend of processors and explicitly managing locality (long register files, "command vectors"), optimizing the memory throughput.

84 COMPUTER WWW.COMPUTER.ORG/COMPUTER Killer Micros,” vector processors never 9. M. Valero, R. Espasa, and J.E. Smith, disappeared and now they could be the “Vector Architectures: Past, Present crème de la crème of supercomputers and Future,” Proc. ACM Int’l Conf. once again. In addition, ISA operations Supercomputing (ICS 98), 1998, that represent a very large amount of pp. 425-432. work offer the possibility to keep active 10. K. Asanovic, “Vector microproces- a large number of functional units. This sors,” PhD Thesis, University of will allow the development of energy California, Berkeley, 1998; http:// efficient systems for dedicated highly people.eecs.berkeley.edu/~krste critical applications such as AI applied /thesis.html. to personalized medicine or self-driven 11. R. Espasa, “Advanced Vector Archi- vehicles. Programming models and tectures,” PhD Thesis, Universitat Po- runtime systems will need to adapt and litechnica de Catalunya, 1997; http:// support this new approach driving the citeseerx.ist.psu.edu supercomputing performance well be- /viewdoc/download?doi=10.1.1.51 yond the exascale barrier. .9455&rep=rep1&type=pdf. 12. “NEC SX- TSUBASA—Vector REFERENCES Engine,” NEC Corporation, 1994- 1. G.M.Amdahl, “Computer Architecture and 2018; www..com/en/global Call Amdahl’s Law,” Computer, vol. 46, no. 12, /solutions/hpc/sx/vector_engine for 2013; pp. 38-46. .html. 2. “Cray-1 Computer System Hardware 13. T. Shimizu, “Post-K Supercomputer Articles Reference Manual,” Publication with Fujitsu’s Original CPU, Powered IEEE Software seeks practical, readable 2240004, Rev C, 4 Nov. 1977, Cray Re- by ARM ISA,” Teratec Forum; www articles that will appeal to experts and search, Inc.; http://history-computer .teratec.eu/library/pdf/forum/2017 nonexperts alike. The magazine aims .com/Library/Cray-1_Reference%20Manual. /Presentations/04_Toshiyuki to deliver reliable, useful, leading-edge pdf. _Shimizu_Fujitsu_Forum_Teratec information to software developers, engineers, 3. J.J. Dongarra, P. Luszczek, and A. Pe- _2017.pdf. and managers to help them stay on top of titet, “The Linpack Benchmark: Past, 14. “The Gigabit Testbed Initiative, Final rapid technology change. Topics include Present and Future,” Concurrency Report,” CNRI, Dec. 1996; www.cnri requirements, design, construction, tools, Computat.: Pract. Exper., vol. 15, 2003, .reston.va.us/gigafr. project management, process improvement, pp. 803–820; doi: 10.1002/cpe.728. 15. W. Minkowycz, “Advances in Numer- maintenance, testing, education and training, 4. E. Strohmaier, et al., “The Market- ical Heat Transfer, Volume 2,” CRC quality, standards, and more. place of High-Performance Com- Press, 5 Dec. 2000. puting,” Parallel Computing, vol. 25, 16. E. Dart, et al., “The Science DMZ,” Author guidelines: no. 1517, 1999. Proc. Int’l Conf. High Performance www.computer.org/software/author 5. E. Strohmaier, et al., “The TOP500 List Computing, Networking, Storage and Further details: [email protected] of Supercomputers and Progress in Analysis (SC 13), 2013. www.computer.org/software High Performance Computing,” Com- 17. “North Atlantic Network Collabora- puter, vol. 48, no. 11, 2015, pp. 42–49. tion Building Foundation for Global 6. G. Bell, et al., “A Look Back on 30 Network Architecture,” Energy Sci- Years of the Gordon Bell Prize,” Int’l J. ences Network, 17 Apr. 2017; http:// High Performance Computing Applica- es.net/news-and-publications tions, vol. 31, no. 6, 2017, pp. 469–484. /esnet-news/2017/north-atlantic IEEE 7. W. 
Johnston, “High-Speed, Wide -network-collaboration-building Area, Data Intensive Computing: -foundation-for-global-network A Ten-Year Retrospective,” Proc. -architecture. 7th IEEE Symp. on High Performance 18. J.J. Dongarra, M.A. Heroux, and Distributed Computing, 1998. P. Luszczek, “High-performance 8. J. Markoff, “The Attack of the ‘Killer conjugate-gradient benchmark: A Micros,’” The New York Times, 6 May new metric for ranking high-perfor- Read your subscriptions 1991; www.nytimes.com/1991 mance computing system,” Int’l J. through the myCS publications portal at /05/06/business/the-attack-of High Performance Computing Applica- http://mycs.computer.org -the-killer-micros.html. tions, vol. 30, no. 1, 2015, pp. 3–10.
