The TOP500 List and Progress in High-Performance Computing
COVER FEATURE: GRAND CHALLENGES IN SCIENTIFIC COMPUTING

The TOP500 List and Progress in High-Performance Computing

Erich Strohmaier, Lawrence Berkeley National Laboratory
Hans W. Meuer, University of Mannheim
Jack Dongarra, University of Tennessee
Horst D. Simon, Lawrence Berkeley National Laboratory

For more than two decades, the TOP500 list has enjoyed incredible success as a metric for supercomputing performance and as a source of data for identifying technological trends. The project's editors reflect on its usefulness and limitations for guiding large-scale scientific computing into the exascale era.

The TOP500 list (www.top500.org) has served as the defining yardstick for supercomputing performance since 1993. Published twice a year, it compiles the world's largest installations and some of their main characteristics. Systems are ranked according to their performance on the Linpack benchmark, which solves a dense system of linear equations. Over time, the data collected for the list has enabled the early identification and quantification of many important technological and architectural trends related to high-performance computing (HPC).

Here, we briefly describe the project's origins, the principles guiding data collection, and what has made the list so successful during the two-decades-long transition from giga- to tera- to petascale computing. We also examine the list's limitations. The TOP500's simplicity has invited many criticisms, and we consider several complementary or competing projects that have tried to address these concerns. Finally, we explore several emerging trends and reflect on the list's potential usefulness for guiding large-scale HPC into the exascale era.

TOP500 ORIGINS

In the mid-1980s, coauthor Hans Meuer started a small and focused annual conference that has since evolved into the prestigious International Supercomputing Conference (www.isc-hpc.com). During the conference's opening session, Meuer presented statistics about the numbers, locations, and manufacturers of supercomputers worldwide, collected from vendors and colleagues in academia and industry.

Initially, it was obvious that the supercomputer label should be reserved for vector processing systems from companies such as Cray, CDC, Fujitsu, NEC, and Hitachi that each claimed to have the fastest system for scientific computation by some selective measure. By the end of the decade, however, the situation became increasingly complicated as smaller vector systems became available from some of these vendors as well as new competitors (Convex, IBM), and as massively parallel systems with SIMD architectures (Thinking Machines, MasPar) and MIMD systems based on scalar processors (Intel, nCube, and others) entered the market. Simply counting the installation base for systems of such vastly different scales did not produce any meaningful data about the market. New criteria for which systems constituted supercomputers were needed.

After two years of experimenting with various metrics and approaches, Meuer and coauthor Erich Strohmaier concluded that the best way to provide a consistent, long-term picture of the supercomputer market was to maintain a list of systems up to a predetermined cutoff number, ranked according to their actual performance. On the basis of previous studies, they determined that at least 500 qualified systems could be assembled, and so the TOP500 list was born.

FIGURE 1. Supercomputer performance over time as tracked by the TOP500. The red and orange lines show performance of the first and last systems, and the blue line average performance of all systems. Dashed lines are fitted exponential growth curves before and after 2008 for the orange line and before and after 2013 for the blue line. (In the figure, the No. 1 system rises from 59.7 Gflops in 1993 to 33.9 Pflops in 2015, and the No. 500 system from 400 Mflops to 165 Tflops.)
RANKING SUPERCOMPUTER PERFORMANCE

The simplest and most universal ranking metric for scientific computing is floating-point operations per second (flops). More specialized metrics such as time to solution, time per iteration, or time per gridpoint can be more meaningful in particular application domains and allow more detailed comparisons (for example, between alternative algorithms with different complexities), but they are harder to define properly, more restricted in their use, and, because of their specialization, not applicable to the overall scientific computing market.

In addition to limiting performance measurement to flops, we decided to use actual measured values to avoid contaminating the collected data with unsubstantiated and often outlandish performance "estimates" for systems that did not reliably function or even exist. In principle, measured results from different benchmarks or applications could be used to rank different systems, but this would lead to inconsistent values and make comparisons difficult. To address this problem, we opted to select and mandate the use of a single benchmark for all TOP500 editions.

This benchmark would not represent the performance of an actual scientific application but would coarsely embody scientific computing's main architectural requirements. Because scientific computing is primarily driven by integrated large-scale calculations, we decided to avoid simplistic benchmarks, such as embarrassingly parallel workloads, that could lead to very high rankings for systems otherwise unsuitable for scientific computing. Instead, we sought a benchmark that would showcase systems' capabilities without being overly harsh or restrictive. Overall, the collected data should provide reasonable upper bounds for actual performance while penalizing systems unable to support a large fraction of scientific computing applications.

Obviously, no single benchmark can ever hope to represent or approximate performance for most scientific computing applications, as the space of algorithms and implementations is too vast. The purpose of using a single benchmark in the TOP500 was never to claim such representativeness but to collect reproducible and comparable metrics. Using a single benchmark that does not utilize all the system components necessary for most scientific applications, or that maps better to particular computer architectures, could lead to misleadingly high performance numbers for some systems, incorrectly indicating those systems' suitability for scientific computing. To minimize such implicit bias, we decided that the benchmark should exercise all major system components and be based on a relatively simple algorithm that allows optimization for a wide range of architectures.

LINPACK

An evaluation of benchmarks suitable for supercomputing in the early 1990s found that Linpack had the most documented results by a large margin and thus allowed immediate ranking of most of the systems of interest. The NAS Parallel Benchmarks (NAS PB) were also widely used, as most of them simulated actual application performance more closely, but none of them provided enough results to rank more than a small percentage of the systems.

Linpack solves a dense system of linear equations, which today is sometimes criticized as an overly simplistic problem. However, the benchmark is by no means embarrassingly parallel, and it worked well with respect to reducing the rankings of loosely coupled architectures, which were of limited use to scientific computing in general. The High-Performance Linpack (HPL) implementation comes with a self-adjustable problem size, which allows it to be used seamlessly on systems of vastly different sizes, in contrast to the discrete, fixed sizes of the NAS PB. Unlike many other benchmarks with variable problem sizes, HPL achieves its best performance on large-scale problems that use all of a system's available memory, not on small problems that fit into the cache. This greatly reduces the need for elaborate problem-size tuning. Optimized implementations of the benchmark are permitted, provided that they do not reduce the number of floating-point operations performed. The TOP500 therefore cannot provide any basis for research into algorithmic improvements over time. Linpack and HPL could certainly be used to compare algorithmic improvements, but not in the context of the TOP500 ranking.
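To make concrete what a Linpack-style measurement involves, the sketch below times the solution of a dense n-by-n system and converts the benchmark's standard operation count of 2/3*n^3 + 2*n^2 flops into a Gflop/s figure. This is a minimal single-node illustration of the principle in Python with NumPy, not the actual HPL code; the function name, default problem size, and accuracy threshold are our own choices.

import time
import numpy as np

def linpack_like_gflops(n=2000, seed=0):
    """Time a dense solve of Ax = b and report Gflop/s using the
    standard Linpack operation count of 2/3*n^3 + 2*n^2."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal(n)

    start = time.perf_counter()
    x = np.linalg.solve(a, b)  # LU factorization plus triangular solves
    elapsed = time.perf_counter() - start

    # Scaled residual check, in the spirit of HPL's verification step.
    residual = np.linalg.norm(a @ x - b) / (np.linalg.norm(a) * np.linalg.norm(x))
    if residual > 1e-10:
        raise RuntimeError("solution failed the accuracy check")

    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return flops / elapsed / 1e9

print(f"{linpack_like_gflops():.1f} Gflop/s")

For an actual TOP500 submission, HPL performs this same computation distributed across the full machine, with the problem size chosen so that the matrix fills most of the available memory.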
TOP500 TRENDS

Although we started the TOP500 to provide statistics about the HPC market at specific dates, it became immediately clear that the list's inherent ability to systematically track the evolution of supercomputing made it far more valuable, revealing long-term trends that we could extrapolate and predict. Figure 1 shows performance values for the first and last systems as well as the average performance of all systems in the TOP500. Until 2008, these curves grew exponentially at a rate of 1.91 per year (multiplicative factor). Compared to the exponential growth rate of Moore's law at 1.59 per year, TOP500 system performance had an excess exponential growth rate of 1.20 per year. We suspected that this additional growth was driven by an increasing number of processor sockets in our system sample. (We use the term "processor sockets" to clearly differentiate processors from processor cores.)

To better understand this and other technological trends contained in the TOP500 data, we obtained a clean and uniform subsample of systems from each edition of the list by extracting the new systems, excluding those that used special processors with vastly different characteristics, including SIMD processors, vector processors, or accelerators (such as Nvidia GPUs and Intel Phi coprocessors).

FIGURE 2. Average number of processor sockets for new supercomputers in the TOP500, excluding systems with SIMD processors, vector processors, or accelerators. The exponential increase in the number of sockets up to 2008 accounts for the higher-than-expected growth rate in supercomputing performance during the same time period. (Log-scale y-axis: number of processor sockets, 10 to 10,000; x-axis: 1996 to 2014.)
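As a sanity check on the rates quoted above: annual growth factors compose multiplicatively, so the excess factor is simply the ratio of the observed factor to the Moore's-law factor. The short calculation below reproduces the numbers from the text and from Figure 1's endpoints; it is purely illustrative.

# Exponential growth factors compose multiplicatively:
# observed = technology x parallelism, so parallelism = observed / technology.
observed = 1.91  # TOP500 performance growth factor per year before 2008
moore = 1.59     # Moore's-law growth factor per year cited in the text
print(f"excess factor: {observed / moore:.2f}x per year")  # -> 1.20x

# Average No. 1 growth computed from Figure 1's endpoints:
# 59.7 Gflops (1993) to 33.9 Pflops (2015).
factor = (33.9e15 / 59.7e9) ** (1.0 / (2015 - 1993))
print(f"average No. 1 growth, 1993-2015: {factor:.2f}x per year")  # -> ~1.83x

The averaged factor of about 1.83 falls below the pre-2008 rate of 1.91 because, as Figure 1 shows, growth slowed in the list's later years.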