
Novel Architectures

Editors: Volodymyr Kindratenko, [email protected]; Pedro Trancoso, [email protected]

Trends in High-Performance Computing

By Volodymyr Kindratenko and Pedro Trancoso

HPC system architectures are shifting from the traditional clusters of homogeneous nodes to clusters of heterogeneous nodes and accelerators.

We can infer the future of high-performance computing (HPC) from the technologies developed today to showcase the leadership-class compute systems—that is, the supercomputers. These machines are usually designed to achieve the highest possible performance in terms of the number of 64-bit floating-point operations per second (flops). Their architecture has evolved from early custom design systems to the current clusters of commodity multisocket, multicore systems. Twice a year, the supercomputing community ranks the systems and produces the Top-500 list (www.top500.org), which shows the world's 500 highest performing machines. As we now describe, the technologies used in the top-ranked machines give a good indication of the architecture trends.

The Top-500

The November 2010 Top-500 list of the world's most powerful supercomputers stressed two noticeable developments in HPC: the advent of HPC clusters based on graphical processing units (GPUs) as the dominant petascale architecture, and the rise of China as a dominant player in the supercomputing arena.

To build more powerful machines, system architects are moving away from traditional clusters of homogeneous nodes to clusters of heterogeneous nodes (CPU+GPU). As the sidebar "Architecture of Current Petaflops Systems" describes, three out of seven systems that achieved over one quadrillion flops (petaflops) on the standard Linpack tool (www.top500.org/project/linpack) are Nvidia GPU-based, and two out of the three largest systems are deployed in China.

The number one system on the November 2010 Top-500 list is Tianhe-1A, deployed at the National Supercomputing Center in Tianjin. It achieves about 2.57 petaflops on Linpack with a theoretical peak performance of 4.7 petaflops. The number three system is Nebulae, deployed at the National Supercomputing Centre in Shenzhen. It achieves 1.27 petaflops on the Linpack benchmark with a theoretical peak of almost 3 petaflops. The highest performing US system, Jaguar, deployed at the Oak Ridge National Laboratory, is number two on the Top-500 list, achieving 1.75 petaflops with a theoretical performance of 2.3 petaflops. The number four system on the November 2010 Top-500 list is the Tsubame 2.0 system, deployed in Japan, which is also GPU-based. Europe's most powerful system is the Tera-100, which is ranked at number six and deployed at the French Atomic and Alternative Energies Commission.

The Tianhe-1A is composed of 7,168 nodes, each containing two Intel Xeon X5670 hex-core (Westmere) processors and one Nvidia Tesla M2050 GPU. The Jaguar is a more traditional Cray XT system consisting of 18,688 compute nodes containing dual hex-core AMD Opteron 2435 (Istanbul) processors. The highest-performing GPU-based US-built system, deployed at Lawrence Livermore National Laboratory, is number 72 on the Top-500 list. We've yet to see any substantially larger GPU-based HPC systems deployed in the US. Instead, we see many research centers deploying small and mid-range GPU systems, such as the US National Science Foundation (NSF) Track 2D Keeneland system, which is in its initial deployment phase at the Georgia Institute of Technology (number 117 on the Top-500 list), or the Forge GPU HPC cluster currently being developed at the US National Center for Supercomputing Applications at the University of Illinois. The Keeneland project in particular is interesting in that a part of the effort is devoted to developing the software infrastructure necessary to utilize the GPUs in an HPC environment. It's also NSF's first system funded under the Track 2 experimental/innovative design program.


Architecture of Current Petaflops Systems

The November 2010 Top-500 list included seven petaflops systems:

1. Tianhe-1A, deployed at the National Supercomputing Center in Tianjin, China, achieves about 2.57 petaflops on Linpack with a theoretical peak performance of 4.7 petaflops. Tianhe-1A is composed of 7,168 nodes, each containing two Intel Xeon EM64 X5670 six-core (Westmere) processors and one Nvidia Tesla M2050 GPU. The system uses a proprietary interconnect.

2. Jaguar, deployed at the Oak Ridge National Laboratory, US, achieves 1.75 petaflops with a theoretical performance of 2.3 petaflops. Jaguar is a Cray XT system consisting of 18,688 compute nodes containing dual six-core AMD Opteron 2435 (Istanbul) processors, or 224,256 processor cores total. The system uses a proprietary interconnect.

3. Nebulae, deployed at the National Supercomputing Centre in Shenzhen, China, achieves 1.27 petaflops on the Linpack benchmark with a theoretical peak of nearly 3 petaflops. Nebulae is composed of about 4,700 compute nodes containing Intel EM64 Xeon six-core X5650 (Westmere) processors, or 120,640 processor cores total, and Nvidia Tesla C2050 GPUs. The system uses a Quad Data Rate Infiniband interconnect.

4. Tsubame 2.0, deployed at the Tokyo Institute of Technology, Japan, achieves 1.19 petaflops on the Linpack benchmark with a theoretical peak of about 2.3 petaflops. It consists of 1,442 compute nodes containing (mostly) six-core Intel Xeon X5670 (Westmere) CPUs, or 73,278 processor cores in total, and three Nvidia Tesla M2050 GPUs per node. The system uses a Quad Data Rate Infiniband interconnect.

5. Hopper, deployed at the US National Energy Research Scientific Computing Center, achieves just over a petaflops on Linpack, with a theoretical peak of almost 1.3 petaflops. Hopper is a Cray XE6 system consisting of 6,384 nodes containing 12-core AMD Opteron processors, or 153,216 processor cores in total. The system uses a custom interconnect.

6. Tera-100, deployed at the Alternative Energies and Atomic Energy Commission, France, achieves 1.05 petaflops on Linpack with a theoretical peak of 1.25 petaflops. The system consists of 4,300 bullx S series server nodes based on eight-core Intel EM64 Xeon 7500 (Nehalem) processors, or 138,368 processor cores total. The system uses a Quad Data Rate Infiniband interconnect.

7. Roadrunner, deployed at Los Alamos National Laboratory, US, was the first supercomputer in the world to achieve over one petaflops. It currently stands at 1.04 petaflops on Linpack, with a theoretical peak of 1.38 petaflops. Its main workhorse is a nine-core IBM PowerXCell 8i experimental chip. The system consists of 12,960 PowerXCell 8is and 6,480 dual-core AMD Opterons, or 122,400 processor cores total, and uses a Voltaire Infiniband interconnect.
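To show how such peak figures follow from the node counts in the sidebar, the following sketch recomputes Tianhe-1A's theoretical peak from its components. The per-chip ratings used here (a 2.93-GHz Xeon X5670 delivering four double-precision flops per cycle per core, and 515 double-precision gigaflops for a Tesla M2050) are our own assumed reference values rather than figures taken from the Top-500 submission.

# Back-of-the-envelope check of Tianhe-1A's theoretical peak (double precision).
# Assumed per-chip figures: Xeon X5670 at 2.93 GHz, 6 cores, 4 DP flops/cycle/core;
# Tesla M2050 rated at 515 DP gigaflops. These are illustrative assumptions.

NODES = 7168
CPUS_PER_NODE = 2
GPUS_PER_NODE = 1

xeon_x5670_gflops = 2.93 * 6 * 4           # ~70.3 gigaflops per CPU
tesla_m2050_gflops = 515.0                 # gigaflops per GPU

cpu_peak = NODES * CPUS_PER_NODE * xeon_x5670_gflops    # ~1.0 petaflops
gpu_peak = NODES * GPUS_PER_NODE * tesla_m2050_gflops   # ~3.7 petaflops

total_petaflops = (cpu_peak + gpu_peak) / 1e6
print(f"CPU contribution: {cpu_peak / 1e6:.2f} PF")
print(f"GPU contribution: {gpu_peak / 1e6:.2f} PF")
print(f"Estimated peak:   {total_petaflops:.2f} PF")   # ~4.7 PF, matching the list

The breakdown also makes plain why a heterogeneous node is attractive on paper: under these assumed ratings, the single GPU contributes several times the peak of the two host CPUs combined.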

The Chinese HPC community's approach to designing high-end systems is certainly worth noting. In terms of raw performance, adding a GPU to a conventional HPC cluster node can quadruple its peak performance, or even increase it by an order of magnitude when using 32-bit arithmetic. But the increased peak performance doesn't necessarily translate into sustained application performance. As an example, take Tianhe-1A, with sustained 2.57 petaflops on Linpack and a theoretical peak performance of 4.7 petaflops. Its efficiency in terms of sustained versus peak performance is 2.57/4.7 = 0.55. Jaguar, on the other hand, achieves 1.75 petaflops on Linpack with a theoretical performance of 2.3 petaflops. Thus, its efficiency is 1.75/2.3 = 0.76, or 38 percent above Tianhe-1A. The difference is even more pronounced when considering real scientific workloads.

Also, software availability for large-scale GPU-based systems is limited. While numerous applications have been ported to GPU-based systems over the past two years, many of the widely used scientific supercomputing codes developed over the past 20 years have yet to be rewritten for GPUs. Many such applications have outlived several generations of supercomputer architectures; it would be irrational to scrap all the prior work each time a new and drastically different architecture is presented, even if its performance is significantly higher. Not surprisingly, we have yet to see any applications that can take full advantage of Tianhe-1A's computing power.

The Green and Graph Lists

As the "Metrics and Benchmarks" sidebar describes, organizations use various metrics and applications to evaluate computing systems. In the past, supercomputers were developed for the single highest performance goal (as measured by Linpack), but current system development is driven by two additional factors: power consumption and complex applications. Traditional approaches to achieving performance have reached an unbearable power consumption cost. As such, the community has proposed a second list—the Green-500 list (www.green500.org)—that ranks systems according to their performance/power metric. The Green-500 list sorts systems from the Top-500 list by their power efficiency; it also shows the trend toward heterogeneous systems.
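Returning to the sustained-versus-peak comparison above, here is a minimal sketch of that arithmetic, using only the Linpack (Rmax) and theoretical peak (Rpeak) figures already quoted in the text.

# Linpack efficiency = sustained (Rmax) / theoretical peak (Rpeak),
# using the petaflops figures quoted in the text.

systems = {
    "Tianhe-1A": (2.57, 4.7),   # (Rmax, Rpeak) in petaflops
    "Jaguar":    (1.75, 2.3),
}

efficiency = {name: rmax / rpeak for name, (rmax, rpeak) in systems.items()}
for name, eff in efficiency.items():
    print(f"{name}: {eff:.2f}")            # Tianhe-1A ~0.55, Jaguar ~0.76

# Compare the rounded efficiencies, as in the text: 0.76/0.55 is about 38 percent higher.
relative = round(efficiency["Jaguar"], 2) / round(efficiency["Tianhe-1A"], 2) - 1
print(f"Jaguar is about {relative:.0%} more efficient on Linpack")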



Metrics and Benchmarks

Supercomputer systems are evaluated according to different metrics using different applications. We show here the three most relevant ranking lists and present the benchmarks and metrics they use to rank systems.

The Top-500
To rank systems, the Top-500 uses the Linpack benchmark program to solve a dense system of linear equations using the LU decomposition method. Because this application is regular, it performs well and thus gives a good indication of the system's peak performance. The metric used to rank the systems in the Top-500 list is the number of floating-point operations per second (flops), which is measured from the execution of the Linpack benchmark.

The Green-500
The application used to rank the Green-500 list is also Linpack. In this case, we measure both performance and power consumption for executing the benchmark on the system. The list thus ranks systems by flops per watt, where the watts are the total power consumption for the program's execution.

The Graph-500
A graph-based application ranks the systems for this list. The authors are planning to use three versions—one for shared-memory systems, one for distributed-memory systems, and one for cloud systems using the map-reduce model. This benchmark represents 3D physics simulation applications. In contrast to Linpack, which is a compute-intensive application, this graph-based application is data intensive. It's also composed of two kernels: one for building a graph out of the raw data and another to operate on the graph. The structure is a weighted undirected graph; the second kernel operates on it based on a breadth-first search (BFS) algorithm. The metric used to rank this list's systems—the number of traversed edges per second—maps closely to the application domain.
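To make the Graph-500 metric concrete, here is a toy sketch of the two kernels just described: building an undirected graph from raw edge data and running a breadth-first search, with a traversed-edges-per-second (TEPS) figure derived from the timed search. This is not the official Graph-500 reference code; the function names and the tiny edge list are illustrative, and edge weights are omitted because BFS ignores them.

import time
from collections import defaultdict, deque

def build_graph(edges):
    """Kernel 1: turn a raw edge list into an adjacency structure."""
    graph = defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
        graph[v].append(u)           # undirected graph
    return graph

def bfs(graph, root):
    """Kernel 2: breadth-first search; returns the number of edges traversed."""
    visited = {root}
    frontier = deque([root])
    traversed = 0
    while frontier:
        u = frontier.popleft()
        for v in graph[u]:
            traversed += 1           # every scanned edge counts as traversed
            if v not in visited:
                visited.add(v)
                frontier.append(v)
    return traversed

# A tiny synthetic edge list stands in for the benchmark's generated graph.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 5), (1, 5)]
graph = build_graph(edges)

start = time.perf_counter()
traversed = bfs(graph, root=0)
elapsed = time.perf_counter() - start
print(f"traversed edges: {traversed}, TEPS: {traversed / elapsed:.2e}")

Unlike Linpack's dense, regular floating-point work, the search kernel is dominated by irregular memory accesses, which is why the two lists can rank the same machines very differently.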

Upcoming 10+ Petaflops System Architectures

Based on recent announcements, we've identified several upcoming systems with aggregated peak performance of more than 10 petaflops:

• Sequoia, an IBM Blue Gene/Q-based 20-petaflops system at Lawrence Livermore National Laboratory, US;
• Titan, a GPU-based 20-petaflops system at Oak Ridge National Laboratory, US;
• Blue Waters, an IBM Power7-based 10-petaflops system at the US National Center for Supercomputing Applications;
• Pleiades, an SGI ICE system deployed at NASA, US, which is expected to be upgraded to reach a 10-petaflops peak; and
• K supercomputer, a Sparc64-based 10-petaflops system at the Riken Research Institute, Japan.

Four of the top 10 systems are GPU-based (both Nvidia and ATI/AMD). Furthermore, the Green-500 list's top system is the IBM Blue Gene/Q, which is built from simple, power-efficient processors originally developed for embedded systems. These processors also include several task-specific acceleration engines.

The Linpack application used to benchmark the Top-500 list systems is somewhat unrealistic when compared to the real-world applications executed on real systems. Linpack itself isn't an all-inclusive performance measure because it measures the speed of performing floating-point operations—certainly a crucial component of most computation—but ignores system characteristics such as data transfer speed between memory, CPU, disk, and compute nodes, which is critical for most scientific applications to achieve high, sustained performance.

To take that into account, the community has proposed using graph-based applications to evaluate the best supercomputers for complex data-intensive applications. With the exception of the number one system, which is a BlueGene/P architecture composed of power-efficient processors, the top 10 positions in the Graph-500 list (www.graph500.org) are systems composed of the most powerful single-chip multicore processors. This shows that the most demanding HPC applications still need high-performance multicore processors.

Future Trends

No doubt, we'll see more GPU-based large-scale systems. However, as the sidebar "Upcoming 10+ Petaflops System Architectures" shows, most of the upcoming very large-scale systems actually aren't based on GPUs.

In the US, two 20-petaflops systems are currently under construction: the Sequoia system being built at Lawrence Livermore National Laboratory and the Titan system to be deployed at Oak Ridge National Laboratory. Sequoia is based on the IBM Blue Gene/Q architecture, whereas Titan will include GPUs. Blue Waters, which is expected to go online at the US National Center for Supercomputing Applications this year, will be the first system with a peak performance of 10 petaflops.


Also, NASA's Pleiades system is expected to be upgraded to reach the 10-petaflops peak in 2012. Blue Waters is based on IBM's Power7 architecture, and Pleiades is an SGI ICE architecture.

Japan's first 10-petaflops system, the Sparc-based K supercomputer, will become operational in 2012. In Europe, a 3-petaflops system is being constructed in Germany, with expected operation by 2012 as well. The K supercomputer is based on the Fujitsu Sparc64 architecture; Europe's supercomputer, SuperMUC, is based on Intel Xeon processors running in IBM iDataPlex servers. The forthcoming Dawning 6000 supercomputer under development in China will provide one petaflops. Although not the fastest system, Dawning 6000 will use the Chinese-made Loongson microprocessor.

Given current computing technology trends, projections indicate that we'll reach 1 exaflop (one quintillion flops) peak performance by 2019. But a growing number of scientists are starting to question this trend. Today's approaches to achieving the 10- to 20-petaflops range, coupled with decreasing transistor size approximately every two years, are likely to be scalable up to 100 petaflops. But moving forward past a 100-petaflops range will require developing new ways of computing.

Technologies currently used to produce integrated circuits don't scale in terms of power consumption; a 100-petaflops computer built with today's technology will require a nuclear power plant to supply the electricity needed to power and cool it. Some technologies currently under development—such as integrated CPU and GPU processors, large-scale multicore processors, 3D stacking technology, and very-low-power processors—can help. But without major breakthroughs in integrated circuit technologies, we're heading to a time when the cost of building and operating increasingly larger systems won't provide the payback to society necessary to justify the expense.

Volodymyr Kindratenko is a senior research scientist at the US National Center for Supercomputing Applications at the University of Illinois. His research interests include high-performance computing and special-purpose computing architectures. Kindratenko has a DSc in analytical chemistry from the University of Antwerp. He is a senior member of the IEEE and the ACM. Contact him at [email protected].

Pedro Trancoso is an assistant professor in the Department of Computer Science at the University of Cyprus. His research interests include computer architecture, multicore architectures, memory hierarchy, parallel programming models, database workloads, and high-performance computing. Trancoso has a PhD in computer science from the University of Illinois at Urbana-Champaign. He is a member of the IEEE, the IEEE Computer Society, and the ACM. Contact him at [email protected].


