Trends in High-Performance Computing
NOVEL ARCHITECTURES
Editors: Volodymyr Kindratenko, [email protected]; Pedro Trancoso, [email protected]

By Volodymyr Kindratenko and Pedro Trancoso

HPC system architectures are shifting from the traditional clusters of homogeneous nodes to clusters of heterogeneous nodes and accelerators.

We can infer the future of high-performance computing (HPC) from the technologies developed today to showcase the leadership-class compute systems—that is, the supercomputers. These machines are usually designed to achieve the highest possible performance in terms of the number of 64-bit floating-point operations per second (flops). Their architecture has evolved from early custom design systems to the current clusters of commodity multisocket, multicore systems. Twice a year, the supercomputing community ranks the systems and produces the Top-500 list (www.top500.org), which shows the world's 500 highest performing machines. As we now describe, the technologies used in the top-ranked machines give a good indication of the architecture trends.

The Top-500

The November 2010 Top-500 list of the world's most powerful supercomputers stressed two noticeable developments in HPC: the advent of HPC clusters based on graphical processing units (GPUs) as the dominant petascale architecture, and the rise of China as a dominant player in the supercomputing arena.

To build more powerful machines, system architects are moving away from traditional clusters of homogeneous nodes to clusters of heterogeneous nodes (CPU+GPU). As the sidebar "Architecture of Current Petaflops Systems" describes, three out of seven systems that achieved over one quadrillion flops (petaflops) on the standard benchmark tool Linpack (www.top500.org/project/linpack) are Nvidia GPU-based, and two out of the three largest systems are deployed in China.

The number one system on the November 2010 Top-500 list is Tianhe-1A, deployed at the National Supercomputing Center in Tianjin. It achieves about 2.57 petaflops on Linpack with a theoretical peak performance of 4.7 petaflops. The number three system is Nebulae, deployed at the National Supercomputing Centre in Shenzhen. It achieves 1.27 petaflops on the Linpack benchmark with a theoretical peak of almost 3 petaflops. The highest performing US system, Jaguar, deployed at the Oak Ridge National Laboratory, is number two on the Top-500 list, achieving 1.75 petaflops with theoretical performance of 2.3 petaflops. The number four system on the November 2010 Top-500 list is the Japan-built Tsubame 2.0 system, which is a GPU-based system. Europe's most powerful system is the Tera-100, which is ranked at number six and deployed at the French Atomic and Alternative Energies Commission.

The Tianhe-1A is composed of 7,168 nodes—each containing two Intel Xeon X5670 hex-core (Westmere) processors and one Nvidia Tesla M2050 GPU. The Jaguar is a more traditional Cray XT system consisting of 18,688 compute nodes containing dual hex-core AMD Opteron 2435 (Istanbul) processors. The highest-performing GPU-based US-built system, deployed at Lawrence Livermore National Laboratory, is number 72 on the Top-500 list. We've yet to see any substantially larger GPU-based HPC systems deployed in the US. Instead, we see many research centers deploying small and mid-range GPU systems, such as the US National Science Foundation (NSF) Track 2D Keeneland system, which is in its initial deployment phase at the Georgia Institute of Technology (number 117 on the Top-500 list), or the Forge GPU HPC cluster currently being developed at the US National Center for Supercomputing Applications at the University of Illinois. The Keeneland project in particular is interesting in that a part of the effort is devoted to developing the software infrastructure necessary to utilize the GPUs in an HPC environment. It's also NSF's first system funded under the Track 2 experimental/innovative design program.

The Chinese HPC community's approach to designing high-end systems is certainly worth noting. In terms of raw performance, adding a GPU to a conventional HPC cluster node can quadruple its peak performance, or even increase it by an order of magnitude when using 32-bit arithmetic. But the increased peak performance doesn't necessarily translate into sustained application performance. As an example, take Tianhe-1A, with sustained 2.57 petaflops on Linpack and theoretical peak performance of 4.7 petaflops. Its efficiency in terms of sustained versus peak performance is 2.57/4.7 = 0.55. Jaguar, on the other hand, achieves 1.75 petaflops on Linpack with theoretical performance of 2.3 petaflops. Thus, its efficiency is 1.75/2.3 = 0.76, or 38 percent above Tianhe-1A. The difference is even more pronounced when considering real scientific workloads.

Also, software availability for large-scale GPU-based systems is limited. While numerous applications have been ported to GPU-based systems over the past two years, many of the widely used scientific supercomputing codes that have been developed over the past 20 years have yet to be rewritten for GPUs. Many such applications have outlived several generations of computer architectures; it would be irrational to scrap all the prior work each time a new and drastically different architecture is presented, even if its performance is significantly higher. Not surprisingly, we have yet to see any applications that can take advantage of Tianhe-1A's computing power.

The Green and Graph Lists

As the "Metrics and Benchmarks" sidebar describes, organizations use various metrics and applications to evaluate computing systems. In the past, supercomputers were developed for the single highest performance goal (as measured by Linpack), but current system development is driven by two additional factors: power consumption and complex applications. Traditional approaches to achieving performance have reached an unbearable power consumption cost. As such, the community has proposed a second list—the Green-500 list (www.green500.org)—that ranks systems according to their performance/power metric. The Green-500 list sorts systems from the Top-500 list by their power efficiency; it also shows the trend toward heterogeneous …

Sidebar: Architecture of Current Petaflops Systems

The November 2010 Top-500 list included seven petaflops systems:

1. Tianhe-1A, deployed at the National Supercomputing Center in Tianjin, China, achieves about 2.57 petaflops on Linpack with a theoretical peak performance of 4.7 petaflops. Tianhe-1A is composed of 7,168 nodes, each containing two Intel Xeon EM64 X5670 six-core (Westmere) processors and one Nvidia Tesla M2050 GPU. The system uses a proprietary interconnect.

2. Jaguar, deployed at the Oak Ridge National Laboratory, US, achieves 1.75 petaflops with theoretical performance of 2.3 petaflops. Jaguar is a Cray XT system consisting of 18,688 compute nodes containing dual six-core AMD Opteron 2435 (Istanbul) processors, or 224,256 processor cores total. The system uses a proprietary interconnect.

3. Nebulae, deployed at the National Supercomputing Centre in Shenzhen, China, achieves 1.27 petaflops on the Linpack benchmark with a theoretical peak of nearly 3 petaflops. Nebulae is composed of about 4,700 compute nodes containing Intel EM64 Xeon six-core X5650 (Westmere) processors, or 120,640 processor cores total, and Nvidia Tesla C2050 GPUs. The system uses a Quad Data Rate Infiniband interconnect.

4. Tsubame 2.0, deployed at the Tokyo Institute of Technology, Japan, achieves 1.19 petaflops on the Linpack benchmark with a theoretical peak of about 2.3 petaflops. It consists of 1,442 compute nodes containing (mostly) six-core Intel Xeon X5670 (Westmere) CPUs, or 73,278 processor cores in total, and three Nvidia Tesla M2050 GPUs per node. The system uses a Quad Data Rate Infiniband interconnect.

5. Hopper, deployed at the US National Energy Research Scientific Computing Center, achieves just over a petaflops on Linpack, with a theoretical peak of almost 1.3 petaflops. Hopper is a Cray XE6 system consisting of 6,384 nodes containing 12-core AMD Opteron processors, or 153,216 processor cores in total. The system uses a custom interconnect.

6. Tera-100, deployed at the Alternative Energies and Atomic Energy Commission, France, achieves 1.05 petaflops on Linpack with a theoretical peak of 1.25 petaflops. The system consists of 4,300 bullx S series server nodes based on eight-core Intel EM64 Xeon 7500 (Nehalem) processors, or 138,368 processor cores total. The system uses a Quad Data Rate Infiniband interconnect.

7. Roadrunner, deployed at Los Alamos National Laboratory, US, was the first supercomputer in the world to achieve over one petaflops. It currently stands at 1.04 petaflops on Linpack, with a theoretical peak of 1.38 petaflops. Its main workhorse is the nine-core IBM PowerXCell 8i experimental chip. The system consists of 12,960 PowerXCell 8is and 6,480 dual-core AMD Opterons, or 122,400 processor cores total, and uses a Voltaire Infiniband interconnect.

Sidebar: Metrics and Benchmarks

Supercomputer systems are evaluated according to … power consumption for executing the benchmark on the system. It thus uses flops per watt (total power consumption for the program's execution).

Copublished by the IEEE CS and the AIP, 1521-9615/11/$26.00 © 2011 IEEE. Computing in Science & Engineering, May/June 2011.
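The sustained-versus-peak comparison in the article is simple arithmetic, and it can be handy to reproduce when comparing other systems. A minimal sketch, using only the Linpack and peak figures quoted above (the `efficiency` helper is ours, not part of any benchmark tool):

```python
# Sustained (Linpack) vs. theoretical peak performance, in petaflops,
# for the two systems compared in the article.
systems = {
    "Tianhe-1A": (2.57, 4.7),   # (sustained, peak)
    "Jaguar":    (1.75, 2.3),
}

def efficiency(sustained, peak):
    """Fraction of the theoretical peak actually achieved on Linpack."""
    return sustained / peak

eff = {name: round(efficiency(s, p), 2) for name, (s, p) in systems.items()}
print(eff)            # {'Tianhe-1A': 0.55, 'Jaguar': 0.76}

# Jaguar's relative advantage over Tianhe-1A, from the rounded ratios:
gain = eff["Jaguar"] / eff["Tianhe-1A"] - 1
print(f"{gain:.0%}")  # 38%
```

With the rounded efficiencies this matches the 38 percent quoted in the article; the unrounded ratio comes out slightly higher, closer to 39 percent.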
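Since the Green-500 simply re-sorts Top-500 entries by a performance/power ratio, the ordering it describes can be sketched in a few lines. The system names and wattages below are invented placeholders for illustration, not actual Green-500 data:

```python
# Order systems by Linpack flops per watt -- the metric the Green-500
# list uses to re-rank Top-500 entries by power efficiency.
# NOTE: the names and wattages below are illustrative placeholders.
systems = [
    # (name, sustained petaflops on Linpack, total power in megawatts)
    ("System A", 2.57, 4.0),
    ("System B", 1.75, 7.0),
    ("System C", 1.04, 2.3),
]

def mflops_per_watt(petaflops, megawatts):
    # 1 petaflops = 1e9 megaflops; 1 megawatt = 1e6 watts
    return petaflops * 1e9 / (megawatts * 1e6)

ranked = sorted(systems, key=lambda s: mflops_per_watt(s[1], s[2]), reverse=True)
for name, pflops, mw in ranked:
    print(f"{name}: {mflops_per_watt(pflops, mw):.0f} Mflops/W")
```

Note that the ranking can differ sharply from the raw Top-500 order: the smallest machine here ranks second on efficiency despite the lowest Linpack score.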