K-Computer
Xinya (Leah) Zhao, Abdulahi Abu

Outline

• History of Supercomputing
• K-Computer Architecture
• Programming Environment
• Performance Evaluation
• What's next?

Timeline of Supercomputing

Control Data Corporation (1960s)
• CDC 1604 (1960): First solid-state computer
• CDC 6600 (1964): 100 computers were sold at $8 million each
• Gained speed by "farming out" work to peripheral computing elements, freeing the CPU to process actual data
• STAR-100: First to use vector processing

The Cray Era (mid-1970s - 1980s)
• 80 MHz Cray-1 (1976): The most successful supercomputer in history
• Introduced chaining, in which scalar and vector registers generate interim results
• The Cray-2 (1985): No chaining, and high memory latency with deep pipelining

Massive Processing (1990s)
• SR2201 (1996): Used 2048 processors connected via a fast three-dimensional crossbar network
• ASCI Red: Mesh-based MIMD massively parallel system with over 9,000 compute nodes and well over 12 terabytes of disk storage
• ASCI Red was the first ever to break through the 1 teraflop barrier

Petaflop Computing (21st century)
• Realization: The power of a large number of small processors can be harnessed to achieve high performance
• IBM Blue Gene architecture: Trades processor speed for low power consumption, so a large number of processors can be used at air-cooled temperatures
• (2011): K Computer is #1, the fastest in the world!

Why K Computer?

Purpose:
• Perform extremely complex mathematical or scientific calculations (e.g., modelling the changes in climate over millions of years)
• Areas of application: manufacturing, aerospace, biotechnology, and disaster prevention

K Computer Initiative:
• Part of the High-Performance Computing Infrastructure (HPCI) initiative led by Japan's Ministry of Education, Culture, Sports, Science and Technology (MEXT)
• Developed by RIKEN and Fujitsu
• Completed in August 2011

Specs

CPU Specs

• SPARC64 VIIIfx (Venus)
  o Reuses part of the SPARC64 VII architecture
• 8 cores at 2.0 GHz
• L1 cache = 32 KB, L2 cache = 6 MB (shared)
  o Software- or hardware-controlled cache through the HPC extension
• High performance with low power
  o 128 GFLOPS
  o 64 GB/s memory throughput
  o 58 W (typical, 30°C)
• Hardware barrier for fast inter-core sync (a software analogue is sketched after the HPC Extension list below)

HPC Extension

• HPC-ACE: Fujitsu's unique ISA extension
• Large register sets
  o 160 -> 192 INT registers, 32 -> 256 FP registers (DP)
  o Extract more parallelism
  o Reduce spill/fill overhead
• SIMD enabled
• Software-controlled cache
  o Goal is to optimize performance while keeping cache coherency
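The hardware barrier above is what makes synchronizing all eight cores cheap enough to treat them as one fast CPU (as VISIMPACT does, covered later). As a rough software analogue, not Fujitsu's actual mechanism, here is a minimal C sketch with POSIX threads in which eight workers rendezvous at a barrier between two loop phases:

```c
#include <pthread.h>
#include <stdio.h>

#define NCORES 8          /* one worker per SPARC64 VIIIfx core */
#define N      1024

static pthread_barrier_t barrier;
static double data[NCORES][N];

/* Each worker fills its slice, then waits at the barrier so that
 * phase 2 never starts before every core has finished phase 1. */
static void *worker(void *arg)
{
    long id = (long)arg;

    for (int i = 0; i < N; i++)          /* phase 1: produce */
        data[id][i] = id + i * 0.5;

    pthread_barrier_wait(&barrier);      /* inter-core sync point */

    double sum = 0.0;                    /* phase 2: read a neighbour's slice */
    for (int i = 0; i < N; i++)
        sum += data[(id + 1) % NCORES][i];
    printf("core %ld: neighbour sum = %f\n", id, sum);
    return NULL;
}

int main(void)
{
    pthread_t t[NCORES];
    pthread_barrier_init(&barrier, NULL, NCORES);
    for (long i = 0; i < NCORES; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NCORES; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}
```

On K, the dedicated hardware makes this rendezvous far cheaper than a memory-based barrier like the one shown.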

K Computer Packaging

[Packaging diagram labels: IO system board, service processor board, system board, power supply, cooling tubes, system disk]

"Slanted Implementation" Supports Cooling and High Density ƒ 24 system boards ƒ 6 IO boards System Configuration Network Architecture Tofu: 6D mesh/torus Interconnect Architecutre • High communication performance • High system scalability • High fault-tolerance

Main Architecture Composition
• Node Construction
• Network Construction
• Routing function

Why 6 dimensions?

Node Construction

• Single CPU and single interconnect controller
  o Compute nodes: 80,000
  o Number of cores: 640,000
• Memory: 1 PB (16 GB/node)
• 10 links for inter-node connection (see the count sketched below)
• 10 GB/s per link
• Routing chip structure
• No outside switch
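The 10 links per node fall out of the 6D shape. In the published Tofu design the x, y, and z dimensions are large tori (two neighbours each), while a, b, and c are small rings of sizes 2, 3, and 2; in a ring of size 2 the +1 and -1 neighbours coincide, so a and c contribute one link each. A minimal C sketch of this count (the x/y/z extents are illustrative placeholders, not the production dimensions):

```c
#include <stdio.h>

/* Tofu coordinate: x,y,z form large tori; a,b,c are small rings
 * of size 2, 3 and 2 (per the published Tofu design). */
#define DIMS 6
static const int extent[DIMS] = {17, 17, 17, 2, 3, 2}; /* x,y,z are placeholders */
static const char *name[DIMS] = {"x", "y", "z", "a", "b", "c"};

int main(void)
{
    int links = 0;
    for (int d = 0; d < DIMS; d++) {
        /* In a ring of size 2, the +1 and -1 hops reach the same
         * node, so the dimension contributes one link, not two. */
        int n = (extent[d] == 2) ? 1 : 2;
        printf("%s (size %d): %d neighbour(s)\n", name[d], extent[d], n);
        links += n;
    }
    printf("links per node: %d\n", links);  /* prints 10 */
    return 0;
}
```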

Network Construction

Routing Algorithm

Extended dimension-order routing
• abc to xyz to abc
• The first abc traversal is path selection

Example routing
• Routing from (x=0, y=0, z=0, a=0, b=0, c=0) to (3, 2, 1, 1, 1, 1)
• Traverses: +b, +a, +3x, +2y, +z, +c (computed in the sketch below)
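A minimal C sketch of the extended dimension-order traversal, with the hop order (b, a for path selection, then x, y, z in dimension order, then c) inferred from the example above; treat it as an illustration of the idea rather than the exact Tofu routing logic, and note that torus wrap-around is omitted:

```c
#include <stdio.h>

/* Coordinate array layout: x,y,z (large tori), then a,b,c (small rings). */
static const char *axis[6] = {"x", "y", "z", "a", "b", "c"};

/* Emit the hops from src to dst one dimension at a time.
 * Order inferred from the slide's example: b, a (path selection),
 * then x, y, z in dimension order, then c. */
static void route(const int src[6], const int dst[6])
{
    /* indices into the coordinate arrays: b=4, a=3, x=0, y=1, z=2, c=5 */
    static const int order[6] = {4, 3, 0, 1, 2, 5};
    for (int i = 0; i < 6; i++) {
        int d = order[i];
        int delta = dst[d] - src[d];   /* mesh-style offset; no wrap-around */
        for (int s = 0; s < (delta > 0 ? delta : -delta); s++)
            printf("%c%s ", delta > 0 ? '+' : '-', axis[d]);
    }
    printf("\n");
}

int main(void)
{
    int src[6] = {0, 0, 0, 0, 0, 0};
    int dst[6] = {3, 2, 1, 1, 1, 1};
    route(src, dst);   /* prints: +b +a +x +x +x +y +y +z +c */
    return 0;
}
```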

Where is the K Computer?

Cooling System

Computer Floor

Software Structure

Programming Environment

• Major problems
  o Too many processors to manipulate
  o Very little opportunity for coarse-grain parallelism
  o Scheduling on processors
• Solutions on the K computer
  o VISIMPACT
  o Automatic parallelization facility makes multiple cores into one high-speed core
  o Exploit fine-grain parallelism
• Open Petascale Libraries
  o Development platform for applications on petascale-class systems

VISIMPACT

• What
  o Hybrid execution model (MPI + threading between cores; sketched below)
  o Treats the 8-core CPU as one high-speed CPU
• Why
  o Improve parallel efficiency and reduce memory impact
  o Make it easier to program on many cores
• How
  o Hardware barriers between cores
  o Shared L2 cache
  o Automatic parallelizing compiler

MPI

• Open MPI based
  o Added extension for the "Tofu" interconnect
• Goals for MPI on the K computer
  o High performance
    - High bandwidth
  o High scalability
    - The K computer will grow
• Dimension specific for each rank
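VISIMPACT's hybrid execution model pairs MPI between nodes with threading across the eight cores inside each node. On K the intra-node threading comes from the automatic parallelizing compiler; the hand-written MPI + OpenMP sketch below shows the same shape of program (array size and values are illustrative):

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

/* One MPI rank per node, one thread per core: the VISIMPACT-style
 * hybrid model. On K the threading is produced by the automatic
 * parallelizing compiler rather than written by hand. */
int main(int argc, char **argv)
{
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    static double a[N];
    double local = 0.0;

    /* Threading between cores: 8 threads share this node's slice. */
    #pragma omp parallel for reduction(+:local) num_threads(8)
    for (int i = 0; i < N; i++) {
        a[i] = (rank + 1) * 0.001 * i;
        local += a[i];
    }

    /* MPI between nodes: combine the per-node partial sums. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total over %d ranks = %e\n", nranks, total);

    MPI_Finalize();
    return 0;
}
```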

Performance (Nov 2010)

• Measured results
  o Max (LINPACK) = 48.03 TFLOPS
  o Peak = 52.22 TFLOPS
  o Efficiency = 92.0%
  o Power consumption = 57.96 kW
  o Performance per power = 828.7 MFLOPS/W
• Ranking
  o TOP500 = 170th
  o Green500 = 4th

Performance (June 2011)

• Measured results
  o Max (LINPACK) = 10.51 PFLOPS
  o Peak = 11.28 PFLOPS
  o Efficiency = 93.2%
  o Power consumption = 9.89 MW
  o Performance per power = 824.6 MFLOPS/W
• Ranking
  o TOP500 = 1st
  o Green500 = 2nd
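As a sanity check, the efficiency figures are simply measured LINPACK performance over theoretical peak; for the June 2011 run:

```latex
\text{Efficiency} = \frac{R_{\text{max}}}{R_{\text{peak}}}
                  = \frac{10.51~\text{PFLOPS}}{11.28~\text{PFLOPS}}
                  \approx 0.932 = 93.2\%
```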

CPU Performance

HPC Challenge 2011

K Computer Comparison

What's next?

• SPARC64 IXfx
  o 16 cores, 12 MB shared L2 cache, runs at 1.848 GHz
  o Peak performance of 236.5 GFLOPS and a power efficiency of more than 2 GFLOPS per watt (115 W per chip)
• PRIMEHPC FX10
  o Follow-up to the K computer
  o 20 petaflops, 6-dimensional torus interconnect, one SPARC processor per node
• RIKEN and Fujitsu will focus on developing and assessing system software for the large-scale K computer, aiming for system completion in June 2012

Questions?